Autocorrelation in Two Acts
A perennial difficulty in causal inference is “autocorrelation”. Things are like themselves. Things which are near other things are not independent observations, but are influenced by what is around them. Not accounting for this will cause you to spuriously reject the null hypothesis, and think things which aren’t, are.
In the difference-in-differences literature, the first paper I know of suggesting methods for correcting for this is Bertrand, Duflo, and Mullainathan’s appropriately named 2004 paper “How Much Should We Trust Differences-in-Differences Estimates?”. Difference-in-differences, commonly called diff-in-diff, looks at the change in some variable after exposure to an event. To put it concretely, imagine that different counties adopt a law (the event) which is hypothesized to have an effect on wages (the outcome). We are making the argument that, were it not for the law, an adopting county and a non-adopting county would have stayed on the same track, and that the event was unrelated to expectations of future growth. The figure below shows what we’re doing. The “pre-trends” are the same, so the difference from what happened elsewhere can be attributed to the event.
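To make the arithmetic concrete, here is a minimal two-by-two sketch in Python. The wage numbers and the two counties are made up purely for illustration; the point is only how the two differences are subtracted.

```python
# A minimal two-by-two difference-in-differences sketch with made-up numbers.
# County A adopts the law, county B does not; the treated county's wage change,
# minus the control county's wage change, is attributed to the law (assuming
# the two shared a pre-trend).
import pandas as pd

df = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "period": ["pre", "post", "pre", "post"],
    "wage":   [20.0, 24.0, 21.0, 23.0],
})

means = df.pivot_table(index="county", columns="period", values="wage")
change_treated = means.loc["A", "post"] - means.loc["A", "pre"]   # +4.0
change_control = means.loc["B", "post"] - means.loc["B", "pre"]   # +2.0
print(f"diff-in-diff estimate: {change_treated - change_control:+.1f}")  # +2.0
```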
This is conceptually similar to a regression-discontinuity design, but is more flexible. A regression discontinuity looks at the immediate response to a treatment, the “jump”, as it were. Diff-in-diff can account for policies having a gradual effect over time, which is natural and intuitive: a law might take a while to have an effect.
Measuring this is what gets us into trouble, though, and it is what Bertrand, Duflo, and Mullainathan set out to correct. The treated units, like counties and states, tend to have similar wage levels from one year to the next. If you take multiple measurements over time, you are spuriously inflating the sample size: the observations are not independent, and reveal less new information than their number suggests. It would be as if you took a number of samples, and then simulated more data to match the prior observations. Nothing new is being revealed. In their paper, they create a hypothetical study involving the effect of state-level laws on wages. These laws do not exist, though; they are simply random dates. Despite there being no actual laws, they reject the null hypothesis 45% of the time, when they should only be doing so 5% of the time.
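The flavor of their exercise is easy to reproduce. The sketch below simulates state wage panels which are serially correlated but contain no treatment effect at all, assigns placebo laws at random dates, and counts how often conventional standard errors declare an effect. The panel dimensions and the AR(1) coefficient are my own illustrative choices, not the paper’s CPS setup.

```python
# Monte Carlo in the spirit of Bertrand, Duflo, and Mullainathan (2004):
# placebo "laws" assigned at random dates to serially correlated state panels.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_states, n_years, rho, reps = 50, 20, 0.8, 200
rejections = 0

for _ in range(reps):
    # AR(1) wage series for each state -- no true treatment effect anywhere
    y = np.zeros((n_states, n_years))
    for t in range(1, n_years):
        y[:, t] = rho * y[:, t - 1] + rng.normal(size=n_states)

    df = pd.DataFrame({
        "state": np.repeat(np.arange(n_states), n_years),
        "year": np.tile(np.arange(n_years), n_states),
        "wage": y.ravel(),
    })

    # half the states "adopt" a fake law at a random year
    treated = rng.choice(n_states, n_states // 2, replace=False)
    law_year = rng.integers(5, 15, size=n_states)
    is_treated = np.isin(df["state"].to_numpy(), treated)
    after_law = df["year"].to_numpy() >= law_year[df["state"].to_numpy()]
    df["post_law"] = (is_treated & after_law).astype(int)

    # two-way fixed effects with conventional (per-observation) standard errors
    fit = smf.ols("wage ~ post_law + C(state) + C(year)", data=df).fit()
    rejections += fit.pvalues["post_law"] < 0.05

print(f"false positive rate: {rejections / reps:.0%}")  # well above the nominal 5%
```

The rejection rate lands well above the nominal 5%, though the exact figure depends on how much serial correlation you build in.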
They suggest two main methods for correcting this. First, you can take the average of all pre-treatment observations and of all post-treatment observations, and treat each as one big observation. You lose quite a lot of detail in doing this, though. If there is a definite change in trend which takes a while to kick in, you would be better served observing the end, and the end alone. You also lose the ability to use events where the laws are passed at different times, because there is no longer a single “before” and “after” period shared by all the units. The other method is the “block bootstrap”. Bootstrapping is essentially resampling with replacement from our data set, and the “block” refers to keeping the observations from a given unit together, so that each unit’s serial correlation is preserved in every draw.
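A minimal sketch of the block bootstrap, assuming a panel dataframe with state, year, wage, and post_law columns like the one simulated above (the two-way fixed effects specification is again an illustrative choice, not the paper’s exact one):

```python
# Block bootstrap sketch: resample whole states (blocks) with replacement,
# so each draw keeps a state's serially correlated history intact.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def did_estimate(data):
    """Two-way fixed effects coefficient on the law dummy."""
    fit = smf.ols("wage ~ post_law + C(state) + C(year)", data=data).fit()
    return fit.params["post_law"]

def block_bootstrap_se(df, n_boot=200, seed=0):
    """Standard error from resampling whole states with replacement."""
    rng = np.random.default_rng(seed)
    states = df["state"].unique()
    draws = []
    for _ in range(n_boot):
        sampled = rng.choice(states, size=len(states), replace=True)
        # relabel the draws so a state picked twice enters as two distinct blocks
        blocks = [df[df["state"] == s].assign(state=i)
                  for i, s in enumerate(sampled)]
        draws.append(did_estimate(pd.concat(blocks, ignore_index=True)))
    return np.std(draws, ddof=1)
```

Comparing this bootstrap standard error with the conventional one from a single regression shows how much the naive errors understate the uncertainty.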
Spatial Autocorrelation:
Difference-in-differences estimates have become a lot more trustworthy since then. There are still areas where the lesson has not fully sunk in, though. Autocorrelation exists in space too. Places which are close to each other are similar in lots of ways, not all of which can be observed. It is obvious that if you have a study looking at the effects of some event which happened in a place, dividing up the land into lots and lots of little units will lead us to reject the null hypothesis when we shouldn’t. The units of observation aren’t independent, so they don’t fully count toward our sample size. We might reject the null hypothesis based on nothing but noise.
This was explored in the paper “The Standard Errors of Persistence”, by Timothy Conley and Morgan Kelly. The “persistence” papers are (perhaps were) a genre of papers tracing differences today to some historical trauma in the past. A model paper in the genre is Voigtländer and Voth (2012), which traces the origins of anti-Semitism under the Nazis to whether pogroms happened centuries before. They are able to show that places which had pogroms during the Middle Ages were more anti-Semitic in the 1930s, as proxied for in a number of clever ways: violence during the 1920s, Nazi vote share, letters written to “Der Stürmer”, and so on. They can thus make a bigger argument that cultural traits are stable over time.
Conley and Kelly describe two basic reasons why persistence studies will get spurious results. The first was described above: the effective N is lower in practice than it appears. The other is the omnipresence of spatial trends. Imagine a plausibly exogenous east-west line drawn as a political boundary. If there is a north-south trend, independent of the political boundaries, you’re liable to show that the political division caused an effect, when it is really just the trend. (This is the easier thing to account for. One can simply control for latitude or whatever you will, or you can run “placebo tests”, which would be drawing the line arbitrarily above or below the actual line, and seeing if that changes the results.)
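Here is a toy version of that placebo check, using synthetic data which contains a pure north-south trend and no border effect at all (the latitudes and the choice of 45 degrees as the “real” border are arbitrary):

```python
# Placebo-border sketch: shift the east-west boundary up and down and
# re-estimate. If "effects" of similar size show up at arbitrary latitudes,
# a north-south trend, not the border, is doing the work.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
lat = rng.uniform(40, 50, n)
outcome = 0.3 * lat + rng.normal(size=n)   # pure latitude trend, no border effect

def border_effect(cutoff):
    north = (lat > cutoff).astype(float)
    fit = sm.OLS(outcome, sm.add_constant(north)).fit()
    return fit.params[1], fit.tvalues[1]

for cutoff in [43.0, 45.0, 47.0]:   # 45.0 plays the "real" border, the rest are placebos
    beta, t = border_effect(cutoff)
    print(f"cutoff {cutoff}: effect {beta:.2f} (t = {t:.1f})")
```

Every cutoff produces a “significant” jump, which is the tell that the trend, and not the border, is doing the work.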
The problem of spatial noise is much more serious. They generate totally synthetic data with spatial autocorrelation, and reject the null hypothesis far more often than is warranted. These errors are not just pushing a test statistic barely into significance: it is extremely common to see t-statistics above 4 or 5 (where the conventional 5% significance threshold is a t-statistic of about 1.96).
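Their exercise is easy to imitate. The sketch below draws two independent but spatially smooth noise fields over random map locations (a squared-exponential covariance with an arbitrary length scale, purely my choice) and regresses one on the other:

```python
# Two independently generated patches of spatially autocorrelated noise:
# smooth spatial noise alone produces wildly overconfident t-statistics.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
points = rng.uniform(0, 1, size=(n, 2))          # random locations on a unit map

# squared-exponential covariance: nearby points get similar values
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
cov = np.exp(-(dists / 0.2) ** 2) + 1e-6 * np.eye(n)
chol = np.linalg.cholesky(cov)

x = chol @ rng.normal(size=n)                    # a "historical" variable
y = chol @ rng.normal(size=n)                    # a "modern" outcome, unrelated to x

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(f"naive t-statistic on x: {fit.tvalues[1]:.1f}")   # often well beyond 1.96
```

Even though the two fields share nothing but their smoothness, t-statistics well past the 1.96 threshold come up routinely.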
To correct this, they propose a few solutions, each of which they test against synthetic data. First, use standard errors which account for spatial dependence (the so-called Conley corrections, named after one of the authors), allowing observations which are near one another to be correlated rather than counting each as an independent draw. This does not entirely reduce false positives to the appropriate level, but it does help. Second, in a move similar to the block bootstrap, you can collapse the data into blocks using k-means clustering (that is, you try to choose a set of centroids which minimizes the distance from each observation to its nearest centroid; finding the exact optimum is computationally hard, but good approximations are easy to compute).
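Continuing from the snippet above, here is a sketch of the k-means blocking fix: group the map locations into spatial clusters, then cluster the standard errors on those blocks. The choice of twenty clusters is arbitrary, and this is the clustering flavor of the correction rather than the distance-weighted Conley adjustment itself.

```python
# Group locations into spatial blocks with k-means, then cluster the
# standard errors on those blocks. Reuses points, x, y from the sketch above.
from sklearn.cluster import KMeans
import statsmodels.api as sm

blocks = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(points)

fit_blocked = sm.OLS(y, sm.add_constant(x)).fit(
    cov_type="cluster", cov_kwds={"groups": blocks}
)
print(f"block-clustered t-statistic on x: {fit_blocked.tvalues[1]:.1f}")
```

The coefficient itself does not change; only its standard error grows, and the t-statistic shrinks accordingly.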
And I don’t think the persistence literature has been totally unaware of this. Melissa Dell’s 2010 paper on the Peruvian mining mita, for example, implicitly deals with a spatial trend through the study design. There is a border, and she compares long-run outcomes of having one style of government or another; and since the border is a loop, there is another border further north, with the directions reversed. Controls for continent, latitude, and longitude are standard in the literature.
Even still, we must be very careful with autocorrelation in spatial data. Measurements must be independently drawn for them to fully count in determining significance. Make sure that you are not misled!