"Robustness" Is Not Replication
We must be clear about what we are doing.
You may have heard about the replication crisis. Lots of shoddy science, especially in softer fields like psychology, was being published as though nothing were wrong. People were building edifices of theory and follow-up work upon effects which didn’t actually exist.
When people refer to “the replication crisis” in science, several different things are being bundled together. First, there is the most literal sort of replication imaginable, which we might call reproduction. In your paper, you used some collection of data and some methods to reach some results. If someone else takes your data and your methods, they should arrive at the same results. Ideally, everything should be executable by downloading the replication package and pressing “go”. Relatedly, one might have intended to do one thing, but inadvertently programmed another: the code you provide runs, but it does not do what you said it was doing. Both failures are obviously extremely bad.
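As a sketch of what “press go” reproduction amounts to, consider a single deterministic script that maps the archived data to the published numbers, so anyone can rerun it and compare. Everything below (the seed, the simulated data standing in for an archived dataset, the reported figures) is invented purely for illustration:

```python
# Minimal sketch of "download everything and press go" reproduction.
# The data are simulated here; in a real replication package they would
# be read from the archived files instead.
import hashlib
import numpy as np

rng = np.random.default_rng(20240101)   # fixed seed: the run is deterministic

# Stand-in for the archived dataset.
n = 1_000
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# The analysis itself: OLS intercept and slope via least squares.
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Report the estimates plus a checksum of the output, so a re-run can be
# compared against the published numbers byte for byte.
report = f"intercept={beta[0]:.6f}, slope={beta[1]:.6f}"
print(report)
print("sha256:", hashlib.sha256(report.encode()).hexdigest()[:16])
```

If the numbers or the checksum differ on a re-run, either the package is incomplete or the code is not doing what the paper claims.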
Second, there are questions of generalizability. If you take some experiment which a scientist conducted, you would certainly hope that the effect happens again. It will not always happen again, because of randomness, but it should happen more often than by chance. This is often what psychologists mean when they refer to replication, since their experiments can be run again by another lab with relative ease. Economists don’t care quite as much about this, because our work is rarely on small-scale interventions that another researcher could easily rerun. We generally work with observational data, and can’t be sure whether something we discover should or should not generalize to another sample.
Third, there are questions of robustness. When we write down a model of the world, we are making some claims about how the world works. We cannot include everything, because we lack perfect knowledge. Instead, we want to find the minimum number of causal factors generating the observed data, and estimate them.
I think all three of these get lumped together in a way that is unhelpful. They are of very different severity. A failure of a paper to reproduce, whether because the code does not run or because the code is bugged, is of extreme importance and threatens the paper’s existence. This has gotten better, although it still needs a lot of work. I expect it to continue getting better because of AI – having a model scan through and essentially debug replication code is something I expect it to be very good at.
A failure to replicate in subsequent experiments is a plausible sign that the effect isn’t real. The study authors may have chosen to shelve uninteresting results, or even committed fraud. To the extent that researchers have discretion over what data they choose to analyze and what results they present, you will not be able to get the real effect simply from summing up the literature. I would say that this is bad.
Meanwhile, a failure to be robust to different assumptions is neither a good thing nor a bad thing in itself. In fact, a good paper *should not* be robust to everything. There must be assumptions whose alteration alters the results; if there were none, the paper would merely be uncovering something mechanical.
Robustness checks are about two things. Suppose there are several options one could choose in examining the data. If all of them seem equally good, but only the one the researcher chose delivers a significant result, that is a sign that they have not actually chosen the best assumptions, simply the ones that deliver the result they want.
On the other hand, the assumption might be both load-bearing and reasonable. Breaking it, and showing that the results change, isn’t interesting; instead, one has to argue that the assumption itself is unreasonable. There is a wide range of things which could be meddled with. Some should be obviously irrelevant, like the size of the units; others do real work, and changing them will naturally change the results.
This is why I am not over the moon about projects like “The Robustness Reproducibility of the American Economic Review”. They chose the correct target – they look only at papers which rely on reduced-form estimation, i.e. linear regressions with various choices over what to control for and what to include. They do not include structural identification papers, which would be an exaggerated form of the problem I am talking about – everyone agrees that all structural models are misspecified, and always will be, but nevertheless we have to do something, and this is the best we’ve got. If you go into a model and start changing assumptions, of course the results should change. That’s what the assumptions are for.
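To make the exercise concrete, here is a rough sketch of the control-set toggling such robustness projects perform, using simulated data. The variable names and data-generating process are made up for illustration; the point is only that the spread of estimates across specifications is not, by itself, evidence of anything:

```python
# Rerun the same reduced-form regression under every combination of a few
# candidate controls and watch how the coefficient of interest moves.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
z1 = rng.normal(size=n)                    # a genuine confounder
z2 = rng.normal(size=n)                    # an irrelevant control
treat = 0.6 * z1 + rng.normal(size=n)      # treatment depends on z1
y = 1.0 * treat + 1.5 * z1 + rng.normal(size=n)   # true effect of treat is 1.0

def ols_slope(y, treat, controls):
    """Coefficient on `treat` from an OLS regression with the given controls."""
    X = np.column_stack([np.ones(len(y)), treat] + list(controls))
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta[1]

candidates = {"z1": z1, "z2": z2}
for k in range(len(candidates) + 1):
    for names in combinations(candidates, k):
        est = ols_slope(y, treat, [candidates[m] for m in names])
        print(f"controls={list(names) or ['none']}: beta_treat={est:.3f}")
```

The estimates move around quite a bit across these specifications, but only the ones conditioning on the confounder z1 recover the true effect. Saying “the results range from A to Z” tells us nothing until we argue which specification is the right one.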
It is also possible for robustness checks to be bad science, and for the original to be good. I am in no way accusing the previously cited paper of this, but rather have in mind the “worm wars” of 2015, surveyed here. “Worms” is a 2004 paper by Michael Kremer and Ted Miguel on the delivery of deworming medication in Kenya. Because there is an element of herd immunity (you can’t catch worms from your dewormed classmates), they delivered it to randomized schools rather than individuals, and because of the large scale of the program, they delivered it over two years. They found a substantial effect on later school attendance, and then on income – hurray!
Later, a paper checking the robustness came out, and garnered quite a lot of media attention. A big study isn’t robust – oh no! But here’s the thing: it turned out that the replication authors had done everything which a corner-cutting academic might do in order to get a significant result, only in reverse. At every possible step, they made choices that left the study underpowered to detect any effect, and cut up the data in the way that most undermined the result. The case for deworming improving outcomes was, in fact, strengthened by their prodding, which showed how hard it was to overturn the basic result.
There is basically no way, in my view, to test robustness at scale. It is going to be a house-by-house fight, where the particulars of each paper matter. Simply saying that a paper “isn’t robust” is not enough – one must make the case that the assumptions being varied were unreasonable in the first place. Neither is it enough to say that if you flip all the different specification choices on and off, the results range from A to Z. One of those specification choices is the correct one. Our job is to find the correct specification, not to count how many changes a result is robust to. A result can be completely correct and yet fail to survive even small changes; conversely, a result can be robust to many different changes and yet be wrong. We need to understand what we are doing when we talk about replication.
