The Problem with Meta-Analyses
And the appeal of the One Big Study
Suppose we want to identify the effect of some treatment – be it a medicine, a government program, or a school. The ideal method of identification is one where we randomize the participants. This way we can ensure that the only difference, on average, between the treated and untreated groups is the treatment itself, and thus any difference in outcomes is due to the treatment alone. Unfortunately, we cannot always run the ideal experiment. Recruiting lots of participants is expensive, and small samples limit our ability to detect real effects.
A meta-analysis is a way of getting around these restrictions. Rather than rely upon the limited precision of any one study, we can simply combine them. Specifically, we boil the effect of the treatment down to a single parameter, which we then average across studies, weighting each estimate by its precision and thus, in effect, by its sample size. (I want to cite meta-analysis.cz prominently here, as it is the best resource on the topic.)
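To make the mechanics concrete, here is a minimal sketch of the fixed-effect, inverse-variance average a meta-analysis computes. The five effect sizes and standard errors are made up for illustration.

```python
import numpy as np

# Hypothetical effect estimates from five studies and their standard errors.
effects = np.array([0.12, 0.30, 0.05, 0.22, 0.18])
ses     = np.array([0.10, 0.15, 0.04, 0.20, 0.08])

# Fixed-effect meta-analysis: weight each study by its precision (1 / variance),
# so larger, more precise studies count for more.
weights   = 1 / ses**2
pooled    = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

print(f"pooled effect: {pooled:.3f} (SE {pooled_se:.3f})")
```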
Despite its promise, I believe this method fails in practice. The conditions under which we can aggregate studies are too restrictive. We start with a prior belief about the effect of some program. As we receive new information from experiments about the program's effectiveness, we shift our belief part of the way toward the effect found in each study. Because there is only one thing we care about, we are shifting along one dimension. If an experiment tests something similar to, but not quite the same as, another experiment, we are shifting along a different dimension. Because we don't know how close to parallel those dimensions are, we have no idea how much to shift our beliefs. There is no obvious way to add up the effects from UBI experiments in Compton and Dallas and Chicago to determine the effects of UBI in Virginia, much less to add in experiments testing things which are almost but not quite UBI, like unconditional cash transfers to poor families in Uganda.
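In the one-dimensional case the updating rule is clean: a normal prior and a normal study estimate combine as a precision-weighted average. A minimal sketch, with made-up numbers:

```python
import numpy as np

# Normal-normal Bayesian updating in one dimension: the posterior mean moves
# part of the way from the prior toward the study estimate, in proportion to
# the study's precision. Numbers are illustrative.
prior_mean, prior_se   = 0.0, 0.20
study_effect, study_se = 0.25, 0.10

w_prior, w_study = 1 / prior_se**2, 1 / study_se**2
posterior_mean = (w_prior * prior_mean + w_study * study_effect) / (w_prior + w_study)

# The problem: if the study measures a *related but different* quantity,
# there is no principled weight to use -- the update above no longer applies.
print(f"posterior mean: {posterior_mean:.3f}")
```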
This is especially acute in economics. If there are multiple studies of a medication over time, it is reasonable to think that they are all estimating the same effect. Even if the studies differ in some particulars, we might still be able to pool the experiments. (For example, we could weight the effect size by the dose.) Not so in economics. We would expect the effects of different policies in different places and times to vary substantially, and we are much less able to run experiments at all, often having to make do with natural experiments.
Second, we cannot trust studies unconditionally, especially as we get away from the top economics journals. (However bad you think economics is at producing believable, reproducible research, I assure you that the rest is so much worse.) The studies going into a meta-analysis are, at most, weighted by sample size. It could take just one shabby or fraudulent study to muck up the whole thing. No one reading a meta-analysis seriously goes and checks all of the underlying studies, except in the most egregious cases. Nor can we rely upon the author to vet the quality of every study that goes in. The power to boot studies for vague reasons is also the power to boot studies for disagreeing with the sought-after effect sizes.
Additionally, whether something gets published is correlated with whether it finds significant results, and failing to correct for this will bias your estimates. This is something researchers have gotten much better at. Weighting by precision means that the larger but less precise estimates which get published are downweighted, and it can even improve accuracy to drop the least precise 90% of the data. I don't want to harp on this too much, because I do generally believe the corrections in an idealized case. What complicates things is that once the studies are no longer measuring the same effect, we cannot make assumptions about how the estimates should be distributed. A pattern that looks like strong publication bias – small studies reporting big effects – need not be evidence of bias at all; genuine heterogeneity across studies can produce the same pattern.
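To see how heterogeneity can masquerade as publication bias, here is a small simulation: every study gets published, but true effects happen to be larger in smaller studies, and an Egger/FAT-style regression of estimates on standard errors flags "asymmetry" anyway. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_studies = 500

# Heterogeneity that mimics publication bias: suppose smaller pilot studies
# tend to evaluate versions of the program with genuinely larger effects,
# with no selective publication at all.
ses          = rng.uniform(0.02, 0.30, n_studies)   # small study = big SE
true_effects = 0.1 + 0.8 * ses                      # true effect rises with SE by construction
estimates    = true_effects + rng.normal(0, ses)    # observed estimates

# Egger-style check: regress the estimate on its standard error.
# A positive slope is the classic "funnel asymmetry" sign of publication bias --
# but here it shows up even though every study was published.
slope = np.polyfit(ses, estimates, 1)[0]
print(f"funnel-asymmetry slope: {slope:.2f}  (nonzero without any selection)")
```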
I will also editorialize slightly – the vast majority of meta-analyses are terrible. As this survey article notes, there were 107,000 different meta-analyses published in 2022. Half of them don’t discuss publication bias. They are so popular because they can totally replace independent thought. They’re the ideal thing for an academic who has to publish something – anything – to churn out. If we are relying upon them to rigorously filter and consider the importance of different studies, we’ve already lost.
Economists aren’t entirely unaware of these problems. T. D. Stanley (2001) has a quite readable piece in the JEP on this. What one can do is break down the methodological choices in a study, and make claims about how important each methodological choice is. This is only useful, though, when we are all effectively studying the same thing with different methods. It is largely unhelpful when we are evaluating similar policies in different places and times, especially since applying fixed effects for those things would return you to having N of 1 studies you don’t know how to combine.
What I would instead prefer is One Big Paper. Everyone can read that One Big Paper, argue about it, and eventually decide whether the case it makes really is airtight. You will still have to use your judgement in combining evidence from multiple sources, but I think this judgement is better than chucking a pile of studies in the blender and seeing what point estimate gets spit out. What this looks like in practice is that everyone cites Chetty and Hendren (2018), and doesn't try to recreate Chetty and Hendren in the aggregate by citing 30 different studies, each examining one small part of what they studied.
Plus, the trouble with meta-analyses is that you won't actually have just one of them. For any sufficiently big question, you'll have multiple meta-analyses, all differing in which studies they include and how they weight them. I came across this paper from Spencer Banzhaf (2021), who notes with dismay that there are now five meta-analyses summarizing the 800 unique estimates of the value of a statistical life. Nevertheless, he has a solution – his paper is a meta-analysis of the meta-analyses. In 20 years, his paper shall be fodder for a future generation's meta-meta-meta-analysis.
A brief aside: I am less skeptical of meta-analyses which estimate some parameter used in a macroeconomic model. This might be a bunch of studies testing what the elasticity of labor supply is, or what people's discount rates are. We have to bow to practicality here. The parameter isn't what we are interested in, in and of itself. If you disagree with the parameter, the macroeconomist will inevitably have to show how their results vary with different calibrations.
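A toy illustration of what I mean by showing results under different calibrations. This is a stand-in labor supply curve under quasi-linear utility, not any real macro model; the point is just that the conclusion swings with the assumed elasticity.

```python
# Toy calibration sweep: how a model moment changes with the assumed labor
# supply elasticity. With quasi-linear utility and a unit wage, hours worked
# satisfy h = (1 - tax)^elasticity.
def hours_worked(elasticity: float, tax_rate: float = 0.30) -> float:
    return (1 - tax_rate) ** elasticity

for eps in [0.25, 0.5, 1.0, 2.0]:
    print(f"elasticity {eps:4.2f}: hours at 30% tax = {hours_worked(eps):.3f}")
```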
In the business of program evaluation, I don't think meta-analyses are a replacement for knowing the literature, knowing which studies are the good ones and why, and intuitively weighing disparate programs. A meta-analysis serves mainly to lend false precision to a fuzzy question.

If I understood the main point of your article correctly, you are arguing that there is more to external validity than merely aggregating a bunch of results together. This is clearly true. But there is a general prescription for transporting results between heterogeneous populations, as long as you have a decent causal model of the underlying mechanism: https://arxiv.org/abs/1503.01603. If anything, this is one more reason for doing economics with DAGs.
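For concreteness, a minimal sketch of the transport idea: estimate stratum-level effects in the study population, then re-weight them by the target population's covariate shares. The strata, effects, and shares below are hypothetical, and this assumes the strata capture the relevant effect modifiers.

```python
# Transport-formula sketch: re-weight stratum-level effects from the study
# population by the covariate distribution of the target population.
strata = ["low_income", "high_income"]
effect_in_source = {"low_income": 0.30, "high_income": 0.10}  # effects by stratum (hypothetical)
p_target         = {"low_income": 0.70, "high_income": 0.30}  # target covariate shares (hypothetical)

transported = sum(p_target[z] * effect_in_source[z] for z in strata)
print(f"transported effect: {transported:.2f}")
```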
>Second, we cannot trust studies unconditionally, especially as we get away from the top economics journals.
Is it even true that you can trust the top journals more? The top journals are more likely to publish interesting and novel methods and results, which are more likely to be wrong.
In the social sciences other than Econ, it seems that papers in top journals replicate less often, although I think economists are likely a bit better.
https://www.frontiersin.org/journals/human-neuroscience/articles/10.3389/fnhum.2018.00037/full