Why I worry experimental social science is headed in the wrong direction

I joke with my graduate students that they need to pick up as many technical skills as possible while they are still PhD students, because the moment they graduate it’s a slow decline into obsolescence. And of course by “joke” I mean “cry on the inside because it’s true”.

Take experiments. Every year the technical bar gets raised. Some days my field feels like an arms race to make each experiment more thorough and technically impressive, with more and more attention to formal theories, structural models, pre-analysis plans, and (most recently) multiple hypothesis testing. The list goes on. In part we push because we want to do better work. Plus, how else to get published in the best places and earn the respect of your peers?

It seems to me that all of this is pushing social scientists to produce better quality experiments and more accurate answers. But it’s also raising the size and cost and time of any one experiment.

This should lead to fewer, better experiments. Good, right? I’m not sure. Fewer studies is a problem if you think that the generalizability of any one experiment is very small. What you want is many experiments, in many places and with many kinds of people, that help triangulate an answer.

The funny thing is, after all that pickiness about getting the perfect causal result, we then apply it in the most unscientific way possible. One example is deworming. It’s only a slight exaggeration to say that one randomized trial on the shores of Lake Victoria in Kenya led some of the best development economists to argue we need to deworm the world. I make the same mistake all the time.

We are not exceptional. All of us—all humans—generalize from small samples of salient personal experiences. Social scientists do it with one or two papers. Usually ones they wrote themselves.

[Read the follow-up post here]

The latest thing that got me thinking in this vein is an amazing new paper by Alwyn Young. The brave masochist spent three years re-analyzing more than 50 experiments published in several major economics journals, and argues that a large share of the regressions that claim statistically significant results don’t actually have them, and that half or more of the papers cannot reject the null of no treatment effect at all.

My first reaction was “This is amazingly cool and important.” My second reaction was “We are doomed.”

Here’s the abstract:

I follow R.A. Fisher’s The Design of Experiments, using randomization statistical inference to test the null hypothesis of no treatment effect in a comprehensive sample of 2003 regressions in 53 experimental papers drawn from the journals of the American Economic Association.

Randomization tests reduce the number of regression specifications with statistically significant treatment effects by 30 to 40 percent. An omnibus randomization test of overall experimental significance that incorporates all of the regressions in each paper finds that only 25 to 50 percent of experimental papers, depending upon the significance level and test, are able to reject the null of no treatment effect whatsoever. Bootstrap methods support and confirm these results.

The basic story is this. First, papers often look at more than one treatment and many outcomes. With so many tests, some are bound to look statistically significant by chance. What’s more, when you see a significant effect of a treatment on one outcome (like earnings), you are more likely to see a significant effect on a related outcome (like consumption), and if you treat these like independent tests you overstate the significance of the results.
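To see the first point concretely, here is a back-of-the-envelope illustration of my own (the numbers are not from Young’s paper): if every null hypothesis were true and the tests were independent, the chance of at least one spurious “significant” result climbs quickly with the number of tests. Correlated outcomes change the arithmetic somewhat, but they also mean the tests are not independent pieces of evidence, which is the second half of the point.

```python
# Rough illustration (not from the paper): chance of at least one false
# positive when every null is true and the tests are independent.
alpha = 0.05
for n_tests in (1, 5, 20, 40):
    fwer = 1 - (1 - alpha) ** n_tests  # family-wise error rate
    print(f"{n_tests:>2} tests at alpha={alpha} -> "
          f"P(at least one false positive) = {fwer:.2f}")
```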

Second, the ordinary statistics most people use to estimate treatment effects are biased in favor of finding a result. When we cluster standard errors or make other corrections, we rely on assumptions that simply don’t apply to experimental samples.

One way to deal with this is something called randomization inference (RI). You take your sample, with its actual outcomes. Then you run a thought experiment: you randomly re-assign treatment thousands of times and compute a treatment effect each time. Most of these placebo randomizations will produce effects close to zero; a few, purely by chance, will produce large ones. You then compare your actual treatment effect to this distribution of placebo effects and ask: what are the chances I would get an effect this large by chance alone?
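For the mechanics, here is a minimal sketch of the idea in Python on simulated data. Everything here (the difference-in-means estimator, sample size, effect size, number of re-randomizations) is an assumption for illustration, not anything from Young’s paper or any particular study.

```python
# A minimal randomization inference sketch on simulated data.
import numpy as np

rng = np.random.default_rng(0)

# Fake experiment: 200 units, half treated, modest true effect.
n = 200
treated = rng.permutation(np.repeat([0, 1], n // 2))
y = 1.0 + 0.3 * treated + rng.normal(size=n)

def diff_in_means(outcome, assignment):
    """Estimated treatment effect: mean of treated minus mean of control."""
    return outcome[assignment == 1].mean() - outcome[assignment == 0].mean()

observed = diff_in_means(y, treated)

# Re-randomize assignment many times, holding outcomes fixed (the sharp null
# of no effect for any unit), and recompute the estimate each time.
n_permutations = 5000
placebo = np.array([diff_in_means(y, rng.permutation(treated))
                    for _ in range(n_permutations)])

# Two-sided p-value: share of placebo estimates at least as large in
# magnitude as the effect we actually observed.
p_value = np.mean(np.abs(placebo) >= np.abs(observed))
print(f"observed effect = {observed:.3f}, randomization p-value = {p_value:.3f}")
```

The appeal is that the p-value comes from the randomization itself rather than from asymptotic formulas (though, as a commenter notes below, this version tests the sharp null of no effect for any unit).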

RI has been around for a while, but very few experimenters in economics have adopted it. It’s more common but still unusual in political science. Here is Jed Friedman with a short intro. The main textbook is by Don Green and I recommend it to newcomers and oldcomers alike. I have been reading it all week and it’s a beautiful book.

Alwyn Young has very usefully asked what happens if we apply RI (and other methods) to existing papers. I don’t completely buy his conclusion that half the experiments are actually not statistically significant. Young analyzed about 2,000 regressions across 53 papers, or roughly 40 regressions per paper. Not all regressions are equal. Some outcomes we don’t expect treatment to affect, for instance. So Young’s tests are probably too stringent. Pre-analysis plans are designed to help fix this problem. But he has a good point. And work like this will raise the bar for experiments going forward.

But I don’t want to get into that. Rather, I want to talk about why this trend worries me.

  • I predict that, to get published in top journals, experimental papers are going to be expected to confront the multiple treatments and multiple outcomes problem head on.
  • This means that experiments starting today that do not tackle this issue will find it harder to get into major journals in five years.
  • I think this could mean that researchers are going to start to reduce the number of outcomes and treatments they plan to test, or at least prioritize some tests over others in pre-analysis plans.
  • I think it could also push experimenters to increase sample sizes in order to meet these more stringent standards. If so, I’d expect this to reduce the number of field experiments that get done.
  • Experiments are probably the field’s most expensive kind of research, so any increase in demands for statistical power or technical improvements could have a disproportionately large effect on the number of experiments that get done.
  • This will probably put field experiments even further out of the reach of younger scholars or sole authors, pushing the field toward larger and more team-based work.
  • I also expect that higher standards will be disproportionately applied to experiments. So in some sense it will raise the bar for some kinds of work over others. Younger and junior scholars will have stronger incentives to do observational work.
  • On some level, this will make everyone more careful about what is and is not statistically significant. More precision is a good thing. But at what cost?
  • Well, for one, I expect it to make experiments a little more rote and boring.
  • I can tell you from experience it is excruciating to polish these papers to the point that a top journal and its exacting referees will accept them. I appreciate the importance of this polish, but I have a hard time believing the current state is the optimal allocation of scholarly effort. The opportunity cost of time is huge.
  • Also, all of this is fighting over fairly ad hoc thresholds of statistical significance. Rather than think of this as “we’re applying a common standard to all our work more correctly”, you could instead think of this as “we’re elevating the bar for believing certain types of results over others”.
  • Finally, and most importantly to me, if you think that the generalizability of any one field experiment is low, then a large number of smaller but less precise experiments in different places is probably better than a smaller number of large, very precise studies.

There’s no problem here if you think that a large number of slightly biased studies are worse than a smaller number of unbiased and more precise studies. But I’m not sure that’s true. My bet is that it’s false. Meanwhile, the momentum of technical advance is pushing us in the direction of fewer studies.

I don’t see a way to change the professional incentives. I think the answer so far has been “raise more money for experiments so that the profession will do more of them.” This is good. But surely there are better answers than just throwing more fuel on the fire.

Incentives for technical advances in external rather than just internal validity strike me as the best investment right now. Journal editors could play a role too, rewarding the study of scale-ups and replications (effectiveness trials) as much as new and counter-intuitive findings (efficacy trials).

Of course, every plea for academic change ends with “more money for us” and “journal editors should change their preferences.” This is a sign of either lazy or hopeless thinking. Or, in my case, both.

I welcome ideas from readers, because to me the danger is this: that all the effort to make experiments more transparent and accurate in the end instead limits how well we understand the world, and that a reliance on too few studies makes our theory and judgment and policy worse rather than better.

Update: In a follow-up post I round up the many comments and papers people shared.

260 Responses

  1. Good that you confess to overgeneralizing your own results, and to having written a lazy post full of hopeless thinking. I think you are right on all three points. As to the future of experiments, I hope that collaboration with governments will mean more of the interventions that would happen anyway get randomized, and that administrative data will be used more and more to track outcomes. Closer collaboration with governments will both decrease the cost of experiments and improve their implementing capacity. Stop maximizing the ranking of the journal where you get published and start maximizing the effectiveness of development interventions in the real world.

  2. I guess I am lucky I first did experiments 13 years ago, when it was FUN. Cheaper experiments? Do them in cheaper countries, e.g. one where the minimum wage is $200/month. *email me*

  3. On the virtues of many underpowered studies – see what happened in social psych. It’s a disaster, especially when combined with file-drawer effects and publication bias.

  4. RE: Randomization Inference. Peter Kennedy, who died unexpectedly 5 years ago, was beating this drum 20 years ago: see his “Randomization Tests in Econometrics,” Journal of Business & Economic Statistics, Jan 1995, pp. 85-94; “Randomization Tests for Multiple Regression” (with B.S. Cade), Communications in Statistics – Simulation and Computation (1996), 25:4, pp. 923-929; or p. 69 of his Guide to Econometrics.

  5. I guess it would be opportune for me to point to the hours-old news of an RFP for LOW-COST RCTs.
    The Laura and John Arnold Foundation (LJAF) is significantly expanding its investment in low-cost randomized controlled trials (RCTs) designed to build important evidence about “what works” in U.S. social spending. They are partnering again in this effort with the Annie E. Casey Foundation.
    http://www.arnoldfoundation.org/wp-content/uploads/Request-for-Proposals-Low-Cost-RCT-FINAL.pdf

  6. Interesting discussion. As a development practitioner but *not* an economist, I have also worried that the increasingly high bar for these experiments (and therefore the elevation of the few technically top-notch experiments) makes it challenging for us non-economists to interpret them — i.e. how much should we change our approaches elsewhere based on that really interesting but very specific-sounding experiment in rural Kenya? It’s hard to know or challenge in some cases.

  7. “A number of the comments have asserted that many underpowered studies are less accurate than a single large study. I regard this as an empirical question, ”

    Great post, but I’m not sure that this bit is true; it seems like an econometric question to me, at least with regard to internal validity. If you take a meta-estimate (of some beta) from your low-powered studies by averaging across them, then whether the meta-estimate is more precise than the One Big Study is gonna depend on the number of low-power studies and the amount of extra power the big study has. You can compute the quality-quantity trade-off with a back-of-the-do-file estimate via simulation (or analytically I guess); a rough sketch follows below. Hell, throw in a cost-of-producing-studies bit and you could chuck the whole trade-off into a Researcher Optimisation Problem. With fixed costs of projects then I’d guess the One Big Study is cheapest for the same power. Of course that ignores those tiny problems of external validity, multiple comparisons, and result-selective publication …
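    For what it’s worth, here is a rough version of that simulation in Python. The effect size, noise, and sample sizes are made up purely for illustration. With the same total sample and no cross-site heterogeneity, the equal-weighted average of ten small studies comes out about as precise as the One Big Study, so the trade-off really does hinge on fixed costs, power per study, and how much effects vary across sites.

```python
# Rough simulation: precision of one big study vs. the average of ten
# small ones, with the same total sample and a homogeneous true effect.
import numpy as np

rng = np.random.default_rng(1)
tau, sigma = 0.2, 1.0   # true treatment effect and outcome noise (made up)
reps = 2000             # Monte Carlo replications

def estimate(n_per_arm):
    """Difference in means from one simulated two-arm experiment."""
    control = rng.normal(0.0, sigma, n_per_arm)
    treat = rng.normal(tau, sigma, n_per_arm)
    return treat.mean() - control.mean()

# One big study with 1,000 units per arm.
big = np.array([estimate(1000) for _ in range(reps)])

# Ten small studies with 100 units per arm each, averaged with equal weights
# (a fixed-effect meta-estimate with identical study sizes).
small_avg = np.array([np.mean([estimate(100) for _ in range(10)])
                      for _ in range(reps)])

print(f"sd of big-study estimate:       {big.std():.4f}")
print(f"sd of ten-small-study average:  {small_avg.std():.4f}")
```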

  8. A number of the comments have asserted that many underpowered studies are less accurate than a single large study. I regard this as an empirical question, one that for all I know has been answered in medical science but has yet to find an answer in social science. Actually that’s not true. Angus Deaton’s critique of experiments is largely based on the field’s fetish for internal validity and sloppy external validity. Lant Pritchett and Justin Sandefur have also advanced evidence that many potentially biased observational studies are collectively more reliable than a handful of experiments. I’m not sure who is correct, and the answer probably has yet to arrive. But we invest very little as a profession in how to scientifically and reliably maximize out-of-sample relevance. So all this talk that a few good studies are better strikes me as faith, not fact.

  9. Doesn’t RI test the sharp null hypothesis? That is probably not appropriate for all these experiments?

  10. Here’s another study worth looking at: http://www.smithsonianmag.com/ist/?next=/science-nature/scientists-replicated-100-psychology-studies-and-fewer-half-got-same-results-180956426/

    I think that Chris Blattman has hit the nail on the head in pointing out the problem with a focus on internal validity rather than external validity. Science is only about external validity: does the theory model reality? Internal validity assumes that methods which have proved useful in the past in generating external validity are a reliable indicator of external validity. But social studies of science seem to suggest that there is no “scientific method” which guarantees good results.

    To this I might add the following:

    In development, we are interested in furthering a program of economic and social improvement sometimes called “the great divergence”. Historically, current modes of social science experimentation have played no role in development. The assumption that they will now is just that: an assumption.

    Another domain in which groups of people require complex decision-making in order to make real-world impact on human populations is business. Business has proved spectacularly successful in doing so. Its decisions are made by leaders and managers synthesising information from a wide variety of sources, very little of it from social science.

    Finally, I recommend the statistician Nassim Nicholas Taleb for his insights on how even the smartest people in the world, playing for extremely high stakes, with visible outcomes, manage to delude themselves with artifacts of statistical inference.

  11. Excellent post.

    Academia is just one piece in the puzzle. The continued progression towards precision within academia has its benefits. We learn a lot from the innovations in methods and the careful scrutiny of internal validity that academia has brought to bear. But in a world of low generalizability, as you say, we also need much *more* evaluation to be able to make informed decisions on where the vast majority of money is flowing.

    The question is whether one institution (“social science academia”) can do both. We need the precision of those fighting to get into the best journals AND we need an army of high-quality “repeaters” that can answer the policy-relevant questions in a rigorous but accessible (low-cost) way. It’s not clear that academia as an institution can create the right incentives to do both.

    Some initiatives are worth noting. The White House/OMB and the Coalition for Evidence-based Policy have been spearheading low-cost RCTs (http://coalition4evidence.org/low-cost-rct-competition/). IDinsight has been doing excellent work internationally using evidence to inform decision making. In one example, they ran a multiple-treatment-arm RCT to test different theories of how best to distribute bednets in Zambia. This is the perfect type of highly relevant question, but with (potentially) low generalizability. (www.idinsight.org)

    So maybe academia will have the answer, but maybe we should be on the lookout for the solution coming from outside academia as well.

  12. Exceptional post. I think there’s a tension between rigor and usefulness in applied social science research and the balance is tilted far too much to the former right now.

  13. In my own field of ecology, there’s been an explosion of interest in large distributed experiments. These are centrally-coordinated, collaborative projects in which the same small, inexpensive experiment is conducted simultaneously at many different sites around the world. The Nutrient Network (“NutNet”) project pioneered this approach in ecology: https://dynamicecology.wordpress.com/2011/10/20/thoughts-on-nutnet/ The hope is that by standardizing the design, you make the results more comparable across sites and cut down on confounding. And by conducting the experiment across many sites, you have a large total sample size, plus you can assess among-site variation in treatment effects. Finally, because the experiments concerned are fairly small and cheap, they don’t have large opportunity costs for any individual investigator.

    I’m curious: is there any interest in such distributed experiments in your own field?

    Your post raises a range of issues, of course; I recognize that distributed experiments only address some of them.

  14. I’m glad you raised the point. Science is just one way of observing the world – it proceeds through fragmentation, specialization, and precision in observing nature/the world, and therefore it won’t be able to understand the intricate relationships of nature/the world, and won’t be able to understand nature or the world holistically.

  15. I love you posting this b/c it takes guts to defend more/quicker research in the face of quality concerns, and while many people agree w/ you they cower in the face of a simple quality argument. But I still disagree. Promoting proliferation ignores that the world simply does NOT focus on external validity; I just don’t have confidence that we’ll reverse the focus on one-study wonders towards a greater focus on generalizability b/c the incentives just aren’t there where the big bucks are spent, like the Bank, bilaterals, etc. Also, having seen a lot of poorly designed studies, it’s not just about crappy statistical jujitsu but also some really crappy data collection and projections extrapolating from that data collection. I don’t see much attention to the data collection, but that is what has given me the most heartache and questions when working on studies.

    All to say, I’d much rather we do fewer studies and do them really, really flipping well on the data and the stats, AND only when the question we’re asking and the use to which we intend to put the study point to an RCT/quasi-experiment as the best method, worth the time, investment, and typical implications it has for projects.

  16. I’m going to make the cardinal sin of commenting before I read the actual paper, but I think you’re overreacting to the “multiple testing” aspect. There’s a large literature on how to deal with these issues in clinical trials, and you can recover power through the inexpensive approach of planning your data analysis in advance, testing hypotheses in order of importance, and stopping the first time you fail to reject (see the sketch after the references below). See [1,2]. There are variations that allow more nuanced inference as well.

    The benefit is that you get the same power for the most important hypothesis that you would get if you ignore multiplicity, so sample sizes don’t need to change.

    But I definitely agree that the same standards (or higher) should be applied to observational studies as well. Just because these issues are more transparent in experiments doesn’t mean that they’re magically resolved elsewhere.

    [1]: http://biomet.oxfordjournals.org/content/95/1/248.abstract
    [2]: https://projecteuclid.org/euclid.aos/1291126973
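    For concreteness, here is a minimal sketch of the fixed-sequence (gatekeeping) idea described above. The function and the p-values are placeholders of mine, not anything from the cited papers.

```python
# Fixed-sequence testing: pre-specify the order of hypotheses,
# test each at the full alpha, and stop at the first failure to reject.
def fixed_sequence_test(ordered_p_values, alpha=0.05):
    """Return rejection decisions; once one test fails, every later
    hypothesis in the pre-specified sequence is not rejected."""
    decisions, still_testing = [], True
    for p in ordered_p_values:
        reject = still_testing and p <= alpha
        decisions.append(reject)
        still_testing = reject
    return decisions

# Example: primary outcome first, then two secondary outcomes.
print(fixed_sequence_test([0.01, 0.08, 0.001]))  # [True, False, False]
```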

  17. My worry with the preponderance of RCTs is that we are getting better and better at getting answers to the wrong questions. A world in which more researchers are taking on really good, observational work and a smaller number are doing really good experimental work doesn’t sound to me like such a bad thing.

  18. Loved the post (I’m an old student from Yale). I don’t see the issue in pre-specifying which tests are more important than others (some have a chance to be “significant” and others just “post-hoc.”) I work running business experiments for a tech company and this is what we currently do (although it’s still early days for us).

    If causality determinations have always been over-applied, I don’t see why that would stop just because certain findings are post-hoc. In other words, you could continue to learn from findings that are now more accurately portrayed as post-hoc. More than that, I think pushing more results to be classified as post-hoc creates very positive incentives: it pushes you to really think about your theory of change in pre-specifying which outcome is going to be the main one, and it pushes you to replicate experiments if you want to declare a post-hoc effect a significant one (a post-hoc finding becomes exploratory research for a confirmatory experiment).

    If you use the term “significance,” it means something about the likelihood of observing the data by chance (even if you don’t think that’s an important something). Statistical corrections for multiple hypothesis testing are a more accurate representation of ‘significance.’ If you don’t pre-specify a test, all of them fail to meet a more rigorous standard of significance.

  19. 1. One big study you can LEARN from, regardless of the outcome, is better than 100 underpowered studies that you throw into the drawer if you cannot find an effect.

    2. Collecting more variables can be better if it is accompanied by the right statistical tools for drawing conclusions. There has been a lot of progress in Machine Learning on finding predictive power without over-fitting the data.

  20. Two thoughts: Every datapoint has an opportunity cost that is rarely considered; and while the general perception is that flawed data are better than no data, I remain unconvinced. The mistake there is thinking you know something about whatever is being examined, when in reality it isn’t so. I am sure there’s a beautiful Bayesian framework to incorporate that prior flawed knowledge, but… what if you don’t know how deeply flawed it is?

  21. 1. I gather from other blogs that this argument is playing out in exactly the same way in biology and genetics. Some argue, if we have to do properly powered studies, then only big labs at R1s will be able to do them. Others say, tough.

    2. I believe what you call “lots of small [underpowered] studies,” Andrew Gelman (for example) would call “chasing noise,” and he has given general reasons to think that, yes, it can be worse than nothing.