Why Rigorous Studies Get Smaller Effect Sizes


When I was a kid, I was a big fan of the hapless Washington Senators. They were awful. Year after year, they were dead last in the American League. They were the sort of team that builds diehard fans not despite but because of their hopelessness. Every once in a while, kids I knew would snap under the pressure and start rooting for the Baltimore Orioles. We shunned them forever, right up to this day.

With the Senators, any reason for hope was prized, and we were all very excited when some hotshot batter was brought up from the minor leagues. But they almost always got whammed, sent back down or traded, and never heard from again. I'm sure this happens on every team. In fact, I just saw an actual study comparing batting averages for batters in their last year in the minors to their first year in the majors. The difference was dramatic. In the majors, the very same batters had much lower averages. The impact was equivalent to an effect size of -0.70. That's huge. I'd call this effect the Curse of the Major Leagues.
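For readers unfamiliar with the metric, an effect size here is a standardized mean difference (Cohen's d): the difference between two group means divided by their pooled standard deviation. The numbers below are purely hypothetical, chosen only to show how a -0.70 could arise; they are not from the study mentioned above.

```python
# Hypothetical summary statistics (illustration only, not from the cited study):
minors_mean = 0.280   # batting average, last year in the minors
majors_mean = 0.245   # batting average, first year in the majors
pooled_sd = 0.050     # pooled standard deviation of batting averages

# Cohen's d: standardized mean difference
d = (majors_mean - minors_mean) / pooled_sd
print(round(d, 2))  # -0.7
```

The sign is negative because performance dropped in the tougher league; dividing by the pooled standard deviation is what lets effect sizes from different outcome measures (batting averages, test scores) be compared on a common scale.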

Why am I carrying on about baseball? I think it provides an analogy to explain why large, randomized experiments in education characteristically produce lower effect sizes than studies that are quasi-experimental, smaller, or (especially) both.

In baseball, batting averages decline because the competition is tougher. The pitchers are faster, the fielders are better, and maybe the minor league parks are smaller (I don't know). In education, large randomized experiments are tougher competition, too. Randomized experiments are tougher because the experimenter doesn't get the benefit of self-selection by the schools or teachers choosing the program. And because everyone has to start fresh at the beginning of the study, the experimenter also loses the benefit of working with teachers who may already be experienced in the experimental program.

In larger studies, the experimenter has more difficulty controlling every variable to ensure high-quality implementation. Large studies are also more likely to use standardized tests rather than researcher-made tests, and if these are state tests used for accountability, the control group can be assumed to be trying just as hard as the experimental group to improve students' scores on the objectives those tests measure.

What these problems mean is that when a program is evaluated in a large randomized study and the results are significantly positive, this is cause for real celebration, because the program had to overcome much tougher competition. The successful program is far more likely to work in realistic settings at serious scale because it has been tested under more life-like conditions. Other experimental designs are also valuable, of course, if only because they act like the minor leagues, nurturing promising prospects and then sending the best to the majors where their mettle will really be tested. In a way, this is exactly the tiered evidence strategy used in Investing in Innovation (i3) and in the Institute of Education Sciences (IES) Goal 2-3-4 progression. In both cases, smaller grants are made available for development projects, which are nurtured and, if they show promise, may be funded at a higher level and sent to the majors (validation, scale-up) for rigorous, large-scale evaluation.

The Curse of the Major Leagues was really just the product of a system for fairly and efficiently bringing the best players into the major leagues. The same idea is the brightest hope we have for offering schools throughout the U.S. the very best instructional programs on a meaningful scale. After all those years rooting for the Washington Senators, I'm delighted to see something really powerful coming from our actual Senators in Washington. And I don't mean baseball!

This blog is sponsored by the Laura and John Arnold Foundation