03/31/2016 02:23 pm ET Updated Dec 06, 2017

Is All That Glitters Gold-Standard?


In the world of experimental design, studies in which students, classes, or schools are assigned at random to experimental or control treatments (randomized controlled trials, or RCTs) are often referred to as meeting the "gold standard." Programs with at least one randomized study with a statistically significant positive outcome on an important measure qualify as having "strong evidence of effectiveness" under the definitions in the Every Student Succeeds Act (ESSA). RCTs virtually eliminate selection bias in experiments. That is, readers don't have to worry that the teachers using an experimental program might have already been better or more motivated than those who were in the control group. Yet even RCTs can have flaws serious enough to call their outcomes into question.
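The logic of random assignment can be sketched in a short simulation. All numbers below are invented for illustration, not drawn from any study: when more effective teachers self-select into a program, a pre-existing difference masquerades as a treatment effect, while a coin flip balances the groups on average.

```python
import random
random.seed(42)

# Hypothetical population: each teacher has a baseline "effectiveness" score.
teachers = [random.gauss(50, 10) for _ in range(10000)]

# Self-selection: suppose more effective teachers are likelier to volunteer
# for the new program, so the groups differ before the program does anything.
volunteers = [t for t in teachers if t > 50]
non_volunteers = [t for t in teachers if t <= 50]

# Random assignment: flip a coin for every teacher instead.
random.shuffle(teachers)
treatment, control = teachers[:5000], teachers[5000:]

mean = lambda xs: sum(xs) / len(xs)
print(f"self-selected baseline gap: {mean(volunteers) - mean(non_volunteers):.1f}")
print(f"randomized baseline gap:    {mean(treatment) - mean(control):.1f}")
```

The self-selected groups start far apart before any treatment occurs; the randomized groups start essentially even, so any later difference can be attributed to the program.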

A recent article by distinguished researchers Alan Ginsburg and Marshall Smith severely calls into question every single elementary and secondary math study accepted by the What Works Clearinghouse (WWC) as "meeting standards without reservations," which in practice requires a randomized experiment. If they were right, then the whole concept of gold-standard randomized evaluations would go out the window, because the same concerns would apply to all subjects, not just math.

Fortunately, Ginsburg & Smith are mostly wrong. They identify, and then discard, 27 studies accepted by the WWC. In my view, they are right about five. They raise some useful issues about the rest, but not damning ones.

The one area in which I fully agree with Ginsburg & Smith (G&S henceforth) relates to studies that use measures made by the researchers. In a recent paper with Alan Cheung and an earlier one with Nancy Madden, I reported that use of researcher-made tests resulted in greatly overstated effect sizes. Neither WWC nor ESSA should accept such measures.
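An effect size here means the standardized mean difference (Cohen's d): the treatment-control gap divided by the pooled standard deviation. A test narrowly aligned to the experimental curriculum inflates the numerator without the program being any better on broad outcomes. A minimal sketch with invented scores (all numbers hypothetical, not from any of the studies discussed):

```python
import statistics

def cohens_d(treatment, control):
    """Standardized mean difference: (M_t - M_c) / pooled SD."""
    mt, mc = statistics.mean(treatment), statistics.mean(control)
    nt, nc = len(treatment), len(control)
    pooled_var = ((nt - 1) * statistics.variance(treatment)
                  + (nc - 1) * statistics.variance(control)) / (nt + nc - 2)
    return (mt - mc) / pooled_var ** 0.5

# Illustrative scores on a broad standardized test vs. a researcher-made
# test tightly aligned to the experimental curriculum.
broad_t, broad_c = [52, 55, 49, 58, 51, 54], [50, 53, 47, 55, 49, 51]
narrow_t, narrow_c = [54, 57, 51, 60, 53, 56], [50, 53, 47, 55, 49, 51]

print(f"effect size, broad test:   {cohens_d(broad_t, broad_c):.2f}")
print(f"effect size, aligned test: {cohens_d(narrow_t, narrow_c):.2f}")
```

The computation is identical in both cases; only the alignment of the test changes. The larger figure reflects what was tested, not a stronger program, which is why measures made by the researchers should not count as evidence.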

From this point, however, G&S are overly critical. First, they reject all studies in which the developer was one of the report authors. However, the U.S. Department of Education has been requiring third-party evaluations in its larger grants for more than a decade. This is true in IES, i3, and NSF (scale-up) grants, for example, and in grants from England's Education Endowment Foundation (EEF). A developer may be listed as an author, but it's been a long time since a developer could get his or her thumb on the scale in federally-funded research. Even studies funded by publishers almost universally use third-party evaluators.

G&S complain that 25 of the 27 studies evaluated programs in their first year of implementation, compromising fidelity. This is indeed a problem, but it can only bias outcomes downward: programs showing positive effects despite first-year growing pains may be particularly powerful.

G&S express concern that half of studies did not state what curriculum the control group was using. This would be nice to know, but does not invalidate a study.

G&S complain that in many cases the amount of instructional time for the experimental group was greater than that for the control group. This could be a problem, but given the findings of research on allocated time, it is unlikely that time alone makes much of a difference in math learning. It may be more sensible to see extra time as a question of cost-effectiveness. Did 30 extra minutes of math per day implementing Program X justify the costs of Program X, including the cost of adding the time? Future studies might evaluate the value added of 30 extra minutes of ordinary instruction, but does anyone expect this to have a large impact?

Finally, G&S complain that most curricula used in WWC-accepted RCTs are outdated. This could be a serious concern, especially as the Common Core and other college- and career-ready standards are adopted in most states. However, recall that at the time the RCTs were done, the experimental and control groups were subject to the same standards, so if the experimental group did better, the program is worth considering as an innovation. The reality is that any program in active dissemination must update its content to meet new standards. A program proven effective before the Common Core and then updated to align with Common Core standards is not thereby proven to improve Common Core outcomes, but it is a very good bet. A school or district considering adopting a given proven program might well check to see that it meets current standards, but it would be self-defeating and unnecessary to demand that every program re-prove its effectiveness every time standards change.

Randomized experiments in education are not perfect (neither are randomized experiments in medicine or other fields). However, they provide the best evidence science knows how to produce on the effectiveness of innovations. It is entirely legitimate to raise issues about RCTs, as Ginsburg & Smith do, but rejecting what we do know until perfection is achieved would cut off the best avenue we have for progress toward solid, scientifically defensible reform in our schools.