01/23/2015 03:27 pm ET Updated Mar 25, 2015

Randomized Controlled Trials: Powerful, But Only When Used Right

There has been a recent surge in the use of randomized controlled trials (RCTs) in the social sciences. With topics ranging from financial education and tax repayment to organ donation, studies where individuals are (often unknowingly) randomly allocated to different intervention conditions - and then compared to each other - have become increasingly popular. Many researchers, myself included, have heralded this movement as a much-needed development in social science research, as RCTs have the power to generate new insights, especially on the application side of the social sciences. However, the design of RCTs often prevents important questions from being answered while seeming to suggest that they are bringing us closer to the truth. Here, I want to highlight three elements of RCTs that often go unnoticed and that can be useful to consider when evaluating these types of studies (sparked by this recent NYT article): a need for more sophisticated control groups, more careful consideration of other affected behaviors, and better conceptualization of long-term effects.

Although RCTs first appeared in psychology, they were popularized by medical research and are today regarded as the "gold standard" for a clinical trial. Conceptually, an RCT is a study where people are randomly allocated to one or more intervention conditions. Thus, it is possible to compare the effectiveness of different interventions while minimizing the influence of both known and unknown factors that may affect the outcome under investigation. Recently, there has been a return to the use of RCTs in the social sciences. Much of this has been inspired by the rising influence of behavioral approaches to understanding societal issues. Books such as Poor Economics, Nudge, and The Why Axis demonstrate the tremendous potential of cheap and simple interventions to bring about scalable positive changes, often tested via RCTs. Governments around the world are also increasingly relying on the power of RCTs to make policy-relevant decisions. In fact, a former administrator of the White House Office of Information and Regulatory Affairs, Cass Sunstein, made pioneering attempts to introduce RCTs into policy decision-making in the US government (as described in his book Simpler). To date, over 240 RCTs have been conducted in education research alone, often with great success. For example, randomly assigning some high school students to receive automatic, personalized text messages that reminded them about their college application deadlines increased college enrollment to 70%, compared to 63% in a similar group of students randomly assigned to not receive the messages. This gain of seven percentage points is similar to that provided by scholarships, but at only a fraction of the price.
In another study, parents of middle and high school students received personalized text messages when their children did not hand in homework assignments. Homework completion increased by 25% relative to that of children whose parents did not receive such messages, and parents became twice as likely to reach out to their children's teachers. In light of such dramatic increases in important behaviors at such low costs, most - if not all - commentators view these types of studies as necessary and are calling for their wider use. I am fully with them, but we need a better understanding of what RCTs can and cannot do. We also need to be careful to design RCTs appropriately to fully address the question at hand. Let me focus on the latter point here: Do RCTs live up to the great expectations placed on them? And if not, as I will argue, how can we design them to be more powerful?

One of the key elements of an RCT is the comparison between an intervention group and a control group. That is, when researchers hypothesize that one group is going to benefit from a given intervention, they compare the behavior of that intervention group to that of a group that did not receive the intervention. Although this rationale carries intuitive appeal, it overlooks the fact that a difference from the no-intervention group does not necessarily suggest that the intervention was successful. Take the case of medical research, where the intervention group is often instead compared to a placebo (a sugar pill) because extensive research shows that merely giving a patient a pill can itself produce positive effects. In fact, much medical research goes a step further, additionally comparing an intervention condition to the best currently known intervention. If an intervention remains better than these two control conditions, then one can be much more confident in claiming that it truly is effective. A no-intervention condition is therefore not a good control condition on its own. In particular, to be more certain about the validity of an approach, the intervention condition should be compared to three adequate control conditions, consisting of a no-intervention group, a placebo group, and a best-currently-available-intervention group.
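The four-arm design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the arm names, the function names, and the simple "higher rate than every control" rule are hypothetical and not taken from any particular study, and a real trial would add significance tests and a pre-specified analysis plan.

```python
import random
from collections import defaultdict

# Hypothetical arm labels: three control conditions plus the new intervention.
ARMS = ["no_intervention", "placebo", "best_available", "new_intervention"]


def assign_arms(participant_ids, arms=ARMS, seed=0):
    """Randomly allocate participants to arms whose sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # After shuffling, deal participants out round-robin across the arms.
    return {pid: arms[i % len(arms)] for i, pid in enumerate(shuffled)}


def outcome_rates(assignment, outcomes):
    """Share of each arm that showed the target behavior.

    assignment maps participant id -> arm; outcomes maps participant id -> bool.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for pid, arm in assignment.items():
        totals[arm] += 1
        successes[arm] += int(outcomes[pid])
    return {arm: successes[arm] / totals[arm] for arm in totals}


def beats_all_controls(rates, treatment="new_intervention"):
    """True only if the treatment rate exceeds the rate in every control arm.

    This is a raw comparison of rates, not a statistical test.
    """
    return all(
        rates[treatment] > rate for arm, rate in rates.items() if arm != treatment
    )
```

The point of the last function is that beating the no-intervention arm alone is not enough: the intervention has to outperform the placebo and the best available alternative as well.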

When designing RCTs, researchers are often careful to make sure that they adequately measure the downstream behavior they aim to influence. For example, when designing a study to increase college enrollment, measure college enrollment; when designing a study to increase homework completion, measure homework completion. This idea mirrors the design of clinical trials, where the patients' outcome is evaluated using the same clinically relevant parameters that the chemical compound is hypothesized to affect. Unfortunately, social science research works differently: it requires a better understanding of the underlying psychology at play. As the previous examples highlight, social science RCTs often take the form of reminders that bring the focal element of interest to the front of one's mind. Researchers' (often implicit) theory is that making individuals pay more attention to something will in turn make them more likely to consider it important and act upon it. Attention, however, is limited, which implies that bringing something to the front of one's mind necessarily means devoting less attention to other thoughts, thus potentially reducing other behaviors. Sometimes, this can be good, such as when reminders bring college enrollment deadlines to the front of a student's mind at the cost of less important thoughts, such as "What am I going to wear to senior prom?" At other times, however, a reminder may prompt the devotion of less thought to other important elements, such as when high school students focus on completing their homework at the cost of thinking about college enrollment. Consequently, we need to think more about what other behaviors may be affected by an intervention and then determine the appropriate design to best measure the effects. Otherwise, by looking at just one isolated behavior, we may be missing the bigger picture.
A good RCT therefore must demonstrate an increase in the focal behavior, with no reduction in the frequency of other important behaviors.
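As a rough sketch of that criterion, the check below compares a treated group with a control group on the focal behavior and on every other measured behavior. The function name, the behavior names, and the tolerance parameter are hypothetical, and the rates in the usage example are made up purely for illustration; they are not results from any study.

```python
def evaluate_intervention(control, treated, focal, tolerance=0.0):
    """Check the focal gain and look for other behaviors that were crowded out.

    control and treated map behavior name -> rate observed in that group.
    tolerance is how large a drop in a non-focal behavior is still acceptable.
    """
    focal_gain = treated[focal] - control[focal]
    # Non-focal behaviors whose rate fell by more than the tolerance.
    crowded_out = {
        name: treated[name] - control[name]
        for name in control
        if name != focal and treated[name] - control[name] < -tolerance
    }
    return {
        "focal_gain": focal_gain,
        "crowded_out": crowded_out,
        "passes": focal_gain > 0 and not crowded_out,
    }


# Made-up illustrative rates: homework completion rises, but the share of
# students working on college applications falls.
control = {"homework_completion": 0.60, "college_application": 0.63}
treated = {"homework_completion": 0.75, "college_application": 0.58}
result = evaluate_intervention(control, treated, focal="homework_completion")
print(result["passes"], sorted(result["crowded_out"]))
```

In this invented case the intervention would fail the check despite the gain in homework completion, because a second important behavior declined. A study that measured only the focal behavior would never see that.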

RCTs need to have a defined study period in order to be designed and appropriately run. The duration of RCTs varies widely, with some trials lasting just a few weeks or months, whereas others take years. Researchers track the development of the behavior of interest over time to be able to make claims about the long-term effectiveness of the intervention they introduced. This is important, as an intervention is only really effective if its effects do not fade away over time. Unfortunately, many RCTs are too short to make adequate claims about the long term, mostly due to cost concerns. Imagine receiving a text every time your child's homework assignment is late; in the beginning, you may perceive the texts as a helpful notice, and they may prompt you to change your interaction with your child. But will those text messages continue to prompt conversations one year down the line? At what point will you ignore them? It is difficult to answer these questions with an RCT that does not last long enough, and even then, how long is long enough? Six months may be an adequate timeframe to evaluate the long-term effectiveness of receiving messaging about your children's late homework, but there seems to be a decrease in intervention effectiveness over time (see Figures 5-8, pp. 42-43). An even more important point is whether the effectiveness holds when the intervention is no longer administered. Does sending text messages change habits over the long term, such that parents will more frequently engage with their children? These questions could be answered by tracking former study participants over the long term - yet this is rarely done. Following the example of clinical trials, where it is common practice to check for long-term side effects, is warranted here.
Although cost concerns explain why RCTs are often too short, tracking participants after the intervention has finished is not that expensive, especially in light of the possible benefits such a design could provide.
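One simple way to express what such follow-up data would show is to ask how much of the initial effect survives, both at the end of the intervention and at the last follow-up after it has stopped. The sketch below assumes a hypothetical Wave record and a hypothetical retention function; the names and the measure itself are illustrative choices, not an established standard.

```python
from dataclasses import dataclass


@dataclass
class Wave:
    """One measurement point in a trial (a hypothetical structure)."""
    month: int
    intervention_active: bool
    control_rate: float
    treated_rate: float

    @property
    def effect(self) -> float:
        # Difference in the behavior's rate between treated and control groups.
        return self.treated_rate - self.control_rate


def retention(waves):
    """Share of the first-wave effect remaining at two later points.

    Returns the ratio at the last wave with the intervention still running
    and at the last wave after it ended (None if no such wave exists).
    """
    ordered = sorted(waves, key=lambda w: w.month)
    baseline = ordered[0].effect
    if baseline == 0:
        raise ValueError("no initial effect to compare against")
    active = [w for w in ordered if w.intervention_active]
    follow_up = [w for w in ordered if not w.intervention_active]
    return {
        "end_of_intervention": active[-1].effect / baseline if active else None,
        "last_follow_up": follow_up[-1].effect / baseline if follow_up else None,
    }
```

A value near 1 would mean the effect held up, while a value near 0 at the last follow-up would mean the behavior returned to the control level once the messages stopped. A trial with no follow-up waves simply cannot report the second number.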

RCTs are important. They are an outstanding tool for obtaining better insights. However, RCTs need to be designed better: we need more intervention arms, better control groups, better definition and measurement of other possibly affected behaviors, and longitudinal designs including follow-ups to evaluate long-term effectiveness. This is not optional; rather, it is mandatory when RCTs are being used to make policy-relevant decisions, as they increasingly are. Some organizations, such as ideas42, the Behavioral Insights Team, and J-PAL, are showing that this type of design is possible. Although it may take more effort to design RCTs and more money to run them, it is essential for RCTs to be designed better in order to live up to their promise as a powerful methodology for addressing many societal ills.

For more information, please contact the author at