04/02/2007 03:55 pm ET | Updated May 25, 2011

Correlation and its Discontents

A couple of commentators have expressed an interest in discussion of correlation vs. causation, so here goes.

The human brain is wired to see connections, to see correlations among events. In terms of the survival of the species, this is no doubt a good thing. Without such capability, we'd probably still be trying to figure out where babies come from or might have died out because we didn't notice the relationship between the presence of saber-toothed tigers and the disappearance of our buddies.

Sometimes, though, we see connections that aren't there: If I wear a certain color shirt my favorite basketball team will win the NCAA championship, for instance. B. F. Skinner famously showed that pigeons would develop "superstitious" behaviors because they saw a link between what they were doing and when food arrived. In reality, the food was delivered on a schedule independent of the pigeons' actions. I don't include any athletic "tics"--pounding home plate X number of times, not shaving during the World Series, etc.--because these actions might actually help an athlete get "set" for the game.

Social science research makes much use of a statistic called a "correlation coefficient," a number that describes the strength of the relationship between two variables. Perhaps the best known of these correlations is that between SAT scores and college freshman grade point average, about +.45.

What does ".45" mean? It means that the higher a student scores on the SAT, the higher the student will probably score in college courses (the real operative word, of course, is "probably"). It means that the relationship between the two variables is not zero, but that it is not perfect, either. If the relationship were perfect and positive, the coefficient would be +1.00 (if it were perfect and negative, it would be -1.00). In the perfect case, if I had only the SAT score, I could perfectly predict the freshman college grade point average (freshman is an operative word, too; the relationship drops each successive year). What too many people forget or ignore is that the SAT is only one factor in the admissions decision and at most schools not a large one. Brown University, for example, could admit two freshman classes just with students scoring between 750 and 800 (the highest possible score) on the SAT-Verbal. It admits only about one third of these high scorers.
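
For readers who want to see where a coefficient like .45 comes from, here is a minimal sketch in Python of the usual (Pearson) calculation. The function name `pearson_r` and all of the numbers are made up for illustration; they are not real SAT or grade data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together, and how much each varies alone.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return co_variation / math.sqrt(var_x * var_y)

# Hypothetical data, invented for this example only.
sat = [400, 500, 600, 700, 800]
perfect_gpa = [2.0, 2.5, 3.0, 3.5, 4.0]  # rises in lockstep with SAT
noisy_gpa = [3.1, 2.2, 3.6, 2.7, 3.4]    # loosely related to SAT

r_perfect = pearson_r(sat, perfect_gpa)                       # perfect positive
r_negative = pearson_r(sat, list(reversed(perfect_gpa)))      # perfect negative
r_noisy = pearson_r(sat, noisy_gpa)                           # in between

print(r_perfect, r_negative, round(r_noisy, 2))
```

The lockstep data give a coefficient at the +1.00 end, the reversed data give -1.00, and the noisy data land somewhere in between, which is where real-world relationships like SAT and grades live.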

We know that there is a correlation between poverty and success in school, between absenteeism and later dropping out, between teacher experience and student achievement. But to make any causal statements, we'd need to look, say, at how experienced teachers differ from novices.

The fact is, any two variables can be correlated. Whether the resulting coefficient is meaningful or not is another issue. I can correlate waist size or the distance between eyes with college grades. Meaningful? Likely not. Before everyone started wearing jeans, there was a correlation between the health of the economy and skirt length: Short skirts meant good times, long skirts indicated recessions. As far as I know, no one ever advocated raising hemlines as a means of re-establishing prosperity. But, given only a correlation between A & B, it makes as much sense to say A causes B as that B causes A (and, actually, the relationship between A & B might be determined by a third variable or it might just be a fluke).

To make a causal statement, additional information must be adduced. For many years the tobacco industry argued that the correlation between smoking and lung cancer was just that, a correlation. But 28 Surgeon General's reports, bringing in additional evidence about the life spans and cancer incidence of people who once smoked but stopped, led to an inescapable causal conclusion.

Back to the SAT and grades: How much importance should admissions officers attach to a correlation of .45? Not very much. The percent of variability in one variable (grade point) explained by another variable (SAT) is given by the square of the correlation coefficient (too statistically detailed to deal with here). The square of .45 is .2025. The SAT accounts for 20.25% of the variability in grades. That means roughly 80% of the variability comes from other sources--motivation, number of challenging courses taken in high school, percent of time spent playing bridge or dating or binge drinking, hours hitting the books, luck, etc.
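
The arithmetic in the paragraph above can be checked in a few lines of Python. This is only a sketch of the squaring step; the variable names are mine, and the .45 is the coefficient quoted in this piece.

```python
r = 0.45                      # SAT vs. freshman GPA correlation quoted above

explained = r ** 2            # share of grade variability associated with SAT
unexplained = 1 - explained   # share left to everything else

print(f"explained:   {explained:.2%}")
print(f"unexplained: {unexplained:.2%}")
```

Strictly speaking the leftover share is 79.75%, which rounds to the 80% figure used here.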

We've only touched on the correlation between two variables and are almost out of space, but you can use multiple variables to increase the accuracy of your predictions. For example, most colleges use a multiple correlation that combines high school grade point average, high school rank in class, and SAT to predict college grades. Some also rank the difficulty of an applicant's high school curriculum, adding a fourth variable into the prediction equation.
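
To give a feel for how a second variable can improve prediction, here is a rough Python sketch using the standard textbook formula for a multiple correlation with two predictors. Only the .45 comes from this piece; the other two correlations (high school GPA with college grades, and high school GPA with SAT) are hypothetical values chosen for illustration, as is the function name.

```python
import math

def multiple_r_squared(r_y1, r_y2, r_12):
    """Squared multiple correlation of an outcome with two predictors.

    r_y1, r_y2: correlation of each predictor with the outcome
    r_12:       correlation between the two predictors
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

r_sat = 0.45        # SAT vs. college grades (quoted in this piece)
r_hs_gpa = 0.50     # high school GPA vs. college grades (hypothetical)
r_sat_hs = 0.40     # SAT vs. high school GPA (hypothetical)

combined_r2 = multiple_r_squared(r_sat, r_hs_gpa, r_sat_hs)
combined_r = math.sqrt(combined_r2)

print(f"SAT alone explains      {r_sat ** 2:.1%}")
print(f"HS GPA alone explains   {r_hs_gpa ** 2:.1%}")
print(f"Both together explain   {combined_r2:.1%} (multiple R = {combined_r:.2f})")
```

With these made-up inputs, the two predictors together account for more of the variability than either does alone, but less than the sum of the two, because SAT scores and high school grades overlap in what they measure.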

Just watch out for causal statements made from correlations.