02/09/2007 11:22 pm ET | Updated May 25, 2011

The Insignificance of Significance

The January 2007 National School Boards Association document, "More Than a Horse Race: A Guide to International Comparisons," contains this sentence: "Throughout this guide the term significant differences always refers to a statistically significant difference meaning the differences between scores are meaningful and did not happen by chance" (emphasis in the original).

If I had any hair, I'd be pulling it out. The Guide doesn't get statistical significance. Many people don't. To render that sentence correct, we'd have to drop any contention about meaningfulness and state that the differences "likely did not happen by chance." A statement about statistical significance is always a statement of probability, not meaningfulness.

If, say, in an international reading comparison England scored 553 and the US scored 542, a test of statistical significance answers this question: How likely is it that a difference as large as the difference seen (in this case 11 points, 553-542) could have happened when there really is no difference between England and the US? Or, as statisticians like to say, when the two samples came from populations with the same means (at some point, the blog will have to deal with samples and populations, too).

If it's unlikely, then the difference is statistically significant. If it's not unlikely, then it's not significant. How unlikely does an outcome have to be to garner the label "significant?" Although cranking the numbers through some algorithm of a statistical procedure is a mechanical process, researchers can call anything they want significant. Certain conventions, though, prevail. Many researchers consider a difference significant if it would happen by chance less than 1 time in 20. I prefer less than 1 time in 100, or even less, but some researchers lately have been calling significant results that would happen by chance less than 1 time in 10.
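To make this concrete, here is a sketch of the kind of calculation such a test performs, using the England-US scores from above. The standard deviations (100) and sample sizes (3,000 per country) are invented for illustration; the published study would report its own.

```python
import math

def two_sample_z_test(mean1, mean2, sd1, sd2, n1, n2):
    """Two-sided p-value for the difference between two sample means
    (normal approximation)."""
    # Standard error of the difference between the two means
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se
    # Probability of seeing a difference at least this large if the
    # two populations really have the same mean
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# England 553 vs. US 542; SDs and sample sizes are hypothetical.
z, p = two_sample_z_test(553, 542, 100, 100, 3000, 3000)
print(f"z = {z:.2f}, p = {p:.5f}")
for alpha in (0.10, 0.05, 0.01):  # 1 in 10, 1 in 20, 1 in 100
    print(f"significant at 1 in {round(1/alpha)}? {p < alpha}")
```

Note that the output is only a probability: how often a gap this large would turn up by chance. The choice of threshold (1 in 10, 1 in 20, 1 in 100) is the researcher's convention, not the test's.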

So there is always some probability that the two samples came from populations with the same mean, and therefore always some possibility that when you say the groups are different, you are wrong.

The odds of finding a statistically significant result increase as the groups being compared get larger. Let's say we compare two beginning reading curricula and a reading test showed that those who learned with Curriculum A scored X points higher than those who learned with Curriculum B. A difference of X points is far more likely to reach significance if it comes from two samples of 50,000 students each than from two samples of 25 students each (the determination that A is better than B presumes the test represented both curricula evenly, that the teachers of A and B were equally effective, and that the two groups of students did not differ on any salient variables before we taught them how to read).
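The sample-size effect can be seen numerically. Every figure here is invented for illustration: a 5-point curriculum difference and a standard deviation of 100, tested at the two sample sizes mentioned above.

```python
import math

def p_value(diff, sd, n):
    """Two-sided p-value for a mean difference between two groups of
    size n, each with standard deviation sd (normal approximation)."""
    se = math.sqrt(2 * sd**2 / n)  # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))

# The identical 5-point difference, hypothetical SD of 100:
p_small = p_value(5, 100, 25)      # two classes of 25 students
p_large = p_value(5, 100, 50000)   # two samples of 50,000 students
print(f"n = 25:     p = {p_small:.3f}")   # nowhere near significant
print(f"n = 50,000: p = {p_large:.6f}")   # significant at any convention
```

The raw difference never changes; only the sample size does. With 25 students per group the 5-point gap is the kind of thing chance produces all the time; with 50,000 per group the same gap would essentially never arise by chance.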

We will be more confident with large samples, but many tests for statistical significance were designed for small samples. With large samples, as in international comparisons, tiny differences can be significant. That 11-point England-US difference is from a real study. It is statistically significant. But looked at in raw scores it means that English kids, on average, got about 2 items more correct than US kids. Again, a statement of statistical significance is a statement about odds. Nothing more, nothing less.
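One common way to look past the odds is an effect size, which expresses the gap in standard-deviation units rather than test points. The 11-point gap is from the study; the standard deviation of 100 is an assumed value for illustration.

```python
# Effect size (Cohen's d): mean difference divided by the standard deviation.
# The 11-point England-US gap is real; the SD of 100 is hypothetical.
diff = 553 - 542
sd = 100
d = diff / sd
print(f"effect size d = {d:.2f}")
```

An effect size around a tenth of a standard deviation is conventionally read as small, which squares with the observation that the "significant" 11-point gap amounts to about 2 items on the test.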

To call a result meaningful or of practical significance, we'd need to look beyond the tests of significance themselves. Suppose high school X's senior class always averaged a statistically significant 15 points higher on the SAT math than high school Y's senior class. Is that of any practical consequence? Well, if students from the two high schools attended similar colleges and X students always had higher college grade point averages, we might be inclined to say it's a meaningful difference and to look for a reason: could be better teachers at X, could be better curricula, could be X students just take more math courses in high school. Or X students' parents have more education themselves. Or some combination. Does the difference in college grades make any difference later on in life? Is X's superiority in math offset by Y's superiority somewhere else?

The issue isn't settled yet: What about the cost-benefit of doing something about the X-Y math gap? What would it cost Y to get its students up to or past X? Would Y have to pay more for teachers or pay for more teachers? Would Y have to give up some other cherished part of its curriculum (which is happening in some places these days in the reading-and-math-uber-alles world of No Child Left Behind--bye-bye music, arts, even P.E.)?

You can run experiments and program evaluations and look at whether or not the results are statistically significantly different. But in the end you have to make judgments about what to do, if anything, wherein statistics can help little, if at all. Alas, an erroneous definition of statistical significance is only one of many problems that afflict the NSBA Guide.