In a post last week that presented an automated survey of North Carolina voters, we described a three-point lead for John Edwards over Hillary Clinton (34% to 31%) among Democrats as "statistically insignificant" and said that a six-point advantage meant that Rudy Giuliani "runs ahead" of Newt Gingrich among Republicans (31% to 25%). But reader "Thomas" asked a good question:
When I look at the results for the Republican candidates, there's a 6-point gap between Giuliani and Gingrich. But the size of the sample is only 735. Do you think this gap between the two candidates is really statistically more significant than the gap between the two Democrats candidates? I'm especially concerned with the size of the samples, and the way the interviews were conducted (automatically).
Thomas' question gets at an important issue for pre-election polls: How do we know when a lead is really a lead?
Let's get to the heart of the matter: The PPP survey of North Carolina Republicans reported a "margin of error of +/- 3.6%." Presumably, Thomas doubled that margin (getting +/- 7.2%) and compared it to the 6-point margin separating Giuliani and Gingrich. That's the right instinct, because the reported "margin of error" applies to each percentage separately. If you apply the margin of error to each candidate's percentage, you get a pair of ranges that overlap: somewhere between 27.4% and 34.6% for Giuliani and between 21.4% and 28.6% for Gingrich. So how can that be a statistically meaningful lead?
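For readers who want to check the arithmetic, here is a minimal sketch of the per-candidate calculation. It assumes a simple random sample, the conventional 95% multiplier of 1.96, and the conservative 50% share that pollsters typically use when reporting a single margin of error. The candidate shares are taken as 31% and 25%, and the function names are ours, not PPP's.

```python
import math

def reported_margin(n, z=1.96):
    """Margin of error for one percentage, using the conservative p = 0.5."""
    return z * math.sqrt(0.25 / n)

def interval(share, moe):
    """Range implied by applying the margin of error to one candidate's share."""
    return share - moe, share + moe

def overlaps(a, b):
    """True if two (low, high) ranges share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

n = 735                      # Republican sample size cited by the reader
moe = reported_margin(n)     # about 0.036, i.e. +/- 3.6 points

giuliani = interval(0.31, moe)
gingrich = interval(0.25, moe)

print(f"Reported margin of error: +/- {moe:.1%}")
print(f"Giuliani: {giuliani[0]:.1%} to {giuliani[1]:.1%}")
print(f"Gingrich: {gingrich[0]:.1%} to {gingrich[1]:.1%}")
print("Ranges overlap:", overlaps(giuliani, gingrich))
```

The overlap is real, but as the rest of the post explains, overlapping ranges are not the right test for a lead.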
The issue gets a bit technical, but the bottom line is that the statistical formula for a confidence interval (the formal term for "margin of error") for the difference of two percentages from the same sample produces something slightly smaller than just doubling the reported margin of error. I'll let my colleague, Prof. Charles Franklin, explain:
While [doubling the margin of error] is the correct conclusion when there are only two possible survey responses, it is not correct when there are more than two possible responses, which is in fact virtually always the case. The difference between the "twice the margin of error" rule and the correct calculation for the confidence interval of a difference of multinomial proportions will depend on how large are the proportion of survey responses other than that of the top two candidates combined.
Franklin's paper** has the complete formula and more details for those interested (see also Kish, Survey Sampling, 1965, pp. 498-501), but the bottom line is that the margin of error for a difference of two percentages gets slightly smaller as the percentage falling into other categories (undecided or third candidates) gets larger. Franklin illustrates that point with the following graphic. The horizontal blue lines represent the reported margins of error (times two) for various sample sizes. The diagonal purple lines show how the margin of error for the difference of two percentages declines as the total of the percentages on which they are based ("p1 + p2") declines.
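To make the pattern concrete, here is a rough sketch using the standard textbook variance for the difference of two proportions drawn from the same multinomial sample, (p1 + p2 - (p1 - p2)^2) / n. We are assuming this matches the formula in Franklin's paper, which is not reproduced here. The sketch also assumes a simple random sample, and the function names are ours.

```python
import math

def doubled_margin(n, z=1.96):
    """The 'twice the margin of error' rule of thumb (conservative p = 0.5)."""
    return 2 * z * math.sqrt(0.25 / n)

def margin_of_difference(p1, p2, n, z=1.96):
    """Margin of error for p1 - p2 when both shares come from one sample.

    Var(p1 - p2) = [p1(1-p1) + p2(1-p2) + 2*p1*p2] / n
                 = [p1 + p2 - (p1 - p2)**2] / n
    """
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return z * math.sqrt(variance)

n = 735
print(f"Doubled margin: +/- {doubled_margin(n):.2%}")

# Hold the lead at six points and shrink the combined share p1 + p2.
for p1, p2 in [(0.53, 0.47), (0.43, 0.37), (0.31, 0.25), (0.23, 0.17)]:
    moe = margin_of_difference(p1, p2, n)
    print(f"p1 + p2 = {p1 + p2:.2f}: +/- {moe:.2%}")
```

With nearly everyone choosing one of the two candidates, the result is close to the doubled margin. As more respondents fall into other categories, it shrinks. For a 31% to 25% split with 735 respondents, this sketch gives roughly 5.4 points; small differences from published figures can come from rounding or finite-sample details.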
In this case, the margin of error for the 31% to 25% Giuliani lead is +/- 5.43 percentage points, which makes the six-point lead just barely significant. So what do we make of that? Thomas' question implies that we should be skeptical about "barely significant" differences given that, in this case, the survey was automated. Let's consider that.
First, we need to keep in mind that this sort of significance test only takes into account the purely random variation that comes from drawing a sample rather than interviewing the entire population. Other potential errors could come from low rates of coverage or response (provided that the missing respondents have different opinions than those interviewed) or from the wording of the questions or their order. Unfortunately, the "margin of error" as we know it is not a measure of total error. So while other sources of error may not alter that "statistical significance," the result might still be wrong. Poll consumers should keep that in mind.
Also, the error margins calculated above assume a "simple random sample," but most political polls involve some weighting and other minor deviations from pure random sampling, which increase the error margin slightly.
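One common way to gauge how much weighting inflates the error margin is Kish's approximate design effect, n * sum(w^2) / (sum(w))^2, with the margin of error multiplied by its square root. The sketch below uses that approximation with entirely made-up weights. The post does not report PPP's actual weights, so the numbers are illustrative only.

```python
import math

def kish_design_effect(weights):
    """Kish's approximation: n * sum(w^2) / (sum(w))^2. Equals 1 for equal weights."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2

def weighted_margin(srs_margin, design_effect):
    """Inflate a simple-random-sample margin by the square root of the design effect."""
    return srs_margin * math.sqrt(design_effect)

# Hypothetical weights for 735 respondents (NOT from the PPP survey).
weights = [0.8] * 300 + [1.0] * 300 + [1.5] * 135

deff = kish_design_effect(weights)
srs_margin = 1.96 * math.sqrt(0.25 / len(weights))

print(f"Design effect: {deff:.3f}")
print(f"Simple random sample margin: +/- {srs_margin:.2%}")
print(f"Margin after weighting:      +/- {weighted_margin(srs_margin, deff):.2%}")
```

With modest weights like these, the adjustment is small, which fits the "slightly" in the paragraph above. Heavier weighting would widen the margin more.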
Finally, keep in mind that the reported margin assumes a 95% level of confidence. That is, we are 95% certain a 31% to 25% lead on a simple random sample of 735 respondents did not occur by chance alone. But there is nothing magic about 95%; it is just the commonly accepted standard used by most public opinion pollsters. If we wanted to be 99% certain, that 6-point lead would just miss "statistical significance."
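Here is a quick sketch of how the verdict changes with the confidence level. It assumes a simple random sample and the textbook formula for the difference of two proportions from the same sample, which may differ in small details from the calculation in Franklin's paper. The function names are ours.

```python
import math
from statistics import NormalDist

def z_for(level):
    """Two-sided normal multiplier for a confidence level, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(0.5 + level / 2)

def margin_of_difference(p1, p2, n, level=0.95):
    """Margin of error for p1 - p2 when both shares come from the same sample."""
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return z_for(level) * math.sqrt(variance)

p1, p2, n = 0.31, 0.25, 735
lead = p1 - p2

for level in (0.90, 0.95, 0.99):
    moe = margin_of_difference(p1, p2, n, level)
    verdict = "significant" if lead > moe else "not significant"
    print(f"{level:.0%} confidence: +/- {moe:.2%} -> {verdict}")
```

Under these assumptions, the six-point lead clears the bar at 95% but falls short at 99%, consistent with the point above.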
All of which brings us to a lesson: As Professor Franklin likes to put it, we gain little by getting obsessed with "statistical significance," except when we are a few days before an election (and even then, it helps to look at many surveys, as we do here on Pollster, rather than just a few). For a survey like this one, the concept of statistical significance provides an objective check, but it is more of a guide than a source of absolute rules.
**Charles wanted to make a few small revisions to his paper, which we should have posted soon.
Follow Mark Blumenthal on Twitter: www.twitter.com/MysteryPollster