11/16/2006 05:46 pm ET | Updated May 25, 2011

A Surrender of Judgment? (Conclusion)

[This post concludes my comments started yesterday in response to a column by Washington Post polling director Jon Cohen.]

We chose to average poll results here on Pollster -- even for dissimilar surveys that might show "house effects" due to differences in methodology -- because we believed it would help lessen the confusion that results from polling's inherent variability. We had seen how the simple averaging used by the site RealClearPolitics had worked in 2004, particularly the way their final averages in battleground states proved to be a better indicator of the leader in each state than the leaked exit poll estimates that got everyone so excited on Election Day.

As Carl Bialik's "Numbers Guy" column on Wall Street Journal Online shows, that approach proved itself again this year:

Taking an average of the five most recent polls for a given state, regardless of the author -- a measure compiled by -- yielded a higher accuracy rate than most individual pollsters.

And in fairness, while I have not crunched the numbers that Bialik has, I am assuming that the RealClearPolitics averages performed similarly this year.

Readers have often suggested more elaborate or esoteric alternatives and we considered many. But given the constraints of time and budget and the need to automate the process of generating the charts, maps and tables, we ultimately opted to stick with a relatively simple approach.

Regardless, our approach reflected our judgment about how best to aggregate many different polls while also minimizing the potential shortcomings of averaging. The important statistical issues are fairly straightforward. If a set of polls uses an identical methodology, averaging those polls will effectively pool their sample sizes and reduce random error, assuming no trend changes attitudes over the time period in which those polls were fielded.
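To make the pooling point concrete, here is a minimal sketch of the textbook margin-of-error calculation. It assumes simple random samples, identical methodology and no trend during the field period, which are exactly the idealized conditions described above. The sample size of 600 and the function name `margin_of_error` are hypothetical, chosen only for illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical numbers: one poll of 600 interviews vs. five such polls pooled.
single_poll_n = 600
pooled_n = 5 * single_poll_n

single_moe = margin_of_error(single_poll_n)
pooled_moe = margin_of_error(pooled_n)

print(f"One poll (n={single_poll_n}): +/- {single_moe * 100:.1f} points")
print(f"Five polls pooled (n={pooled_n}): +/- {pooled_moe * 100:.1f} points")
```

Under these assumptions, pooling five equal-sized polls shrinks the random error by a factor of the square root of five, a little more than half. It does nothing about house effects, which are not random error.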

In reality, of course, all polls are different and those differences sometimes produce house effects in the results. In theory, if we knew for certain that Pollsters A, B, C and D always produce "good" and accurate results, and Pollster E always produces skewed or biased results, then an average of all five would be less accurate than looking at any of the first four alone. The problem is that things are rarely that simple or obvious in the real world. In practice, house effects are usually only evident in retrospect. And in most cases, it is not obvious either before or after the election whether a particular effect -- such as a consistently higher or lower percentage of undecided voters -- automatically qualifies as inherently "bad."

So one reason we opted to average five polls (rather than a smaller number) is that any one odd poll would have a relatively small contribution to the average. Also, looking at the pace of polling in 2002, five polls seemed to be the right number to assure a narrow range of field dates toward the end of the campaign.
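The effect of the window size on a single odd poll can be sketched as follows. This is an illustration only, not the code we used to build the charts: the function name `last_n_average`, the field dates and the margins are all hypothetical.

```python
from datetime import date
from statistics import mean

def last_n_average(polls, n=5):
    """Average the margins of the n most recently completed polls."""
    recent = sorted(polls, key=lambda p: p[0])[-n:]
    return mean(margin for _, margin in recent)

# Hypothetical polls: (last field date, Democratic margin in points).
polls = [
    (date(2006, 10, 20), 6.0),
    (date(2006, 10, 24), 5.0),
    (date(2006, 10, 27), 5.0),
    (date(2006, 10, 30), 4.0),
    (date(2006, 11, 1), 5.0),
    (date(2006, 11, 3), -5.0),  # one odd poll, roughly 10 points off the rest
]

avg_last_3 = last_n_average(polls, n=3)
avg_last_5 = last_n_average(polls, n=5)

print(f"Last 3 polls: {avg_last_3:+.1f}")
print(f"Last 5 polls: {avg_last_5:+.1f}")
```

With these made-up numbers, the odd poll drags a three-poll average much further from the other results than it drags a five-poll average, since each poll carries a weight of one over n. The trade-off is that a wider window reaches further back in time.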

We also decided from the beginning that the averages used to classify races (as toss-up, lean Democrat, etc.) would not include Internet surveys drawn from non-random panels. This judgment was based on our analysis of the Internet panel polls in 2004, which had shown a consistent statistical bias in favor of the Democrats. One consequence was that our averages excluded the surveys conducted by Polimetrix, our primary sponsor, a decision that did not exactly delight the folks who pay our bills and keep our site running smoothly. The fact that we made that call under those circumstances is one big reason why the "surrender judgment" comment irks me as much as it does.

Again, as many comments have already noted, we put a lot of effort into identifying and charting pollster house effects as they appeared in the data. On the Sunday before the election, we posted pollster comparison charts for every Senate race with at least 10 polls (22 in all). On that day, my blog post gave special attention to the fairly clear "house effect" involving SurveyUSA:

A good example is the Maryland Senate race (copied below). Note that the three automated polls by SurveyUSA have all shown the race virtually tied, while other polls (including the automated surveys from Rasmussen Reports) show a narrowing race, with Democrat Ben Cardin typically leading by roughly five percentage points.



Which brings me to Maryland. Jon Cohen is certainly right to point out that the Washington Post's survey ultimately provided a more accurate depiction of voters' likely preferences than the average of surveys released at about the same time. Democrat Ben Cardin won by ten percentage points (54% to 44%). The Post survey, conducted October 22-26, had Cardin ahead by 11 (54% to 43% with just 1% undecided and 1% choosing Green party candidate Kevin Zeese). Our final "last five poll average" had Cardin ahead by just three points (48.4% to 45.2%), a margin narrow enough to merit a "toss-up" rating.

So why was the average of all the polls less accurate than one poll by the Washington Post? Unfortunately, in this case, one contributing factor was the mechanism we used to calculate the averages. As it happened, two of the "last 5" polls came from SurveyUSA, whose polls showed a consistently closer race than any of the other surveys. Had we simply omitted the two SurveyUSA polls and averaged the other three, we would have shown Cardin leading by four points, enough to classify the race as "lean Democrat." Had we added in the two previous survey releases from the Baltimore Sun/Potomac Research and the Washington Post, the average would have shown Cardin leading by six.

Jon Cohen seems to imply that no one would have considered the Maryland races competitive had they adhered to polling's "gold standard.... interviewers making telephone calls to people randomly selected from a sample of a definable, reachable population." That standard would have omitted the Internet surveys, the automated surveys, and possibly the Baltimore Sun/Potomac Research poll (because it sampled from a list of registered voters rather than using a "random digit dial" sample). But it would have left the Mason-Dixon surveys standing, and they showed Cardin's lead narrowing to just three points (47% to 44%) just days before the election.

We are hoping to take a closer look at how the pollsters did in Maryland and across the country over the next month or so, especially in cases where the results differed from the final poll averages. I suspect that the story will have less to do with the methods of sampling or interviewing and more to do with classic questions of how hard to push uncertain voters and what it means to be "undecided" on the final survey.
