A week ago, I linked to two new pollster "report cards" prepared by SurveyUSA (one for all pollsters, one for the 14 most active this year), based on average accuracy scores for all pollsters that have released presidential primary surveys this year. I included a few paragraphs to try to add some perspective, both on these specific report cards and on the subject of measuring pollster accuracy in general. I did not intend to be dismissive of SurveyUSA's work or of their generally excellent performance this year and in prior years, although I can understand why some may have read it that way. Regardless, SurveyUSA's Jay Leve has posted a lengthy response worthy of further comment.
First, and most important, Leve's post highlights an error that I need to correct. I wrote:
SurveyUSA bases their ranking on one particular measure of polling error, which compares the margin between the percentages received by the first and second place finishers on election day to the margins as reported for the same two candidates on the final poll. There are other measures of poll error (SurveyUSA has posted a paper they authored that reviews eight such measures). Those critical of SurveyUSA will note that they typically report very small percentages for the "undecided" category, so they tend to do better on their measure of choice (Mosteller 5) which does not reallocate undecided voters [emphasis added].
The words in italics are not correct, at least according to the data that SurveyUSA includes on an interactive spreadsheet posted on their web site that summarizes head-to-head accuracy comparisons against other pollsters over the last five years. That spreadsheet shows that, if anything, the opposite is true: SurveyUSA tends to do a little worse relative to other pollsters on the Mosteller 5 measure than it does on other measures. I have corrected the original post, and I apologize for the error.
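The Mosteller 5 measure described in the quoted passage reduces to simple arithmetic: compare the poll's margin between the top two finishers to the margin in the official result. A minimal sketch, using invented numbers purely for illustration:

```python
def mosteller_5(poll_top, poll_second, actual_top, actual_second):
    """Mosteller 5: absolute difference between the final poll's margin
    for the top two finishers and the margin in the official result.
    Undecided voters are not reallocated; they simply drop out."""
    poll_margin = poll_top - poll_second
    actual_margin = actual_top - actual_second
    return abs(poll_margin - actual_margin)

# Hypothetical race: final poll shows 48-43 (a 5-point margin),
# the official result is 51-44 (a 7-point margin).
print(mosteller_5(48, 43, 51, 44))  # error score of 2
```

Note that because the measure works on margins, a poll can miss both candidates' absolute shares badly and still score well, so long as the gap between them is right.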
SurveyUSA is understandably sensitive to slights from the "traditional 'headset operator' telephone pollsters," who according to Leve, "have worked for 16 years to mock and marginalize the innovative work done by SurveyUSA." While there is some truth to that characterization, I hope readers will appreciate that I have not been among the "mockers." In fact, I took to the pages of Public Opinion Quarterly, the most respected journal of survey methodology, to advise that while "healthy skepticism is appropriate . . . a reflexive rejection of IVR as 'theoretically unsound' seems unwarranted." In the same article I quoted from a paper by an academic methodologist (Joel Bloom, now of SUNY-Albany), showing that SurveyUSA had "'performed at roughly the same level as other nonpartisan polling organizations in 2002,' though it did 'somewhat better' on 'most measures.'"
While it was unfair of me to imply that SurveyUSA "cherry-picked" (as Leve put it) a favorable measure for their 2008 report card, the issue of how the various measures of polling error handle the "undecided" category is important and may have implications for where some pollsters rank. That issue is the underlying theme of the paper on such measures that SurveyUSA linked to in their scorecard post. For the record, that paper makes the case that three other Mosteller measures (Mosteller 3, 4 and 6, but not Mosteller 5) should theoretically benefit a pollster with low undecided voters, and concludes by arguing for a new measure that "rewards the pollster whose estimate is not just the most precise, but whose numbers leave him/her the least amount of wiggle room." For their 2008 report card, however, SurveyUSA picked a measure that is typically tougher on them than the others available, and they deserve credit for that decision.
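To see concretely why the treatment of undecideds matters, compare a margin-based score (the Mosteller 5 idea) with a candidate-level score on raw shares. The sketch below uses invented numbers, and the "mean absolute error" measure is a generic illustration in the spirit of the candidate-level Mosteller measures, not any pollster's exact formula:

```python
actual = [52, 48]           # official two-candidate result
low_undecided = [49, 44]    # pollster who pushes leaners: 7% undecided
high_undecided = [40, 35]   # pollster with a large undecided pool: 25%

def margin(poll):
    return poll[0] - poll[1]

def mean_abs_error(poll):
    """Average absolute error on each candidate's raw share.
    Unlike a margin comparison, this penalizes a poll whose named-candidate
    shares are depressed by a big undecided pool."""
    return sum(abs(p - a) for p, a in zip(poll, actual)) / len(poll)

# Both polls show the same 5-point margin, so a margin-based measure
# scores them identically -- the undecideds simply drop out.
assert margin(low_undecided) == margin(high_undecided) == 5

# A candidate-level measure on raw shares favors the low-undecided poll:
print(mean_abs_error(low_undecided))   # 3.5
print(mean_abs_error(high_undecided))  # 12.5
```

This is the mechanism behind the claim that some candidate-level measures should theoretically benefit pollsters who report few undecideds, while a margin-based measure is largely indifferent to them.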
Aside from the issue of how to measure error, however, there are some additional issues still worth discussing. For example, Leve does step up and suggest at least one way to determine statistical significance from their error comparisons, but the approach is limited. I had raised the issue of how to identify "statistically meaningful" differences on a pollster scorecard because, to be perfectly honest, we have been discussing how best to create our own scorecard and provide appropriate guidance.
In his response, Leve points to their "Interactive Election Scorecard," a spreadsheet which (among other things) computes the odds of SurveyUSA besting their competitors over the five years of comparisons included therein. Unfortunately, the spreadsheet is not set up to allow for similar comparisons among other pollsters or (as far as I can tell) for comparisons filtered for individual election years. The 2008 report card tells us, for example, that Mason-Dixon has an average error score of 8.26 on 19 polls while ARG has a score of 8.50 on 20 polls. It tells us that SurveyUSA had an average error of 4.50 on 22 polls, while Gallup had an average error of 4.60 on 2 polls. Are those differences statistically meaningful? The point of these examples, by the way, is not to trash the SurveyUSA report card but to underscore that these are tricky questions.
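The report card publishes only mean error scores and poll counts, so any significance test has to assume a spread of per-poll errors. The sketch below applies Welch's t-statistic to the figures above, with assumed (not reported) standard deviations, purely to show why such small differences, and Gallup's two-poll sample in particular, are hard to call meaningful:

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t-statistic for the difference between two means,
    computed from summary statistics. The standard deviations here
    are assumptions; the report card does not report them."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se

# Mason-Dixon (8.26 on 19 polls) vs. ARG (8.50 on 20), assuming sd ~ 4:
t1 = welch_t(8.50, 4.0, 20, 8.26, 4.0, 19)

# Gallup (4.60 on 2 polls) vs. SurveyUSA (4.50 on 22), assuming sd ~ 3:
t2 = welch_t(4.60, 3.0, 2, 4.50, 3.0, 22)

print(round(t1, 2), round(t2, 2))  # roughly 0.19 and 0.05 -- nowhere
                                   # near the ~2 usually needed for
                                   # significance at the 95% level
```

Under these (admittedly invented) assumptions, neither difference comes close to statistical significance, which is exactly why a scorecard that ranks pollsters by raw averages needs some guidance about which gaps matter.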
The issue of timing -- which Leve promises to address in the future -- remains important. In my post, I wrote that the SurveyUSA report card is based:
[O]n the last poll conducted by each organization. Typically, surveys get more accurate as we get closer to election day, and the polls conducted a week or more before the election tend to be at a disadvantage when compared against those from organizations like SurveyUSA that typically continue to call right up until the night before the election. You can decide whether that issue is a "bug" in the report card or a critical "feature" in SurveyUSA's approach to pre-election polling.
I realize, in retrospect, that my argument and language were a little too glib. First, while polls generally tend to get more accurate as election day approaches, I do not know for certain that SurveyUSA has a meaningful advantage on these accuracy scores because they do more late polling. I can certainly think of specific races in which they have had such an advantage, but those are anecdotes. It remains an unresolved empirical question how much of SurveyUSA's relative accuracy comes from polling a bit later in the process than many of their competitors do.
Let's assume for the sake of argument that SurveyUSA tends to score higher on accuracy measures because they field more polls later. One conclusion would be that their methodology -- which involves very short questionnaires and the ability to make a lot of calls for less money within a short period of time -- allows their clients to do more polling later in the campaign. The net result is a more accurate depiction of the horse race in the final hours of the campaign. In other words, under this hypothetical, the difference amounts to a "feature" not a "bug."
At the same time, again to the extent that differences in "accuracy" depend on timing, it may not be fair to describe all of the pollsters that tend to stop earlier as relatively "inaccurate." In some cases, their surveys may have been equally accurate at the time, but received lower accuracy scores because of shifts in vote preference that occurred in the final week of the campaign. Keep in mind that different surveys are done for different purposes, and those purposes sometimes come with methodological trade-offs. If a media organization wants to measure opinions on a wide variety of attitudes beyond the basic horse race question (especially if those measurements involve open-ended questions), then an automated methodology makes less sense. Moreover, media organizations that sponsor more in-depth surveys typically want to gather their data sooner, to drive stories over the final week of the campaign, rather than waiting until election eve to release the data.
We need to understand that different polls are done for different purposes and a one-size-fits-all measure of accuracy may not make sense for all polls. Either way, this is certainly a topic wide open for further commentary, debate and, ideally, more empirical evidence.
Follow Mark Blumenthal on Twitter: www.twitter.com/MysteryPollster