I have been posting quite a bit lately on the subject of the transparency of Nate Silver's recently updated pollster ratings, so it was heartening to see his announcement yesterday that FiveThirtyEight has established a new process to allow pollsters to review their own polls in his database. That is a very positive step and we applaud him for it.
I haven't yet expressed much of an opinion on the ratings themselves or their methodology, and have hesitated to do so because I know some will see criticism from this corner as self-serving. Our site competes with FiveThirtyEight in some ways, and in unveiling these new ratings, Nate emphasized that "rating pollsters is at the core of FiveThirtyEight's mission, and forms the backbone of our forecasting models."
Pollster and FiveThirtyEight serve a similar mission, though we approach it differently: Helping those who follow political polls make sense of the sometimes conflicting or surprising results they produce. We are, in a sense, both participating in a similar conversation, a conversation in which, every day, someone asks some variant of the question, "Can I Trust This Poll?"
For Nate Silver and FiveThirtyEight, the answer to that question often flows from their ratings of pollster accuracy. During the 2008 campaign season, Nate leaned heavily on earlier versions of his ratings in posts that urged readers to pay less attention to some polls and more to others, with characterizations running the gamut from "pretty awful" or "distinctly poor" to the kind of pollster "I'd want with me on a desert island." He also built those ratings into his forecasting models, explaining to New York Magazine that other sites that average polls (among them RealClearPolitics and Pollster.com) "have the right idea, but they're not doing it quite the right way." The right way, as the article explained, was to average so that "the polls that were more accurate [would] count for more, while the bad polls would be discounted."
For better or worse, FiveThirtyEight's prominence makes these ratings central to our conversation about how to interpret and aggregate polls, and I have some serious concerns about the way these ratings are calculated and presented. Some commentary from our perspective is in order.
Let's start with what's good about the ratings.
First, most pollsters see value in broadly assessing poll accuracy. As the Pew Research Center's Scott Keeter has written (in a soon to be published chapter), "election polls provide a unique and highly visible validation of the accuracy of survey research," a "final exam" for pollsters that "rolls around every two or four years." And, while Keeter has used accuracy measurements to assess methodology, others have used accuracy scores to tout their organizations' successes, even if their claims sometimes depend on cherry-picked methods of scoring, cherry-picked polls or even a single poll. So Silver deserves credit for taking on the unforgiving task of scoring individual pollsters.
Second, by gathering pre-election poll results across many different types of elections over more than ten years, Silver has also created a very useful resource to help understand the strengths and weaknesses of pre-election polling. One of the most powerful examples is the table, reproduced below, that he included in his methodology review. It shows that poll errors are typically smallest for national presidential elections and get bigger (in ascending order) for polls on state-level presidential, senate, governor, and primary elections.
Third, I like the idea of trying to broaden the scoring of poll accuracy beyond the final poll conducted by each organization before an election. He includes all polls with a "median date" (at least halfway completed) within 21 days of the election. As he writes, we have seen some notable examples in recent years of pollsters whose numbers "bounce around a lot before 'magically' falling in line with the broad consensus of other pollsters." If we just score "the last poll," we create incentives for ethically challenged pollsters to try to game the scorecards.
Of course, Silver's solution creates a big new challenge of its own: How to score the accuracy of polls taken as many as three weeks before an election while not penalizing pollsters that are more active in races, like primaries, that are prone to huge late swings in vote preference. A pollster might provide a spot-on measurement of a late-breaking trend in a series of tracking polls, but only their final poll would be deemed "accurate."
Fourth, for better or worse, Silver has already done a service by significantly raising the profile of the Transparency Initiative of the American Association for Public Opinion Research (AAPOR). Much more on that subject below.
Finally, you simply have to give Nate credit both for the sheer chutzpah necessary to take on the Everest-like challenge of combining polls from so many different types of elections spanning so many years into a single scoring and ranking system. It's a daunting task.
A Reality Check
While the goals are laudable, I want to suggest a number of reasons to take the resulting scores, and especially the rankings of pollsters using those scores, with huge grains of salt.
First, as Silver himself warns, scoring the accuracy of pre-election polls has limited utility. They tell you something about whether pollsters "accurately [forecast] election outcomes, when they release polls into the public domain in the period immediately prior to an election." As such:
The ratings may not tell you very much about how accurate a pollster is when probing non-electoral public policy questions, in which case things like proper question wording and ordering become much more important. The ratings may not tell you very much about how accurate a pollster is far in advance of an election, when definitions of things like "likely voters" are much more ambiguous. And they may not tell you very much about how accurate the pollsters are when acting as internal pollsters on behalf of campaigns.
I would add at least one more: Given the importance of the likely voter models in determining the accuracy of pre-election polls, these ratings also tell you little about a pollster's ability to begin with a truly representative sample of all adults.
Second, even if you take the scores at face value, the final scores that Silver reports vary little from pollster to pollster. They provide little real differentiation among most of the pollsters on the list. What is the range of uncertainty, or if you will, the "margin of error" associated with the various scores? Silver told Markos Moulitsas that "the absolute difference in the pollster ratings is not very great. Most of the time, there is no difference at all."
Also, in response to my question on this subject, he advised that while "estimating the errors on the PIE [pollster-introduced error] terms is not quite as straightforward as it might seem," he assumes a margin of error "on the order of +/- .4" assuming a 95% confidence level. He adds:
We can say with a fair amount of confidence that the pollsters at the top dozen or so positions in the chart are skilled, and the bottom dozen or so are unskilled i.e. "bad". Beyond that, I don't think people should be sweating every detail down to the tenth-of-a-point level.
That information implies, as our commenter jme put it yesterday, that "his model is really only useful for classifying pollsters into three groups: Probably good, probably bad and everyone else." And that assumes that this confidence is based on an actual computation of standard errors for the PIE scores. Commenter Cato has doubts.
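To make that concrete, here is a small Python sketch (my own illustration, not Silver's computation) of what a margin of error of roughly +/- 0.4 at 95% confidence implies when comparing two pollsters' PIE scores:

```python
import math

# Silver suggests each PIE score carries a margin of error of
# roughly +/- 0.4 at 95% confidence. The independence assumption
# below is mine, for illustration only.
MOE = 0.4          # reported 95% margin of error on a single score
SE = MOE / 1.96    # implied standard error of a single score

def scores_distinguishable(pie_a, pie_b):
    """Return True if two PIE scores differ by more than the 95%
    margin of error on their difference (assuming independent errors)."""
    se_diff = math.sqrt(SE**2 + SE**2)
    return abs(pie_a - pie_b) > 1.96 * se_diff

# Two pollsters 0.3 apart cannot be told apart at this precision:
print(scores_distinguishable(-0.10, 0.20))   # False
# A much wider gap can be:
print(scores_distinguishable(-0.30, 0.40))   # True
```

Under these assumptions, two scores need to be more than roughly 0.57 points apart before the difference between them is statistically meaningful -- a gap wider than most of the spread in Silver's table.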
But aside from the mechanics, if all we can conclude is that Pollster A produces polls that are, on average, a point or two less variable than Pollster B, do these accuracy scores help us understand why, to pick a recent example, one poll shows a candidate leading by 21 points and another shows him leading by 8 points?
Third, even if you take the PIE scores at face value, I would quarrel with the notion that they reflect pollster "skill." This complaint has come up repeatedly in my conversations with survey methodologists over the last two weeks. For example, Courtney Kennedy, a senior methodologist at Abt SRBI, tells me via email that she finds the concept of skill "odd" in this context:
Pollsters demonstrate their "skill" through a set of design decisions (e.g., sample design, weighting) that, for the most part, are quantifiable and could theoretically be included in the model. He seems to use "skill" to refer to the net effect of all the variables that he doesn't have easy access to.
Brendan Nyhan, the University of Michigan academic who frequently cross-posts to this site, makes a similar point via email:
It's not necessarily true that the dummy variable for each firm (i.e. the "raw score") actually "reflects the pollster's skill" as Silver states. These estimates instead capture the expected difference in accuracy of that firm's polls controlling for other factors -- a difference that could be the result of a variety of factors other than skill. For instance, if certain pollsters tend to poll in races with well-known incumbents that are easier to poll, this could affect the expected accuracy of their polls even after adjusting for other factors. Without random assignment of pollsters to campaigns, it's important to be cautious in interpreting regression coefficients.
Fourth, there are good reasons to take the scores at something less than face value. They reflect the end product of a whole host of assumptions that Silver has made about how to measure error, and how to level the playing field and control for factors -- like type of election and timing -- that may give some pollsters an advantage. Small changes in those assumptions could alter the scores and rankings. For example, he could have used different measures of error (that make different assumptions about how to treat undecided voters), looked at different time intervals (Why 21 days? Why not 10? Or 30?), gathered polls for a different set of years or made different decisions about the functional form of his regression models and procedures. My point here is not to question the decisions he made, but to underscore that different decisions would likely produce different rankings.
Fifth, and most important, anyone that relies on Silver's PIE scores needs to understand the implications of his "regressing" the scores to "different means," a complex process that essentially gives bonus points to pollsters that are members of the National Council of Public Polls (NCPP) or that publicly endorsed AAPOR's Transparency Initiative prior to June 1, 2010. These bonus points, as you will see, do not level the playing field among pollsters. They do just the opposite.
In his methodological discussion, Silver explains that he combined NCPP membership and endorsement of the AAPOR initiative into a single variable and found, with "approximately" 95% confidence, "that the [accuracy] scores of polling firms which have made a public commitment to disclosure and transparency hold up better over time." In other words, the pollsters he flagged with an NCPP/AAPOR label appeared to be more accurate than the rest.
His PIE scores include a complex regressing-to-the-mean procedure that aims to minimize raw error scores that are randomly very low or very high for pollsters with relatively few polls in his database. And -- a very important point -- he says that the "principle purpose" of these scores is to weight pollsters higher or lower as part of FiveThirtyEight's electoral forecasting system.
So he has opted to adjust the PIE scores so that NCPP/AAPOR pollsters get more points for accuracy and others get less. The adjustment effectively reduces the PIE error scores by as much as a half point for pollsters in the NCPP/AAPOR category, and the pollsters with the fewest polls in his database get the biggest boost in their PIE scores. He also applies a similarly sized, analogous penalty to three firms that conduct surveys over the internet. He explains that his rationale is "not to evaluate how accurate a pollster has been in the past -- but rather, to anticipate how accurate it will be going forward."
Read that last sentence again, because it's important. He has adjusted the PIE scores that he uses to rank "pollster performance" based not only on each pollster's individual performance looking back, but also on his prediction of how it will perform going forward.
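To illustrate the mechanics of regressing to different means, here is a hedged Python sketch of generic shrinkage toward a group-specific mean. The formula, the numbers, and the `n_prior` constant are my own assumptions for demonstration; Silver's actual procedure is more involved:

```python
def regressed_pie(raw_score, n_polls, group_mean, n_prior=30):
    """Shrink a pollster's raw error score toward a group mean.
    The fewer polls a firm has, the more weight the group mean gets.

    NOT Silver's actual formula -- a generic shrinkage-to-the-mean
    sketch; n_prior is an assumed constant controlling the shrinkage.
    """
    w = n_polls / (n_polls + n_prior)
    return w * raw_score + (1 - w) * group_mean

# Two firms with identical raw scores and few polls, regressed
# toward different (hypothetical) group means. Lower PIE = less error.
ncpp_aapor = regressed_pie(0.60, n_polls=10, group_mean=-0.20)
other      = regressed_pie(0.60, n_polls=10, group_mean=0.30)
print(ncpp_aapor, other)   # 0.0 vs. 0.375
```

Under these made-up numbers, two firms with identical raw error scores end up nearly 0.4 points apart simply because of the mean each is shrunk toward -- roughly the size of the half-point adjustment described above.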
Regular readers will know that I am an active AAPOR member and strong booster of the initiative and of efforts to improve pollster disclosure generally. I believe that transparency may tell us something, indirectly, about survey quality. So I am intrigued by Silver's findings concerning the NCPP/AAPOR pollsters as a group, but I'm not a fan of the bonus/penalty point system he built into the ratings of individual pollsters. Let me show you why.
The following is a screen-shot of the table Silver provides that ranks all 262 pollsters, showing just the top 30. Keep in mind this is what his readers get when they click on the "Pollster Ratings" tab displayed prominently at the top of FiveThirtyEight.com:
The NCPP/AAPOR pollsters are denoted with a blue star. They dominate the top of the list, accounting for 23 of the top 30 pollsters.
But what would have happened had Silver awarded no bonus points? We don't know for certain, because he provided no PIE scores calculated any other way, but we did our best to replicate Silver's scoring method while recalculating the PIE score without any bonus or penalty points (regressing the scores to the single mean of 0.12). That table appears below.**
[I want to be clear that the following chart was not produced or endorsed by Nate Silver or FiveThirtyEight.com. We produced it for demonstration purposes only, although we tried to replicate his calculations as closely as we could. Also note that the "Flat PIE" scores do not reflect Pollster.com's assessment or ranking of pollster accuracy, and no one should cite them as such].
The top 30 look a lot different once we remove the bonus and penalty points. The number of NCPP/AAPOR designated pollsters in the top 30 drops from 23 to 7 (although the 7 that remain all fall within the top 13, something that may help explain the underlying NCPP/AAPOR effect that Silver reports). Those bumped from the top 30 often move far down the list. You can download our spreadsheet to see all the details, but nine pollsters awarded NCPP/AAPOR bonus points drop in the rankings by 100 or more places.
[In a guest post earlier today on Pollster.com, Monmouth University pollster Patrick Murray describes a very similar analysis he did using the same data. Murray regressed the PIE scores to a different single mean (0.50), yet describes a very similar shift in the rankings].
Now I want to make clear that I do not question Silver's motives in regressing to different means. I am certain he genuinely believes the NCPP/AAPOR adjustment will improve the accuracy of his election forecasts. If the adjustment only affected those forecasts -- his poll averages -- I probably would not comment. But they do more than that. His adjustments appear to significantly and dramatically alter rankings prominently promoted as "pollster ratings," ratings that are already having an impact on the reputations and livelihoods of individual pollsters.
That's a problem.
And it adjusts those ratings in a way that's not justified by his findings. Joining NCPP or endorsing the AAPOR initiative may be statistically related to other aspects of pollster philosophy or practice that made those pollsters more accurate in the past, but no one -- not even Nate Silver -- believes that a mere commitment made a few weeks ago to greater future transparency caused pollsters to be more accurate over the last ten years.
Yet in adjusting his scores as he does, Silver is increasing the accuracy ratings of some firms and penalizing others on those grounds, in a way that is also contrary to AAPOR's intentions. On May 14, when AAPOR's Peter Miller presented the initial list of organizations that had endorsed the transparency initiative, he specifically warned his audience that many organizations would soon be added to the list because "I have not been able to make contact with everyone" while others faced contractual prohibitions Miller believed could be changed over time. As such, he offered this explicit warning: "Don't make any inferences about blanks up here, [about] names you don't see on this list."*
And one more thought: If you look back at both tables above, you will notice that Strategic Vision, LLC -- the firm whose name Silver strikes out and marks with a black "x" because he concludes that its polling "was probably fake" -- cracks the top 30 "most accurate" pollsters (of 262) on both lists.
If a pollster can reach the 80th or 90th percentile for accuracy with made-up data, imagine how "accurate" a pollster can be by simply taking other pollsters' results into account when tweaking their likely voter model or weighting real data. As such, how useful are such ratings for assessing whether pollsters are really starting with representative samples of adults?
My bottom line: These sorts of pollster ratings and rankings are interesting, but they are of very limited utility in sorting out "good" pollsters from "bad."
**Silver has not, as far as I can tell, published the mean he would regress PIE to had he chosen to regress to a single mean. I arrived at 0.12 based on an explanation he provided to Doug Rivers of YouGov/Polimetrix (who is also the owner of Pollster.com) that Rivers subsequently shared with me: "the [group mean] figures are calibrated very slightly differently than the STATA output in order to ensure that the average adjscore -- weighted by the number of polls each firm has conducted -- is exactly zero." A "flat mean" of 0.12 creates a weighted average adjscore of zero. I emailed Silver this morning asking if he could confirm. As of this writing he has not responded.
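As a rough illustration of the calibration described in that footnote, this Python sketch (with invented firm data and an assumed shrinkage formula) solves for the single mean that drives the poll-weighted average adjscore to zero:

```python
# Hypothetical sketch: find the single mean m such that the
# poll-weighted average adjustment ("adjscore") across firms is
# exactly zero. The firm data and the shrinkage formula below are
# invented for illustration; they are not Silver's actual inputs.
firms = [
    # (raw_error_score, number_of_polls)
    (0.40, 50),
    (-0.10, 200),
    (0.25, 20),
]

def weighted_avg_adjscore(m, n_prior=30):
    """Poll-weighted average of (regressed score - raw score)."""
    total_polls = sum(n for _, n in firms)
    adj = 0.0
    for raw, n in firms:
        w = n / (n + n_prior)                # weight on the raw score
        regressed = w * raw + (1 - w) * m    # shrink toward m
        adj += n * (regressed - raw)         # adjustment for this firm
    return adj / total_polls

# The average adjustment is monotonic in m, so solve by bisection:
lo, hi = -1.0, 1.0
for _ in range(60):
    m = (lo + hi) / 2
    if weighted_avg_adjscore(m) > 0:
        hi = m
    else:
        lo = m

flat_mean = (lo + hi) / 2   # the "flat mean" for this toy data
```

The same zero-weighted-average criterion, applied to Silver's real data, is what would pin down a single figure like the 0.12 discussed above.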
*In the interest of truly full transparency, I should disclose that I suggested to Nate that he look at pollster accuracy among pollsters that had endorsed the AAPOR Transparency Initiative before he posted his ratings. He had originally found the apparent effect looking only at members of NCPP, and he sent an email to Jay Leve (of SurveyUSA), Gary Langer (polling director of ABC News) and me on June 1 to share the results and ask some additional questions, including: "Are there any variables similar to NCPP membership that I should consider instead, such as AAPOR membership?" AAPOR membership is problematic, since AAPOR is an organization of individuals and not firms, so I suggested he look at the Transparency Initiative list. In his first email, Silver also mentioned that, "the ratings for NCPP members will be regressed to a different mean than those for non-NCPP members." I will confess that at the time I had no idea what that meant, but in fairness, I certainly could have raised an objection then and did not.