"Can I trust this poll?" In Part I of this series I tried to present the growing clash between traditional polling methods and a new breed that breaks many of the old rules and makes answering this question difficult. In this post, I want to review the philosophies at work behind efforts to evaluate polls and offer a few suggestions about what we can do to assess whether poll samples are truly representative.
Those who assess polls and pollsters generally fall into two categories: those who check the methodology and those who check the results. Let's consider both.
Check the Methods - Most pollsters have been trained to assess polls by looking at the underlying methods, not the results they produce. The idea is that you do all you can to contact and interview a truly random sample, ask standardized, balanced, clearly-worded questions and then trust the results. Four years ago, my Hotline colleagues asked pollsters how they determine whether they have a good sample. The answer from Gary Langer, director of polling at ABC News, best captures this philosophy:
A good sample is determined not by what comes out of a survey but what goes into it: Rigorous methodology including carefully designed probability sampling, field work and tabulation procedures. If you've started worrying about a "good sample" at the end of the process, it's probably too late for you to have one.
A big practical challenge in applying this philosophy is that the definition of "rigorous methodology" can get very subjective. While many pollsters agree on general principles (described in more detail in Part I), we lack consensus on a specific set of best practices. Pollsters disagree, for example, about the process used to choose a respondent in sampled households. They disagree about how many times to dial before giving up on a phone number or about the ideal length of time a poll should be in the field. They disagree about when it's appropriate to sample from a list, about which weighting procedures are most appropriate, about whether automated interviewing methods are acceptable and more.
This lack of consensus has many sources: The need to adapt methods to unique situations, differing assessments of the tradeoffs between different potential sources of error and the usual tensions between the goals of cost and quality. Yet whatever the reason, these varying subjective judgments make it all but impossible to score polls using a set of objective criteria. All too often, methodological quality is in the eye of the beholder.
A bigger problem is that the underlying assumption -- that these rigorous, random-digit methods produce truly random probability samples -- is weakening. The unweighted samples obtained by national pollsters now routinely under-represent younger and non-white people while over-representing white and college-educated Americans. Of course, virtually all pollsters weight their completed samples demographically to correct these skews. Also, many pollsters are now using supplemental samples to interview Americans on their cell phones in order to improve coverage of the younger "cell phone only" population.
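To make the demographic weighting step concrete, here is a minimal sketch of simple cell weighting in Python. The age groups, sample counts and population shares are invented for illustration, and the function name `cell_weights` is hypothetical. Real pollsters typically weight on several variables at once, so treat this as a one-variable simplification rather than anyone's actual procedure.

```python
def cell_weights(sample_counts, population_shares):
    """Return one weight per cell so the weighted sample matches the population shares."""
    total = sum(sample_counts.values())
    weights = {}
    for cell, count in sample_counts.items():
        sample_share = count / total
        # Under-represented cells get weights above 1, over-represented cells below 1.
        weights[cell] = population_shares[cell] / sample_share
    return weights


# Hypothetical unweighted sample that under-represents the youngest group.
sample_counts = {"18-29": 100, "30-49": 300, "50-64": 300, "65+": 300}
# Hypothetical population targets (they sum to 1.0).
population_shares = {"18-29": 0.22, "30-49": 0.36, "50-64": 0.24, "65+": 0.18}

weights = cell_weights(sample_counts, population_shares)
weighted_total = sum(weights[cell] * sample_counts[cell] for cell in sample_counts)

for cell in sample_counts:
    print(cell, round(weights[cell], 2))
```

In this made-up sample each 18-29 respondent counts a bit more than twice, while each 65+ respondent counts for less than one. The weighted demographics match the targets by construction, which says nothing about whether the respondents within each cell resemble the people who were never reached.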
Most of the time, this approach appears to work. Pre-election polls continued to perform well during the 2008 general election, matching or exceeding their performance in 2004 and prior years. But how long will it be before the assumptions of what SurveyUSA's Jay Leve calls "barge in polling" give way to a world in which most Americans treat a ringing phone from an unknown number the way they treat spam email? And when that happens, how will we evaluate the newer forms of research?
Check the Results - When non-pollsters think about how to evaluate polls, their intuitive approach is different. They typically ask, well, how does the pollster compare in terms of accuracy? The popularity of Nate Silver and the pollster ratings he posted last year at FiveThirtyEight.com speaks to the desire of non-pollsters to reduce accuracy to a simple score.
Similarly, pollsters also understand the importance of the perceived accuracy of their pre-election poll estimates. "The performance of election polls," wrote Scott Keeter and his Pew Research Center colleagues earlier this year, "is no mere trophy for the polling community, for the credibility of the entire survey research profession depends to a great degree on how election polls match the objective standard of election outcomes."
So what's the problem in using accuracy scores to evaluate individual pollsters? Consider some important challenges. First, pollsters do not agree on the best way to score accuracy, with the core disagreement centering on how to treat the undecided percentage that appears nowhere on the ballot. And for good reason. Differences in scoring can produce very different pollster accuracy rankings.
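A toy example shows how the scoring choice can flip a ranking. The sketch below compares two hypothetical polls against a hypothetical result using two scoring rules that I am assuming for illustration: error on the margin, which ignores undecideds, and average error on each candidate's share, which in effect penalizes a poll for reporting more undecideds. The numbers and function names are invented and do not correspond to any real pollster or published scoring method.

```python
def margin_error(poll, result):
    """Absolute error on the lead (first candidate minus second)."""
    return abs((poll[0] - poll[1]) - (result[0] - result[1]))


def share_error(poll, result):
    """Mean absolute error on each candidate's share, undecideds left unallocated."""
    return (abs(poll[0] - result[0]) + abs(poll[1] - result[1])) / 2


# Hypothetical final result and two hypothetical final polls (percentages).
result = (53, 46)
polls = {
    "Poll A": (48, 44),  # 8 percent undecided or other
    "Poll B": (51, 48),  # 1 percent undecided or other
}


def rank(score):
    """Poll names ordered from most to least accurate under the given score."""
    return sorted(polls, key=lambda name: score(polls[name], result))


rank_by_margin = rank(margin_error)
rank_by_share = rank(share_error)

print("By margin:", rank_by_margin)
print("By share: ", rank_by_share)
```

Poll A comes closer on the margin (off by 3 points versus 4), while Poll B comes closer on the individual candidate shares (off by 2 points on average versus 3.5), so each poll "wins" under one of the two rules.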
Second, the usual random variation in individual poll results due to simple sampling error gives especially prolific pollsters -- those active in many contests -- an advantage in the aggregate scores over those that poll in relatively few contests. Comparisons for individual pollsters get dicey when the number of polls used to compute the score gets low.
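A small simulation can illustrate why scores based on a handful of polls are so unstable. The sketch below assumes every pollster is equally good: each poll is unbiased, uses a simple random sample of 600, and misses only because of sampling error, which is approximated with a normal distribution. The sample size, the poll counts (3 versus 30) and the function name `average_error` are all assumptions made for illustration.

```python
import math
import random
import statistics


def average_error(n_polls, rng, sample_size=600, true_share=0.5):
    """Mean absolute error, in points, over n_polls unbiased polls."""
    # Standard error of a percentage from a simple random sample.
    se = 100 * math.sqrt(true_share * (1 - true_share) / sample_size)
    return statistics.mean(abs(rng.gauss(0, se)) for _ in range(n_polls))


rng = random.Random(2008)  # fixed seed so the run is repeatable

# Score 2,000 imaginary pollsters of each type.
few = [average_error(3, rng) for _ in range(2000)]
many = [average_error(30, rng) for _ in range(2000)]

spread_few = statistics.stdev(few)
spread_many = statistics.stdev(many)

print("Spread of scores, 3 polls each: ", round(spread_few, 2))
print("Spread of scores, 30 polls each:", round(spread_many, 2))
```

Although every simulated pollster has identical underlying accuracy, the scores built from three polls scatter far more widely than those built from thirty, so a low-volume pollster can land near the top or the bottom of a ranking by luck alone.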
Third, and probably most important, scoring the accuracy this way tells us about only one particular measure (the vote preference question) on one type of survey (pre-election) at one point in the campaign (usually the final week). Consider the chart below (via our colleague Charles Franklin). It plots the Obama-minus-McCain margin on roughly 350 surveys that tracked national popular vote between June and November, 2008. An assessment of pollster error would consider only the final 20 or so surveys -- the points plotted in red.
Notice how the spread of results (and the frequency of outliers) is much greater from June to October than in the final week (the standard deviation of the residuals, a measurement of the spread of points around the trend line, falls from 2.79 for the grey points from June to October to 1.77 for the last 20 polls in red). Our colleague David Moore has speculated about some of the reasons for what he dubs "the convergence mystery" (here and here; I added my own thoughts here with a related post here). But whatever you might conclude about the reasons for this phenomenon, something about either voter attitudes or pollster methods was clearly different in the final week before the 2008 election. Assuming, as many pollsters do, that this phenomenon was not unique to 2008, how useful are the points in red from any prior election in helping us assess the "accuracy" of the grey points for the next one?
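For readers unfamiliar with the statistic quoted above, the standard deviation of residuals around a trend line can be sketched as follows. This is only an illustration: it fits a straight least-squares line, which may well differ from the trend estimate used for the chart, and the poll margins are made up rather than taken from the 2008 data.

```python
import statistics


def residual_sd(days, margins):
    """Standard deviation of poll margins around a least-squares straight line."""
    mean_x = statistics.mean(days)
    mean_y = statistics.mean(margins)
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, margins))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed margin minus the trend line's value on that day.
    residuals = [y - (intercept + slope * x) for x, y in zip(days, margins)]
    return statistics.pstdev(residuals)


# Hypothetical margins (in points) from earlier in a campaign: widely scattered.
early_days = [0, 10, 20, 30, 40, 50]
early_margins = [2, 9, 1, 8, 3, 7]

# Hypothetical margins from the final week: tightly clustered.
final_days = [0, 1, 2, 3, 4, 5]
final_margins = [6, 8, 7, 7, 8, 6]

early_spread = residual_sd(early_days, early_margins)
final_spread = residual_sd(final_days, final_margins)

print("Early spread:", round(early_spread, 2))
print("Final spread:", round(final_spread, 2))
```

A smaller value means the polls sit closer to the trend, which is the sense in which the final-week polls "converge."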
So what do we do? How can we evaluate new polling results when we see them?
The key issue here is, in a way, about faith. Not religious faith per se, but faith in random sampling. If we have a true random probability sample, we can have a high degree of faith that the poll is representative of the larger population. That fundamental philosophy guides most pollsters. The problem for telephone polling today is that many of the assumptions of true probability sampling are breaking down. That change does not mean that polls are suddenly non-representative, but it does make for a much greater potential than 10 or 20 years ago for skewed, flukey samples.
What we need is some way to assess whether poll samples are truly representative of a larger population that does not rely entirely on faith that "rigorous" methods are in place to make it so. I will grant that this is a very big challenge, one for which I do not have easy answers, especially for the random digit dial (RDD) samples of adults typically used for national polls. Since most pollsters already weight adult samples by demographics, their weighted demographic distributions are already representative. But what about other variables like political knowledge, interest or ideology? Again, I lack easy answers, though perhaps as the quality of voter lists improves in the future, we may get better "auxiliary data" to help identify and correct non-response bias. But for now, our options for validating samples are very limited.
When it comes to "likely voter" samples, however, pollsters can do far better at informing us about who these polls represent. As we have reported here and especially on my old Mystery Pollster blog over the years, there are almost as many definitions of likely voters as there are pollsters. Some use screen questions to identify the likely electorate; some use multiple questions to build indexes that either select likely voters or weight respondents based on their probability of voting. The questions used for this purpose can be about intent to vote, past voting, political interest or knowledge of voting procedures. Some select likely voters using registered voter lists and actual turnout records for the individuals selected from those lists. So simply knowing that the pollster has interviewed 600 or 1,000 "likely voters" is not very informative.
The importance of likely voters around elections is obvious, but it is less apparent that many public polls of "likely voters" routinely report on a wide variety of policy issues even in non-election years. These include the polls from Rasmussen Reports, NPR, George Washington University/Battleground and Democracy Corps. What is a "likely voter" in an odd-numbered year? Those who voted or tend to vote in higher turnout presidential elections? Those who intend to vote in non-presidential elections? Something else?
One thing I have learned from five years of blogging on this topic is that some pollsters consider their likely voter methods proprietary and fiercely resist disclosure of the details. Some will disagree, but I think there are some characteristics that can be disclosed, much like food ingredients, without giving away the pollster's "secret sauce." These could include the following:
- In general terms, how are likely voters chosen - by screening? Index cut-off models? Weights? Voter file/vote history selection?
- What percentage of the adult population does the likely voter sample represent?
- If questions were used to screen respondents or build an index, what is the text of the questions asked?
- If voter lists were used, what sort of vote history (in general terms if necessary) defined the likely voters?
- Perhaps most important, what is the demographic and attitudinal (party, ideology) profile -- weighted and unweighted -- of the likely voter universe?
- Access to cross-tabulations, especially by party identification.
Regular readers will know that better disclosure of these details is a topic I return to often, but will also remember that obtaining consistent disclosure of such details can be difficult to impossible, depending on the pollster.
How can we help motivate pollsters to disclose more about their methods? I have an idea that I will explain in the third and final installment of this series.
Update: continue reading Part III.
[Note: I will be participating in a panel on Thursday at this week's Netroots Nation conference on "How to Get the Most Out of Polling." This series of posts previews the thoughts I am hoping to summarize on Thursday.]
Follow Mark Blumenthal on Twitter: www.twitter.com/MysteryPollster