08/16/2013 06:52 pm ET | Updated Aug 16, 2013

HUFFPOLLSTER: Can Twitter Predict Elections? Not So Fast


That study on Tweets about congressional candidates wasn't quite as accurate in predicting winners as you've been led to believe. But it was a good week for pollsters (in New Jersey). This is HuffPollster for Friday, August 16, 2013.

TWITTER AND ELECTION FORECASTING - NEW QUESTIONS RAISED - HuffPollster has devoted considerable attention this week to a provocative op-ed by Fabio Rojas, an associate professor of sociology at Indiana University, claiming that "Twitter discussions are an unusually good predictor of U.S. House elections." So good, Rojas claimed, that he and his colleagues were able to predict "404 out of 406 competitive races." At least that's what the original version of the op-ed said, which would have been an amazingly impressive feat, given that most of the conventional forecasting models missed the predicted number of Democratic seats by 38 or more seats. Thus, Rojas concluded, "this new world" of social media "will undermine the polling industry...Nearly every serious political campaign in the United States spends thousands, even millions, of dollars hiring campaign consultants who conduct these polls and interpret the results." Digital democracy will put these campaign professionals out of work. [Rojas WaPost Op-Ed, original version quoted by Forbes, Stochastic Democracy on 2010 forecasting models]

Oops. Not That Accurate - However, the Rojas Washington Post op-ed has since been corrected to say that the "Twitter data predicted the winner in 404 out of 435 competitive race." Thus, according to the correction, "In the 2010 data, the analysis predicted the winner 92.8 percent of the time." Rojas adds further explanation in a blog post published on Friday: "For the purposes of presenting the research to the public, we computed the rate of correct predictions (within the data), which was about 92.5%. I then multiplied this by all races (435). Therefore, the extrapolated number of correctly predicted races is 404 out of 435. If we use only the contested race subsample, we get 375 races out of 406 contested races. This is a correction of what I wrote in the op-ed, which accidentally combined these two estimates. The op-ed now contains the correction." But the mistake is significant, since it means the model Rojas refers to miscalled 31 districts rather than 2, as he originally implied. [ibid, Orgtheory.net]
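To see how much the correction changes the picture, the three sets of figures quoted above can be put side by side. The following is a minimal sketch that uses only the counts reported in the op-ed and in Rojas's blog post; the function name `accuracy` and the labels are illustrative and do not come from the paper.

```python
def accuracy(correct, total):
    """Return (share predicted correctly, number of races missed)."""
    return correct / total, total - correct

# Counts as quoted in the op-ed (original and corrected) and in the blog post.
claims = {
    "original op-ed (404 of 406)": (404, 406),
    "corrected, all races (404 of 435)": (404, 435),
    "contested subsample (375 of 406)": (375, 406),
}

for label, (correct, total) in claims.items():
    rate, missed = accuracy(correct, total)
    print(f"{label}: {rate:.1%} correct, {missed} missed")
```

Either corrected version implies 31 missed races, against the 2 implied by the original wording.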

Isn't 92.5 (or 92.8) percent very accurate? - Well, it might be, except, as Stu Rothenberg has noted, 406 House races weren't "competitive" in 2010, much less all 435: "Most races aren’t real competitions, of course. Relatively few House challengers run robust campaigns, and voters generally are unfamiliar with challengers. Since House re-election rates have been over 90 percent in 19 of the past 23 elections, you don’t need polls or tweet counts to predict the overwhelming majority of race outcomes. In most cases, all you need to know is incumbency (or the district’s political bent) and the candidates’ parties to predict who will win." More incumbents were defeated in 2010 than usual, but even then 85 percent of incumbents who ran won reelection. [Rothenberg Political]

The original study differs from the op-ed - The original paper, written by Rojas along with Joseph DiGrazia, Karissa McKelvey and Johan Bollen, includes no calculation of the rate of correct predictions. It simply finds that "the percentage of Republican-candidate name mentions correlates with the Republican vote margin in the subsequent election," and that this finding "persists even when controlling for incumbency, district partisanship, media coverage of the race, time, and demographic variables such as the district’s racial and gender composition." The paper also makes no extravagant predictions about a coming demise of pre-election polling. It merely concludes that "social media may very well provide a valid indicator of the American electorate," essentially the argument advanced by Republican pollster Alex Lundry in Thursday's HuffPollster. [SSRN, HuffPollster]

...And the 92.8 percent correct claim is misleading - The fact that the original study was based on two models raises an important question. Was the supposed 92.8 percent accuracy based on a "bivariate" model based on tweets alone or on a more complex "full model" that included "incumbency and district partisanship," two well-known and powerful predictors of U.S. House race outcomes? If you're guessing the latter, you guessed right. Indiana University PhD student Joe DiGrazia, the primary author on the original paper, confirms via email that "the full model predicted about 92.8% of races correctly (which is what Fabio refers to in his op-ed), the bivariate [model] predicts about 72% correctly. Though, most of the races it misses are very close races...Again, though, I want to reiterate that we do not perform an analysis of 'correct' or 'incorrect' predictions in the paper. The paper is just about establishing linear trends." The point of the paper is that Tweet counts help explain some meaningful variation in the total vote for Congress, even after controlling for things like incumbency and district partisanship that are known to be strong predictors. But make no mistake: The model based on tweets alone would have predicted the wrong winner in roughly 111 U.S. House races in 2010, more than the number of races considered truly "competitive" that year.
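The gap between the two models is easier to see as a count of missed races. The sketch below is a rough back-of-the-envelope check and not the authors' calculation: it assumes the 406-race contested subsample as the base, which DiGrazia's email does not specify, and the helper name `implied_misses` is hypothetical.

```python
CONTESTED_RACES = 406  # assumed base; the email quoted above does not state one

def implied_misses(rate, races=CONTESTED_RACES):
    """Number of wrong calls implied by an accuracy rate, rounded to whole races."""
    return round((1 - rate) * races)

print(implied_misses(0.928))  # full model, about 92.8% correct -> 29
print(implied_misses(0.72))   # tweets-only model, about 72% correct -> 114
```

Because the tweets-only rate is quoted only as "about 72%," the result lands in the same range as the roughly 111 misses cited above and should not be expected to match it exactly.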

Tweets mattered, but not much - Rob Santos, current president of the American Association for Public Opinion Research (AAPOR), in a "rebuttal" to Rojas published on Friday: "I reviewed the research paper that started this craze, entitled 'More Tweets, More Votes: Social Media as a Quantitative Indicator of Political Behavior.' Tweets contribute relatively little to the outcome (Republican vote margin) when you include the “big boys” of election prediction – incumbency status and district partisanship. For instance, being an incumbent predicts almost a 50,000 vote contribution to the Republican margin in their statistical model, whereas receiving 100 percent (all!) of tweet-mentions gets you only 155 votes. Moreover, the paper’s “full” prediction model may be almost as accurate when you exclude the tweet share and just rely on traditional variables. Unfortunately, the authors failed to include that insightful model in their paper." [WaPost]

Fabio Rojas defends the theory behind his results (and perhaps his op-ed too?) - Rojas, appearing earlier on Friday on MSNBC's Daily Rundown: "That's the theory we're using to explain the results, which is that all publicity or most publicity is usually good publicity...if you are in a race, and you're generating buzz then people are going to talk about you whether they like it or not, and the buzz is an indicator that you are picking up support, that you are on the verge of victory, because people in general don't like talking about losers." [MSNBC]

MEANWHILE, IN NEW JERSEY, POLLING GOT IT RIGHT - Quinnipiac, in a release: “The Quinnipiac University poll accurately measured Newark Mayor Cory Booker’s margin of victory in New Jersey’s U.S. Senate Democratic primary. An August 7 survey of likely Democratic primary voters by the independent Quinnipiac University showed Booker with 54 percent, followed by U.S. Rep. Frank Pallone at 17 percent, U.S. Rep. Rush Holt at 15 percent and State Assembly Speaker Sheila Oliver at 5 percent, with 8 percent undecided. That survey gave Booker a margin over Pallone of 37 percentage points. In the most updated poll returns, Booker topped Pallone 59 – 20 percent, a winning margin of 39 percentage points. Holt had 17 percent, with 4 percent for Oliver.” While Quinnipiac fielded the only polling in the week before the primary, a July 11-14 Monmouth poll also gave Booker a 37-point margin of victory. [HuffPollster chart]

Tarrance Group’s Logan Dobson notes the degree of difficulty - Tweeting on Tuesday before votes were counted: “Polling is hard. Off year polling is harder. Off-off year polling is harder. Off-off year PRIMARY polling is...look, you get it.” [@LoganDobson]

IS OBAMA LOSING YOUNG VOTERS? - Harry Enten: “Compared to his average approval a month before the election, Obama's approval rating since the beginning of July has dropped 9.3pt, to 52.3%, among 18- to 29-year-olds. His approval among 30- to 49-year-olds has dipped by 5.7pt, to 45.5%. His approval among those older than 50 has stayed relatively stable comparatively. He's only down to 1.5pt among 50- to 64-year-olds, and 2.7pt among those older than 65. Indeed, his approval with 50- to 64-year-olds has actually been 0.2pt higher than among 30- to 49-year-olds....But why are young Americans rejecting Obama? I can think of possible sources. First, it may be because of the NSA spying revelations. As I noted back when the story initially broke in June, it was younger voters who were most likely to say they wouldn't support President Obama because of what they'd learned. Second, young people are hit hardest by an economy that many Americans still think is weak – and the youth are at least partly blaming Obama.” [Guardian]

AMERICANS JUDGE USE OF FOOD STAMPS - Emily Swanson and Arthur Delaney: “More Americans are annoyed by the idea of food stamp recipients using their benefits to buy expensive food than their using them to buy junk food, according to a new HuffPost/YouGov poll. According to the survey, 54 percent of Americans think people should not be allowed to use food stamps to buy expensive items such as crab legs, while only 32 percent said that they should be allowed to do so.” [HuffPost]

FRIDAY'S 'OUTLIERS' - Links to more news at the intersection of polling, politics and political data:

-Fifteen percent of U.S. working women say they have been denied a promotion because of their gender. [Gallup]

-Andrew Kohut notes that among Republicans, the most popular potential 2016 candidates are also the youngest. [Pew Research]

-The Atlantic compiles three examples of question wording making a difference. [Atlantic via @FactTank]

-Wonkblog’s Dylan Matthews tries his hand at an empirically-based advice column. [WaPost]

-The Gilmore Girls’ unrealistic eating habits get the infographic treatment. [HuffPost]

CORRECTION: An earlier version of this post incorrectly identified Joe DiGrazia as a professor. He is a PhD student.