Alex Lundry is Vice President and Director of Research for TargetPoint Consulting, a conservative political polling, microtargeting, and knowledge management firm. You can connect with him on Twitter where he expresses his opinions with great clarity so as to avoid confounding CMU's sentiment analysis.
Researchers at Carnegie Mellon have shown that unstructured text data pulled from Twitter can in some instances be used as a reliable substitute for opinion polling (link to study PDF). The results are impressive, and though pollsters needn't start looking for another line of work, I think they ignore this study at their peril.
Using very simple tweet selection mechanisms along with measures of the tweet's sentiment ("Obama's awesome" = approve, "Obama sucks" = disapprove), these researchers were able to:
However, the same methodology failed miserably when it came to the 2008 presidential horse race obtaining a correlation of r=-8% with Obama's level of support in the Gallup tracker.
It seems then that aggregate Twitter sentiment shows great promise as a polling substitute for high volume and relatively binary opinions and attitudes: are you hot or cold on the economy, do you like or dislike the President? But the polynomial nature of items like a campaign horserace or the health care debate makes it difficult to extract meaningful opinions amid a crush of unstructured data.
Yet this is no reason for pollsters to shrug away these results. There is great predictive power hidden away inside this sort of latent data just waiting for the extraction of opinions, attitudes and trends in voter sentiment. Pollsters would be wise to begin incorporating these data into their work: analyzing Google Trends search data, counting Facebook friends, YouTube views and web traffic, or simply doing more with the rich verbatim data we typically capture in our surveys and focus groups. (And it's not just politics where this is applicable; tweet volume and sentiment have also been shown to be an incredibly accurate predictor of a movie's box office returns).
This study also highlights a debate the polling community must have sooner or later: can the shortcomings of dirty data be overcome by a mix of sheer volume, sound data preparation/manipulation and savvy analysis? In this new era of IVR, online panels, social media and big data, the answer is increasingly pointing to yes - especially when you consider the advantages of speed, cost and access that these non-traditional data collection methods enjoy.
Finally, it's worth taking a moment to consider just how stunningly impressive these results are. What level of precision might there have been with a more sophisticated methodology? Tweets were selected for study based merely upon the presence of a single word - imagine the accuracy if selection allowed for the use of synonyms, alternate spellings or Boolean operators. Moreover, as the researchers themselves point out, there were no geographical restrictions and no consideration of either online idioms or the practice of retweeting.
This is an exciting, important study, and the polling community should be taking it very seriously. It is well worth your time to read the whole thing, and I'm very curious to hear your take on it in the comments section below.