11/05/2012 09:48 am ET Updated Dec 06, 2017

Going Negative (or Positive): Sentiment Analysis on the 2012 Republican Primary Transcripts

Positive versus negative campaigning is a common discussion topic in politics. The 2012 campaign has so far been described as highly negative, perhaps the most negative ever. Political scientists treat claims such as these skeptically -- Centre government professor Dr. Benjamin Knoll will point you to this video about the election of 1800. However, in light of this discussion, it may be interesting to examine the role of positive and negative word choice in the Republican primary that led up to Governor Romney's candidacy.

I should note that determining emotional content of written language is one of the hardest problems in artificial intelligence. Computers struggle with even relatively simple natural language -- sarcasm, humor and even negation present difficulties. Computational approaches to text processing always come with a caveat on reliability, but with sentiment analysis it should especially be kept in mind. Disclaimers aside, let's see what emerges when we analyze the Republican primary debate text using a very simple automated sentiment analysis technique.

Methods for assessing sentiment on a binary, either-or scale can achieve reasonably useful accuracy. Classifying movie reviews as positive or negative, for example, can be done surprisingly well based on the words used (a demo of such a system is available here). The process assesses the sentiment of each word in a statement and averages the scores together, assigning words not in the lexicon a sentiment value of zero. We can use this to assess the tone of the words used in the debates, which provides an indirect measure of a candidate's negativity. (A more accurate measure would require some fairly sophisticated sentence-level semantic analysis.)
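To make the averaging step concrete, here is a minimal Python sketch of the approach described above. The tiny lexicon, the tokenizing pattern and the function name score_statement are made up for illustration; they are not the word list or the code used for this analysis.

```python
import re

# Hypothetical toy lexicon for illustration only (scores on a -5 to 5 scale).
TOY_LEXICON = {"good": 3, "great": 3, "bad": -3, "terrible": -3, "fail": -2}


def score_statement(text, lexicon, stopwords=None):
    """Average the lexicon scores of the words in a statement.

    Words missing from the lexicon count as 0. Words listed in
    `stopwords` (an optional set of "glue" words) are dropped first.
    """
    # Simple tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if stopwords:
        words = [w for w in words if w not in stopwords]
    if not words:
        return 0.0
    return sum(lexicon.get(w, 0) for w in words) / len(words)


print(score_statement("This is a great plan", TOY_LEXICON))  # 0.6
print(score_statement("Not bad", TOY_LEXICON))               # -1.5
```

The second call shows the weakness mentioned earlier: "not bad" is scored as negative, because a word-by-word average knows nothing about negation.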

We used a word list designed for social media research, with scores ranging from -5 to 5. Social media and political debates are admittedly different contexts, but we believe they share enough in common that this is not too much of a stretch.

Below is a histogram of the sentiment scores of all statements in the 2012 Republican primary debate transcripts. The statements are grouped into 50 bins based on score; for example, the large central bar indicates that there were approximately 750 statements scoring between 0.0 and 0.2.
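The binning behind a histogram like this one can be sketched in a few lines of Python. The sketch assumes the 50 bins span the full lexicon range of -5 to 5, which gives a bin width of 0.2 and would be consistent with the example above; the plot itself may have used different bin edges. The function name and the sample scores are hypothetical.

```python
def bin_scores(scores, n_bins=50, lo=-5.0, hi=5.0):
    """Count how many scores fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for s in scores:
        # Position of the score within [lo, hi], scaled to a bin index.
        idx = int((s - lo) * n_bins / (hi - lo))
        # Clamp so that a score equal to hi lands in the last bin.
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up statement scores, not data from the debates.
sample = [0.0, 0.1, 0.15, -0.3, 5.0, -5.0]
counts = bin_scores(sample)
print(counts[25])  # number of sample scores in the bin starting at 0.0
```

Under these assumptions, bin 25 is the central bar covering scores from 0.0 up to 0.2.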


Most statements have a score near 0, indicating neutrality. This is to be expected, as in a complex expression there are likely both positive and negative words, cancelling each other out. Overall, there appear to be more positive comments than negative. Remember, however, that this does not necessarily mean that the debates were more positive than negative; it does show that in a slight majority of the candidates' statements, words scored as "positive" outweighed words scored as "negative."

When broken down by candidates, the graph becomes more interesting. The following is a box-and-whisker plot of the sentiment values of the statements of each of the nine candidates we analyzed (former Minnesota Governor Tim Pawlenty, who participated in only two debates, was excluded in previous posts because we did not have enough data for a good word-frequency analysis).


A word of explanation is in order here. The blue rectangles indicate the range of the middle 50 percent of scores for each candidate; for example, roughly 50 percent of Mr. Romney's statements scored between 0 and 1. The dashed lines, or whiskers, encompass the vast majority of the data, with outliers represented as crosses outside of the whiskers. The red line and green dot within each box represent the median and mean score, respectively, for each candidate. Again using Mr. Romney as an example, his median score is 0.25 and the mean sentiment of all of his statements is 0.42. His mean sentiment is actually the highest of all nine candidates (although Governors Huntsman and Pawlenty are very close). This fits with a common pattern observed in campaign advertising: challengers are more likely to "go negative" than incumbents.
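For readers who want to see how the pieces of a box-and-whisker plot are computed, here is a rough Python sketch. It assumes the common convention that whiskers extend to the most extreme points within 1.5 times the interquartile range of the box; the plot above may have been drawn with a different rule. The function name and the scores are invented for illustration.

```python
import statistics


def box_summary(scores, whisker=1.5):
    """Return the numbers a box-and-whisker plot displays for one group."""
    data = sorted(scores)
    # Quartiles: the box spans q1 to q3, i.e. the middle 50 percent of scores.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme data points inside the fences.
    inside = [s for s in data if low_fence <= s <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(data),
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Points beyond the fences are drawn individually as outliers.
        "outliers": [s for s in data if s < low_fence or s > high_fence],
    }


# Made-up statement scores for one hypothetical candidate.
example = [-1.0, 0.0, 0.0, 0.25, 0.5, 1.0, 1.0, 4.0]
print(box_summary(example))
```

In this invented example the single very positive statement (4.0) pulls the mean above the median, the same kind of gap seen between a candidate's mean and median in the plot.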

Another interesting observation is that the moderate candidates -- Huntsman, Pawlenty and Romney -- appear to have been slightly more positive (in their word choice) than the others. This is in line with the common perception of more ideologically extreme candidates as more negative. It should be noted, though, that Governors Huntsman and Pawlenty made fewer statements during the debates than the other candidates. Also, the comparable sentiment score range for Senator Santorum (widely considered one of the most conservative candidates) casts doubt on both of these interpretations.

Ron Paul appears to be the most negative of the candidates. This does not necessarily mean that the Texas congressman is the most negative in his message; it may only mean that the words he uses are classified as more negative. We checked the sentiment scores of the words that qualified for inclusion in our TF-IDF analysis, and found that he had the most negative sentiment score for individual words, as well. This suggests the latter possibility is likely, but a more complex analysis would be necessary to rule out the former.

So far the analysis of the debates has been based on all of the words used by the candidates (excluding some common "glue" words, like prepositions and articles). My next post will focus on specific words -- the candidates' names as spoken by each other.