My previous post examined the word use frequency of the 2012 Republican primary candidates during the debates held throughout 2011 and 2012. This post employs two more complex forms of association: word adjacency (words frequently spoken together in sequence) and a looser notion of clustering by words (statements tending to contain similar words). The former provides an analysis similar to the earlier word frequency work, but adds the potential to pick out common short phrases. The latter can be used to group co-occurring words into distinct categories, which may correspond to topics within the political debate. Manually determining the political topic of each statement from the debates would be a long, labor-intensive process, and seems a perfect candidate for a computational solution.
Examining Adjacent Words
In addition to word frequency, we examined bigrams and trigrams, or sequences of two and three words, respectively ("n-grams" are a general concept frequently employed in language analysis). The word cloud representations can be accessed here. They were generated using similar methods to those in the previous post, with font size scaled according to TF-IDF, highlighting phrases used more exclusively by one candidate relative to his or her opponents.
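The post doesn't show its implementation, but the n-gram scoring it describes is straightforward. Here is a minimal sketch in Python, assuming each candidate's debate text has already been tokenized into a list of words; each candidate's full text is treated as one "document," so an n-gram used by every candidate scores zero:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Return the n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_scores(docs, n=2):
    """Score each n-gram in each document by TF-IDF.

    docs: dict mapping a speaker name to that speaker's token list.
    Returns {speaker: {ngram: score}}, where high scores mark
    phrases used heavily by one speaker and rarely by the others.
    """
    counts = {name: Counter(ngrams(toks, n)) for name, toks in docs.items()}
    n_docs = len(docs)
    df = Counter()                      # document frequency of each n-gram
    for c in counts.values():
        df.update(set(c))
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            g: (freq / total) * math.log(n_docs / df[g])
            for g, freq in c.items()
        }
    return scores
```

Scaling font size by these scores (rather than raw counts) is what makes a phrase like "business cycle" stand out for one candidate even if other candidates occasionally use it too.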
Congressman Paul again offers many n-grams that invoke components of his well-known platform: "another war" (something we should not have), "business cycle" (something that his opponents need to "understand"), and the "Federal Reserve system." Among Gov. Romney's common n-grams are many that appear to relate to economic policy, such as "right course," "free economy," and "thousands of jobs," but not as many relating to his time as governor of Massachusetts ("pro-life governor" and "record as governor" are small but visible). This supports the conventional wisdom that he de-emphasized his political experience in favor of his "private-sector" experience. The phrases "come here legally" and "come here illegally" noted in the first post appear prominently among his trigrams.
As with word frequency, n-grams provide information about the overall performance of the candidates. A true topical analysis would examine smaller blocks of the debates, down to the statement level, which is where we turned next.
By dividing the statements of each candidate by subject matter (foreign policy, immigration, etc.), and tracking how their coverage changed over the course of the debates, we thought we might find some noteworthy patterns in the candidates' speech. Some variation is expected, since it depends on the questions the candidates are asked in each debate, but variation also reflects individual preferences -- candidates are generally allowed to discuss subjects unrelated to the questions they are asked.
The typical methodology for automatic classification involves "training" a system on text with known categories; it "learns" the attributes associated with documents of each category and uses those associations to improve its predictions. But if we went to the trouble of creating a training set and a categorization scheme, we might as well classify the statements by hand.
Document clustering -- a form of "unsupervised learning" -- does not require either prior knowledge of topics or predefined categories. By representing documents by the words they contain, it is possible to measure the "distance" between documents in a similar way to measuring the distance between two points on a graph. The "closer" two documents are, the more "similar" they are. In this case, each "document" consists of either a single uninterrupted statement spoken by a candidate, or the merger of consecutive statements that were separated by brief interjections. We again used TF-IDF to represent word use (this is a much more common use for the measure).
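The post does not name its distance measure, but for TF-IDF vectors the standard choice is cosine similarity: each document becomes a sparse vector of term weights, and two documents are "close" when their vectors point in similar directions. A minimal sketch, with documents represented as dicts of term-to-weight:

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors (dicts of term -> weight).

    Returns 1.0 for identical directions, 0.0 for vectors with no
    shared terms; distance is then 1 - similarity.
    """
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Cosine similarity ignores document length, which matters here: a long statement and a short one on the same topic should still land near each other.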
The particular algorithm we used, k-means clustering (see the Wikipedia entry for a more detailed explanation), requires a fixed number of clusters, so we tried several values. We also ran the algorithm multiple times for each value, since its random initialization produces slightly different results on each run even with the same number of clusters. From these trials we chose the clustering that appeared to best divide the statements by category.
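The restart procedure can be sketched as follows. This is not the post's actual code: it is a toy implementation of Lloyd's algorithm on dense vectors, and it selects the best run automatically by lowest inertia (total squared distance to centroids), a standard criterion, whereas the post chose among runs by inspecting each cluster's top words:

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    """One run of Lloyd's algorithm; returns (labels, inertia)."""
    centroids = random.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:        # converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(d) / len(members) for d in zip(*members)]
    inertia = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, inertia

def best_of(points, k, trials=10):
    """Re-run k-means several times and keep the lowest-inertia result."""
    return min((kmeans(points, k) for _ in range(trials)),
               key=lambda r: r[1])
```

Each call to `kmeans` starts from a different random choice of centroids, which is why repeated runs on the same data can split the statements differently.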
The clustering we selected contains six clusters. The primary metric used to select it was the top 10 words by TF-IDF score across the full text of each cluster, which should be indicative of the topic(s) it contains. Those results are reproduced below.
Ten most common words in each of the six clusters from the "best" clustering. (Counting from zero is a computer science convention.)
The similarity of terms in most of the clusters is promising. We can infer that the statements in cluster 0 deal with foreign policy, cluster 2 with immigration, and clusters 4 and 5 with economics. Examining its contents shows that cluster 4 also deals with health care reform. The content of cluster 3 is hard to guess from the word summary, but it consists of statements dealing with the candidates' records (such as Mitt Romney's record in Massachusetts and how Rick Santorum won in traditionally blue Pennsylvania).
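Per-cluster summaries like the table above can be produced by pooling each cluster's statements into one document and scoring terms with TF-IDF across the clusters. A minimal sketch, assuming statements are already tokenized (a term used in every cluster scores zero and sinks to the bottom of the list):

```python
from collections import Counter
import math

def top_terms(clusters, n_top=10):
    """Summarize clusters by their highest-TF-IDF terms.

    clusters: list of clusters, each a list of tokenized statements.
    The pooled text of each cluster is treated as one document.
    Returns one ranked term list per cluster.
    """
    pooled = [Counter(tok for stmt in cl for tok in stmt) for cl in clusters]
    n_docs = len(pooled)
    df = Counter()                      # how many clusters each term appears in
    for c in pooled:
        df.update(set(c))
    summaries = []
    for c in pooled:
        total = sum(c.values())
        scored = {t: (f / total) * math.log(n_docs / df[t])
                  for t, f in c.items()}
        ranked = sorted(scored, key=scored.get, reverse=True)
        summaries.append(ranked[:n_top])
    return summaries
```

Scoring against the other clusters, rather than using raw counts, is what keeps ubiquitous words like "people" from dominating every summary.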
Cluster 1 appears to be a mixture of topics, and this graph of cluster size helps to explain why:
Number of statements in each cluster
Cluster 1 contains more statements than the other five clusters combined. So although the categorization looks good, there are not enough statements in each topical cluster to answer our initial question.
We can, however, make some potentially interesting observations based on the similarities between trials. An immigration cluster appears consistently across most trials, as does a foreign policy cluster. The others are less consistent. This suggests that the words used to discuss these two subjects, or at least some particular aspects of them, are consistent and unique relative to other topics. Most trials also feature a large cluster such as cluster 1 above, indicating a significant degree of homogeneity of word use across the debates.
A detailed analysis of the themes of these debates would require either a more advanced computational technique or laborious human analysis. In my next post, I will discuss using words to analyze emotional content on a very basic scale.