As Centre College prepares excitedly for the Vice Presidential Debate on Oct. 11, its government and computer science programs have collaborated on a research project that applies natural language processing and text mining techniques to the transcripts of the Republican primary debates. The debates provide a source of speech from multiple candidates (we consider eight of the 10 who participated) in a similar environment and over an extended period (20 debates spanning from June 2011 to February 2012). This is the first in a series of articles discussing my summer research, which was performed in collaboration with computer science professors Dr. Forrest Stonedahl and Dr. Joseph Oldham, and government professor Dr. Benjamin Knoll.
Computers cannot yet match the human ability to read and comprehend, although artificial intelligence researchers are making great strides. However, computers can perform many text-related tasks faster and more accurately than we can. With modern processing power, even a personal computer can power through thousands of lines of text in seconds, calculating useful (if simple) metrics. Algorithms are even able to construct categorization schemes from scratch, allowing for automated or semi-automated classification.
The starting point for our analysis is the words each candidate uses. Examining word frequency and use can offer insights into speaking style, topics discussed, and how those topics are characterized. Since the candidates we considered spoke some 265,472 words across all of their appearances, a computational solution is preferable.
Using transcripts available from the American Presidency Project at the University of California, Santa Barbara, we wrote a program to sort the text by speaker and count the number of times each word appears. Articles, pronouns, and a few other very common words that would otherwise overwhelm the more useful terms are excluded. The following collection of word clouds, generated using Wordle, is an effective way to visualize the results.
Many of the most frequently used terms are essential to any discussion of politics, such as "people," "government," and "president." While it may be interesting that all the candidates use the word "people" with high frequency, it does not say much about them individually. This illustrates a downside to using raw term frequency--it does not take overall word usage into account, resulting in a collection of highly weighted terms that are too generic to offer much insight.
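The counting step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the program we actually ran: the "SPEAKER: text" transcript layout, the abbreviated stop-word list, the function name count_words_by_speaker, and the sample lines are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, abbreviated stop-word list; the real list is not given in the article.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "we", "i", "is", "are", "that"}

# Assumed transcript layout: each new turn begins with an upper-case "SPEAKER: " label.
TURN = re.compile(r"^([A-Z][A-Z .'-]+):\s*(.*)$")

def count_words_by_speaker(lines):
    """Return a mapping from speaker name to a Counter of that speaker's words."""
    counts = defaultdict(Counter)
    speaker = None
    for line in lines:
        match = TURN.match(line)
        if match:
            speaker, text = match.group(1), match.group(2)
        else:
            text = line  # continuation of the current speaker's turn
        if speaker is None:
            continue  # skip any text that appears before the first speaker label
        words = re.findall(r"[a-z']+", text.lower())
        counts[speaker].update(w for w in words if w not in STOP_WORDS)
    return counts

# Invented sample lines, not quotations from the debates.
sample = [
    "PAUL: The people want sound money.",
    "ROMNEY: The people of Massachusetts want jobs.",
    "And the people want a president who leads.",
]
counts = count_words_by_speaker(sample)
print(counts["ROMNEY"].most_common(3))
```

Even in this toy sample, "people" tops the list for both speakers, which is the same generic-term problem that raw frequency shows in the full transcripts.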
Term frequency-inverse document frequency (TF-IDF), borrowed from information retrieval, provides another way to measure relative word use. (The Wikipedia entry for TF-IDF has a fairly good technical summary.) This technique weights words based on the number of times they appear within a document (in this case, the full text of all of a candidate's statements in the debates) and the number of documents in which they appear. Under this scheme, a term used infrequently but exclusively by one candidate should have a higher weight than a term used repeatedly but by most of the candidates. In theory, the terms that this method elevates should be indicative of candidate priorities and speaking styles.
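To make the weighting concrete, here is a small sketch of one common TF-IDF formulation. TF-IDF has several variants and this article does not pin down which one applies, so the formula below (raw count multiplied by 1 + ln(N / df), where N is the number of documents and df is the number of documents containing the term) is an assumption, as are the function name tf_idf and the toy counts.

```python
import math
from collections import Counter

def tf_idf(doc_counts):
    """Weight each term by tf * (1 + ln(N / df)).

    doc_counts maps a document name to a Counter of raw term counts.
    This is one assumed variant; a term used in every document keeps
    a weight equal to its raw count instead of dropping to zero.
    """
    n_docs = len(doc_counts)
    doc_freq = Counter()
    for counts in doc_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    return {
        name: {
            term: tf * (1.0 + math.log(n_docs / doc_freq[term]))
            for term, tf in counts.items()
        }
        for name, counts in doc_counts.items()
    }

# Toy counts invented for illustration; they are not debate data.
docs = {
    "A": Counter({"people": 4, "policeman": 3}),
    "B": Counter({"people": 12, "jobs": 5}),
    "C": Counter({"people": 9, "jobs": 2}),
}
weights = tf_idf(docs)
for term, weight in sorted(weights["A"].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.2f}")
```

In document "A", the term "policeman" is used less often than "people" but appears in no other document, so it ends up with the larger weight.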
Several terms in Congressman Ron Paul's word cloud, such as "inflation," "monetary," and "fed," stand out as related to the topics with which he is commonly associated. "Policeman" is another, coming from his repeated use of the phrase "policemen of the world," as in his introduction to the Jan. 26, 2012, debate: "I am the champion of... a foreign policy based on strength, which rejects the notion that we should be the policemen of the world and that we should be a nation builder." While he uses it only seven times, no other candidate does, so it has a high weight here. In this case, it points us not only to a phrase unique to the candidate but also to a position -- non-interventionist foreign policy -- on which he differs from the rest of the Republican field.
Several interesting topical terms appear in former House Speaker Newt Gingrich's cloud, like "NASA," "gas," and "1994," but they are not what stands out. His use of adverbs such as "virtually," "frankly," "dramatically," and "fundamentally" (identified by Dan Amira of New York magazine as Gingrich's "favorite word") is more a component of his speaking style. While we are not the first to identify the former Speaker's penchant for adverbs, this analysis brings it to the forefront, particularly emphasizing the adverbs that Gingrich uses considerably more frequently than his opponents.
Another observation concerns the way former Governor Mitt Romney conceptualizes immigration policy. The words "legally" and "illegally" both rank highly in his word cloud. In reading through the statements where the terms occur, it becomes apparent that Governor Romney goes to great lengths to draw a distinction, illustrated in this excerpt from the Dec. 15, 2011, debate:
My view is, people who have come here illegally, we welcome you to apply but you must get at the back of the line, because there are millions of people who are in line right now that want to come here legally. I want those to come here legally. Those that are here illegally have to get in line with everybody else.
By repeatedly differentiating "legal" from "illegal" immigrants, Romney may have been attempting to "walk the tightrope" and appeal to both a primary electorate with a hard-line immigration stance and a general election demographic of conservative Latino voters with a more moderate immigration position. In addition, Romney has already announced his opposition to President Obama's recent decision to limit the deportation of young immigrants who were brought to the country illegally as children. The strong "legal" vs. "illegal" distinction that appears here could resurface during this fall's debates.
The three governors -- Jon Huntsman, Rick Perry, and Mitt Romney -- each mention their home states enough for them to appear in these clouds, suggesting that they talk about their records in office. However, "Massachusetts" does not appear nearly as large in Romney's cloud as "Utah" and "Texas" do in Huntsman's and Perry's, respectively. Given that Massachusetts is known as a blue state, and that his record as governor has been a subject of considerable debate, it is reasonable to expect that Romney might not have discussed it as much. However, he said the word "Massachusetts" 70 times! Because "Massachusetts" is used by each of the other candidates (except Herman Cain), presumably in attacking Romney, the term receives a lower weight than it otherwise would. Only three candidates besides Huntsman mention Utah, and Perry's 104 mentions of Texas are enough to produce a high score even though every other candidate uses the word as well. These numbers support the expectation that candidates would focus on the front-runner and his record (while Romney and Perry were both front-runners at one point or another, Huntsman does not appear ever to have broken 5 percent in a national poll).
One unusual finding is the high prominence of the word "boot" in Governor Perry's word cloud. At first, we thought it might have something to do with giving the Obama administration "the boot," i.e., kicking them out of office. However, it is actually used exclusively in the phrase "boots on the ground," referring to Border Patrol agents along the U.S.-Mexican border. This is a good illustration of the care that must be taken in using these results. While automated statistical analysis can suggest interesting avenues of investigation, there is a danger of misunderstanding when words are removed from their original context.
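One simple way to put a surprising term back into context is a keyword-in-context listing, which prints a few words on either side of each occurrence. The sketch below is only an illustration: the function name keyword_in_context, the crude prefix match (which treats "boots" as a form of "boot" but would also catch unrelated words such as "booth"), and the sample sentence are all assumptions, and the sentence is invented rather than quoted from a debate.

```python
import re

def keyword_in_context(text, keyword, width=3):
    """Return each occurrence of keyword with `width` words on either side.

    Matching is a crude case-insensitive prefix test, standing in for
    proper stemming, so "boot" also matches "boots".
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower().startswith(keyword.lower()):
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [token] + right))
    return hits

# Invented sentence for illustration only.
sample = "We need more boots on the ground along the border."
hits = keyword_in_context(sample, "boot")
for hit in hits:
    print(hit)
```

Scanning a listing like this for every occurrence of a term makes it quick to see whether the term belongs to a single recurring phrase.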
The word-frequency analysis is intended to provide a broad overview of the candidates. Over the next few weeks, I will post several more summaries of the research we have done this summer.