02/12/2008 03:29 pm ET | Updated May 25, 2011

Regression Analysis of the Democratic Race

Over the last few days, a number of political scientist bloggers have turned their statistical firepower on the Democratic presidential race, producing some analyses that are both tantalizing in their implications and confusing for those unfamiliar with multiple regression analysis. The most interesting posts come from Brendan Nyhan, Jay Cost and DailyKos diarist Poblano. Beyond pointing you to their efforts, I want to offer a few thoughts.

In many ways, the Democratic contest is the perfect problem for multiple regression analysis. Many different important variables appear to be strongly related to candidate support: race and ethnicity, gender, age, socio-economic status and whether voters participate in a primary or caucus (to name just the most obvious). We are really interested in understanding the independent effects of each of these factors. You can see crude efforts along these lines in the exit poll tabulations: How does vote preference vary by gender or age, for example, once we control for race? The promise of multiple regression is the ability to estimate the independent effects for a large number of different variables on vote choice while controlling for all of them simultaneously.
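To make the idea of "controlling for" another variable concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the eight "states," the two demographic percentages (`pct_a`, `pct_b`) and the formula that generates the vote share are invented, and the small `ols` helper is a bare-bones least-squares solver, not the software any of the bloggers used.

```python
def ols(rows, y):
    """Least-squares coefficients via the normal equations (intercept first)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical percentages for eight made-up states (not real data).
# The two groups tend to be large in the same states.
pct_a = [10, 20, 30, 40, 50, 60, 70, 80]
pct_b = [12, 18, 35, 38, 55, 57, 72, 85]

# Assumed "true" relationship: group A helps the candidate, group B hurts.
share = [40 + 0.5 * a - 0.3 * b for a, b in zip(pct_a, pct_b)]

# One variable at a time: the slope on pct_a absorbs part of pct_b's effect.
simple = ols([[a] for a in pct_a], share)

# Both variables together: each slope is estimated holding the other fixed.
multi = ols([[a, b] for a, b in zip(pct_a, pct_b)], share)

print("slope on pct_a, ignoring pct_b:", round(simple[1], 3))
print("slopes on pct_a and pct_b together:", [round(c, 3) for c in multi[1:]])
```

Because the two invented variables rise and fall together, the one-variable slope understates the effect of `pct_a`, while the two-variable model recovers the 0.5 and -0.3 built into the fake data. Real data are noisy, so real estimates would never come back this cleanly.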

Another tempting feature of multiple regression analysis -- at least in theory -- is the ability to take a model that does a good job predicting the Obama-Clinton vote looking backwards, plug values for the upcoming contests for each of the variables into the model (race, gender, age, etc.) and attempt to predict the outcomes. The lure of predicting "what might happen at the end of an active campaign" (as Poblano put it) is what led Bill Kristol to cite Poblano in his New York Times column. Obviously, if it were possible, we would all like to use hard data to anticipate what might happen in Ohio, Texas or Pennsylvania.

At the same time, the efforts by the aforementioned bloggers also demonstrate just how complex and challenging multiple regression analysis can be when applied to real world problems using real world data. Here are three reasons to be cautious about interpreting the models linked to above:

1) The data are imperfect. As Jay Cost explains, we have a choice between two kinds of data: "micro-level" exit poll data and "macro-level" data from statewide results. Exit polls collect data on the vote preferences and characteristics of individual voters. That level of data is ideal, since we want to understand how individuals vote (not states or counties). Unfortunately, for now, only the subgroup exit poll tabulations are available and not for all states. The networks have not conducted "entrance polls" for most of the smaller caucus states.

Data is plentiful at the aggregate level (mostly states) but far less precise. One problem is that Census data (on race, age, religion or socio-economic status) is based on the total population rather than those who participated in the Democratic primaries or caucuses. We also have a relatively small number of states to consider, and we have to deal with the statistical problem that population sizes vary considerably from state to state.

2) The models are poor predictors of the future. The limitations of the data are one reason why these sorts of regression models make for poor predictors of future outcomes. Consider the predictive accuracy of Poblano's model. He says it explained 95% of the variation in 26 states that voted through February 5 and reports estimates that predicted Obama's actual share of the vote within these states "within an average of two points." However, as TNR's Josh Patashnik points out, the model overestimated Obama's support in Louisiana (+11 points) and Nebraska (+8) and understated it in Washington (-14) and Maine (-7). The reason is something statisticians call "overfitting." The number of variables in Poblano's model (9) was large relative to the number of cases involved (26 states). So the "fit" of Poblano's model to the past data is deceiving because it is, in essence, too good. The 95% of variance explained is unique to those 26 states and thus does not generalize to predict the results in other states with anywhere near as much precision.

Reducing the number of variables does not solve the problem; it just makes the "fit" of the model to the existing data less predictive (though more realistic). Jay Cost explains why his own model is a decent vehicle for explaining the existing data but a poor predictor of future outcomes:

The model's predictive power (69%) is very high from a certain perspective. From another perspective, though, its accuracy is not great enough to [allow for] "publishable" predictions - not when candidates are often separated by tiny margins.
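For readers who want to see the overfitting problem in action, here is a rough Python sketch. It does not use Poblano's or Cost's data or variables. It generates purely synthetic "states" in which, by assumption, only the first of nine variables has any real effect, fits a one-variable and a nine-variable model to 26 of them, and then scores both models on fresh cases. The `ols`, `r_squared` and `make_states` helpers, the seed and all the numbers are illustrative assumptions.

```python
import random

def ols(rows, y):
    """Least-squares coefficients via the normal equations (intercept first)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(coefs, rows, y):
    """Share of the variance in y explained by the fitted coefficients."""
    pred = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

rng = random.Random(2008)  # arbitrary seed so the run is repeatable

def make_states(n, n_vars=9):
    """Synthetic cases: only variable 0 truly matters; the other 8 are noise."""
    rows = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n)]
    y = [50 + 5 * r[0] + rng.gauss(0, 5) for r in rows]
    return rows, y

past_rows, past_y = make_states(26)        # the cases the model is fit to
future_rows, future_y = make_states(2000)  # cases the model has never seen

full = ols(past_rows, past_y)                     # all nine variables
small = ols([r[:1] for r in past_rows], past_y)   # only the real one

r2_full_past = r_squared(full, past_rows, past_y)
r2_small_past = r_squared(small, [r[:1] for r in past_rows], past_y)
r2_full_future = r_squared(full, future_rows, future_y)
r2_small_future = r_squared(small, [r[:1] for r in future_rows], future_y)

print("9 variables: past fit %.2f, future fit %.2f" % (r2_full_past, r2_full_future))
print("1 variable:  past fit %.2f, future fit %.2f" % (r2_small_past, r2_small_future))
```

The nine-variable model always fits the 26 past cases at least as well as the one-variable model, simply because it has more knobs to turn. In runs like this one, the larger model's fit to new cases typically falls well below its fit to the old ones, which is the sense in which a fit can be "too good."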

3) Demography is not always destiny. Or to put it another way, campaigns matter. At least that is the underlying assumption behind all the personal campaigning, field organizing and paid advertising that both campaigns are doing. The one thing these models lack is a better measurement of the influence of the various means of campaigning. Once again, a lack of decent data is the primary culprit. For example, we do not yet have FEC reports providing decent breakdowns of how much the candidates spent in each state. Also, the University of Wisconsin's Advertising Project will ultimately have breakdowns of what each candidate spent on television advertising in each media market, but those data are not yet in the public domain. The impact of campaigning so far is important. Will it matter, for example, that the campaigns will now slow down enough so that the candidates can devote significantly more time and paid advertising to states like Ohio, Texas and Pennsylvania than they did to the Super Tuesday states?

Jay Cost's model does include "number of candidate visits" as a variable meant to "measure campaign effects per state." He reports that:

Clinton does better as the number of candidate visits increases. This was a bit of a surprise, but it is good news for her. Campaign effects seem to incline the electorate to her.

This finding is intriguing, but I wonder how the results might differ if Cost had used separate variables for the visits of each candidate rather than just the total number of visits for all candidates.
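A small Python sketch shows why that choice of variable could matter. It is purely hypothetical: the visit counts are invented, and the assumed effects (each Clinton visit adds 2 points to her share, each Obama visit subtracts 1.5) are made up for illustration and say nothing about what Cost's data would actually show. The `ols` helper is a bare-bones least-squares solver written for this example.

```python
def ols(rows, y):
    """Least-squares coefficients via the normal equations (intercept first)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Invented visit counts for eight made-up states (not real data).
clinton_visits = [1, 3, 2, 5, 4, 6, 2, 7]
obama_visits = [2, 1, 4, 3, 6, 2, 5, 4]

# Assumed effects, chosen only for illustration.
clinton_share = [45 + 2.0 * c - 1.5 * o
                 for c, o in zip(clinton_visits, obama_visits)]

# Model 1: a single variable for total visits by both candidates.
total = ols([[c + o] for c, o in zip(clinton_visits, obama_visits)],
            clinton_share)

# Model 2: a separate variable for each candidate's visits.
separate = ols([[c, o] for c, o in zip(clinton_visits, obama_visits)],
               clinton_share)

print("effect of one more visit by anyone:", round(total[1], 2))
print("effect of a Clinton visit, an Obama visit:",
      [round(b, 2) for b in separate[1:]])
```

In this made-up example the total-visits coefficient comes out positive, which would read as "more visits help Clinton," even though the fake data were built so that Obama's visits hurt her. A single combined variable forces both candidates' visits to share one coefficient, so it cannot distinguish the two.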

Mechanical issues of this sort help illustrate one of the practical limitations of regression modeling. It is a very powerful tool, but it is also sensitive to decisions the analyst makes about what data to use and what variables to include. We will no doubt see more attempts to model the primary campaign in the future. Do not be surprised if reasonable people disagree about what data is most appropriate, what model best "fits" the data and about which conclusions are best supported.

Update: Just want to underline a point that may have been unclear. Neither Brendan Nyhan nor Jay Cost used their regression models to try to predict future outcomes.