Upgrading Pollster's Trend Lines: The Kalman Filter
Regular readers have probably noticed a few subtle changes in our trend lines over the last 24 hours (and perhaps a few temporary glitches). The good news is that we have been rolling out the first in a series of long overdue upgrades I hinted at when we debuted HuffPost Pollster, and I want to take a few minutes to explain what's changing and why.
Those who have followed Pollster.com from the beginning may remember that we started with charts based on simple "last 5 poll" averages whose trend lines were anything but smooth. Simple averaging reduced the random noise associated with individual surveys, but much remained. So in 2007 we started plotting charts using a loess regression model that draws smooth trend lines to fit noisy polling data. Rather than report simple estimates, we've reported the value of the end-point of the trend lines, or what we have called "trend estimates."
That smoothing has a huge practical benefit: It draws trend lines that are typically a better representation of real underlying trends. Those lines resist "chasing" outlier polls or the variation that's either random and inherent in polling data or that results from differences in polling methodology rather than real shifts in opinion. When they work well, the smooth trend lines help you see real trends more accurately and put new polls into perspective. You can see easily how they compare to the overall trend.
But the loess regression model has some limitations that we have struggled with. The computations generally run smoothly when polls are plentiful, but they sometimes go awry when we have only a small number of polls available. With fewer than 8 polls (the scenario that applies to most of the U.S. House races right now), we do not even attempt to draw loess lines and plot simple linear (straight) trend lines instead. The straight lines produce current "trend estimates" that are no more accurate than the most recent poll, and sometimes considerably less so.
So as of yesterday, we have added an important new first step to this process. The generation of Pollster's trend lines now begins with a statistical tool called a "Kalman Filter," which smooths survey data in a manner that's conceptually similar to loess regression. However, as explained in a helpful 1999 article in Public Opinion Quarterly (Green, Gerber and De Boef, "Tracking Opinion Over Time"),* Kalman filtering adds some important properties. First, because it treats each data point on the chart as a survey estimate (with an associated sample size and margin of error) rather than just a number, it provides a means of quantifying the accuracy of the lines -- including the end points that we typically call our "trend estimates." That property is useful in translating trend estimates into probabilities. Second, it provides us with some additional tools (that I'll describe in a subsequent post) to improve the accuracy of forecasts based on the polling data in the chart. Third, from a purely practical perspective, Kalman filtering provides a more consistent and reliable process for us to use to generate these charts when polls are sparse.
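To make the idea of a poll as "a survey estimate rather than just a number" concrete, here is a minimal Python sketch of how a sample size translates into a sampling variance and a margin of error. It assumes simple random sampling and a 95% confidence level; the function names are hypothetical illustrations, not part of Pollster's code.

```python
import math

def poll_variance(share, sample_size):
    """Sampling variance of a poll proportion (assumes simple random sampling)."""
    return share * (1.0 - share) / sample_size

def margin_of_error(share, sample_size, z=1.96):
    """Approximate margin of error; z = 1.96 corresponds to 95% confidence."""
    return z * math.sqrt(poll_variance(share, sample_size))

# A candidate at 50% in a poll of 1,000 respondents:
moe = margin_of_error(0.50, 1000)
print(f"margin of error: +/- {100 * moe:.1f} points")
```

Real polls use weighting and design effects that widen this figure, so treat it as a lower bound on the uncertainty rather than an exact value.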
We have developed a specific Kalman Filter for our charts that is adapted from a model developed by two academic friends-of-Pollster, Jeff Lewis and Simon Jackman. The next few paragraphs get a bit technical (and are intended for statisticians and others who know more than the rest of us about how Kalman Filters work -- feel free to skip to the paragraph that starts "in plainer English" below): Our Kalman Filter smooths the polling results by considering (1) how big the sample size is for each poll, giving less weight to polls with fewer respondents, and (2) how much the Kalman Filter "thinks" a race is likely to jump around. The second point is part of Jeff and Simon's model, which estimates the variance of each individual race over time as well as the covariance between races (this is sometimes referred to as the "process noise" or "innovation matrix" of the Kalman Filter).
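For readers who want to see the two ingredients at work, here is a minimal single-race sketch in Python. It is not the Lewis-Jackman model: it assumes support follows a simple random walk, handles one race at a time, and takes the process variance as a fixed input rather than estimating it. The function name, the starting values and the example numbers are all assumptions made for illustration.

```python
def kalman_filter(polls, process_var, init_mean=0.5, init_var=0.25):
    """Filter a time-ordered list of (share, sample_size) polls.

    process_var is how much true support is assumed to drift between polls.
    Returns a list of (filtered_mean, filtered_variance) pairs.
    """
    mean, var = init_mean, init_var
    filtered = []
    for share, sample_size in polls:
        # Predict: under a random walk the mean carries over, uncertainty grows.
        predicted_var = var + process_var
        # Poll noise shrinks as the sample grows (floored to avoid a zero variance).
        poll_var = max(share * (1.0 - share), 1e-6) / sample_size
        # The gain is the weight given to the new poll versus the prior estimate.
        gain = predicted_var / (predicted_var + poll_var)
        mean = mean + gain * (share - mean)
        var = (1.0 - gain) * predicted_var
        filtered.append((mean, var))
    return filtered

# Ten steady polls at 60%, then one outlier at 20% (made-up numbers).
polls = [(0.60, 800)] * 10 + [(0.20, 800)]
for mean, var in kalman_filter(polls, process_var=1e-5)[-2:]:
    print(f"estimate {mean:.3f}, std. dev. {var ** 0.5:.4f}")
```

A smaller `process_var` tells the filter the race rarely moves, so a surprising poll shifts the estimate less; a smaller sample size inflates `poll_var` and has the same effect.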
In other words, if a candidate has had a steady 60% in the polls, and then we suddenly see a poll where he or she has 20%, the Kalman Filter will be less inclined to trust the latter poll, where loess smoothing would have been dragged down. In addition, the Kalman Filter can incorporate the correlation between races into its forecasts. For example, if the filter has learned that Barbara Boxer's scores go up when Harry Reid's scores go up and we see a new poll where Reid is doing well, it may give a slight bump to Boxer as well, even in the absence of new polls (for what it's worth, we are not currently seeing much in the way of this sort of national trend).
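The cross-race effect can be sketched with two races whose support is tracked jointly. In this hypothetical Python example, a new poll arrives for race A only, yet the estimate for race B moves too because the two are assumed to be positively correlated. The covariance values and the function name are invented for illustration and are not taken from the actual model.

```python
def update_two_races(mean, cov, poll_share, poll_var):
    """Kalman update of a joint (race A, race B) estimate after a poll of race A only.

    mean is (a, b); cov is ((var_a, cov_ab), (cov_ab, var_b)).
    """
    a, b = mean
    (var_a, cov_ab), (_, var_b) = cov
    innovation_var = var_a + poll_var
    gain_a = var_a / innovation_var    # how far race A moves toward the poll
    gain_b = cov_ab / innovation_var   # how far race B moves, via the covariance
    surprise = poll_share - a
    new_mean = (a + gain_a * surprise, b + gain_b * surprise)
    new_cov = (
        (var_a - gain_a * var_a, cov_ab - gain_a * cov_ab),
        (cov_ab - gain_a * cov_ab, var_b - gain_b * cov_ab),
    )
    return new_mean, new_cov

# Race A was at 45%, race B at 48%; a new poll shows race A at 50%.
prior_mean = (0.45, 0.48)
prior_cov = ((4e-4, 2e-4), (2e-4, 4e-4))   # assumed positive covariance
new_mean, _ = update_two_races(prior_mean, prior_cov, 0.50, 0.25 / 800)
print(f"race A: {new_mean[0]:.3f}, race B: {new_mean[1]:.3f}")
```

Setting the off-diagonal covariance to zero leaves race B untouched, which is the situation when the filter finds little evidence of a shared trend.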
One quirk with Kalman filtering, however, is that it moves forward in time, estimating today's support based only on what it saw yesterday and the days before. The unfortunate result is that the output of the Kalman Filter often looks jagged or abrupt. To address this issue, we employ a commonly used technique called Forward Filtering Backward Sampling (FFBS) (Kim, Shephard, and Chib, 1998) to smooth our results. Among other things, FFBS can help create simulations of what happens between now and election day that can be used to estimate the probability that a candidate will win a race. More on that to follow in a later post.
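The backward half of FFBS can be sketched in a few lines of Python. The example assumes a single race following a random walk, a strictly positive process variance, and forward-filtered (mean, variance) pairs that have already been computed; the function name and the numbers are hypothetical. Starting from the final day, it draws one plausible path of support backward in time, letting each earlier day "see" what came after it.

```python
import random

def backward_sample(filtered, process_var, rng=random):
    """Draw one smoothed path from forward-filtered (mean, variance) pairs.

    Assumes a random-walk state and process_var > 0.
    """
    last_mean, last_var = filtered[-1]
    path = [rng.gauss(last_mean, last_var ** 0.5)]
    for mean, var in reversed(filtered[:-1]):
        # Weight on the already-sampled later value versus the filtered mean.
        b = var / (var + process_var)
        cond_mean = mean + b * (path[-1] - mean)
        cond_var = var * (1.0 - b)
        path.append(rng.gauss(cond_mean, cond_var ** 0.5))
    path.reverse()   # return the path in chronological order
    return path

# Made-up filtered estimates for three consecutive days.
filtered = [(0.50, 1e-4), (0.52, 1e-4), (0.55, 1e-4)]
rng = random.Random(2010)
draws = [backward_sample(filtered, 1e-5, rng) for _ in range(1000)]
smoothed = [sum(day) / len(day) for day in zip(*draws)]
print([round(s, 3) for s in smoothed])
```

Averaging many sampled paths gives a smoothed line, while the spread of the paths is what makes simulation-based probabilities possible.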
In plainer English, that means the Kalman Filter process brings a lot of important "extra stuff" to our charts. There is a catch, though, especially for those of us who have grown accustomed -- for all the right reasons -- to the even smoother lines that Professor Charles Franklin developed for Pollster.com. Even after FFBS, Kalman Filter output still looks more jagged than our standard chart. Here's an example using the Nevada Senate race:
After pondering this issue, we decided to add an extra step: The standard charts now running on HuffPost Pollster take the slightly more jagged Kalman Filter trend lines and run them through the same loess regression "smoother" model we have been using for the last three years. The net result provides what we consider the best of both worlds: the added properties of Kalman filtering, delivered in the form of smoother trend lines that -- when polls are ample, as they are currently in the most competitive Senate and Governor races -- are virtually identical to what we've been doing all along. Here is the end result, again using the Nevada Senate race to show our standard trend estimate line. It should look very familiar:
If you're a data geek and want to see it, we have tucked the raw Kalman Filter output just a few clicks away. Use the Smoothing tool in the chart and select the "More Sensitive" option, as I did to generate the first example above.
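The second stage, running a jagged line through a loess-style smoother, can be illustrated with a bare-bones local linear regression using tricube weights. This Python sketch is a textbook simplification and not the smoother Pollster actually runs: the `span` parameter, the function name and the zigzag test data are all assumptions, and it expects distinct x values.

```python
def loess_smooth(xs, ys, span=0.5):
    """Basic loess: a weighted straight-line fit around each point.

    span is the fraction of points used in each local fit.
    """
    n = len(xs)
    k = min(n, max(2, int(round(span * n))))   # neighbours per local fit
    smoothed = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        bandwidth = dists[k - 1] * 1.0001 + 1e-12
        # Tricube weights: near points count most, points past the bandwidth get zero.
        ws = [max(0.0, 1.0 - (abs(x - x0) / bandwidth) ** 3) ** 3 for x in xs]
        total = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / total
        my = sum(w * y for w, y in zip(ws, ys)) / total
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed

# A deliberately jagged line that zigzags around 50%.
days = list(range(10))
jagged = [0.50 + 0.02 * (-1) ** d for d in days]
print([round(v, 3) for v in loess_smooth(days, jagged)])
```

A larger `span` pulls in more neighbouring points and gives a smoother, less sensitive line, which is the same trade-off the chart's Smoothing tool exposes.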
When polls are sparse, the lines will look a bit different than what we have been producing, but in a good way. The lines we generate will be more reliable and should better represent the underlying data while also bringing the additional statistical properties described above.
That said, there are two minor issues that regular Pollster readers should be aware of and that we will be working on after the election. First, we were not able to get the Kalman filtering routine to run efficiently enough in your browser to drive the Filter tool. But don't worry, you can still filter out any pollster. The chart will just use the same loess process we have used previously, and the filtered results will be roughly comparable, especially when polls are ample.
Second, because the underlying model starts with the covariation between all races, you will see very small changes in the trend estimates for every race (usually no more than a tenth or two of a percentage point) whenever we add a new poll to any race.
So far, I am just describing the process that draws the trend lines. Next, I'll describe how we use the Kalman Filter model to generate the race classifications and probabilities.
*Many thanks to our friends at Public Opinion Quarterly and Oxford Journals for providing a free link to the Green, Gerber and De Boef article.