09/09/2016 05:42 pm ET

Here's How HuffPost Averages The Polls And Figures Out Who's Ahead

It's actually a bit more technical than an "average."

Everybody wants to know what the polls say about Donald Trump vs. Hillary Clinton. 

The approach of the November election brings with it an endless tide of new polling data, which The Huffington Post uses to calculate who’s ahead in the big races. Readers may wonder how the HuffPost Pollster model works and why our estimate of Hillary Clinton vs. Donald Trump isn’t exactly the same as those of other polling aggregators.

Here’s how and why.

The HuffPost Pollster charts for this year’s general election contests estimate the “average” of publicly available polls that meet our criteria for quality polling. HuffPost uses a Bayesian Kalman filter model, which we initially introduced in 2010 and have modified since to reflect the changing polling environment.

Briefly, Kalman filter models combine data that are known to be “noisy” ― or not completely precise ― into a single estimate of the underlying “signal” ― that is, what’s actually happening. For HuffPost, that means the model looks for trends in the polls and produces its best estimate of the polling average.

More technically, what the model calculates is a trend line estimate, not an average as most people think of it. The reason is that a simple average would be very susceptible to the deviations of individual polls. One outlier poll could pull the average in a completely different direction than the rest of the polls indicate. Our algorithm resists this tendency by requiring a trend in the movement of multiple polls in order to change the direction of the aggregated estimate.

That means the model essentially downplays a single poll’s results that deviate substantially from what most of the other polls show. However, if more data come in and indicate that the deviant results are actually part of a trend, the trend line adjusts to accommodate that new information. So the model doesn’t ignore outlier polls; it just downplays them unless it becomes clear that they’re actually the beginning of a trend.
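To make that behavior concrete, here is a minimal sketch of a one-dimensional Kalman filter in Python. It is not HuffPost's actual model: the function name, the variance settings and the poll numbers are all invented for illustration, and the real model is Bayesian and simulation-based. The sketch only shows how a lone outlier moves the estimate a little, while a run of similar polls moves it a lot.

```python
def kalman_trend(polls, obs_var, process_var, init_mean, init_var):
    """Filter one poll reading per day into a running trend estimate.

    obs_var     -- assumed noise variance of a single poll (hypothetical)
    process_var -- assumed day-to-day drift variance of true support (hypothetical)
    """
    mean, var = init_mean, init_var
    trend = []
    for reading in polls:
        # Predict: true support may have drifted since yesterday.
        var += process_var
        # Update: blend the prediction and the new poll by relative certainty.
        gain = var / (var + obs_var)
        mean += gain * (reading - mean)
        var *= 1 - gain
        trend.append(mean)
    return trend


settings = dict(obs_var=9.0, process_var=0.25, init_mean=46.0, init_var=4.0)

# One deviant poll (41.0) surrounded by polls near 46.
one_outlier = kalman_trend([46.0, 45.5, 46.2, 41.0, 46.1, 45.8], **settings)

# The same deviant poll followed by more polls that agree with it.
new_trend = kalman_trend([46.0, 45.5, 46.2, 41.0, 41.2, 40.8, 41.1, 41.0], **settings)

print([round(v, 2) for v in one_outlier])
print([round(v, 2) for v in new_trend])
```

In the first series the estimate dips only modestly on the day of the 41.0 reading and then recovers. In the second, each additional low poll pulls the estimate further down, which is the "beginning of a trend" case.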

In the national Donald Trump vs. Hillary Clinton chart, for example, the one early September CNN poll that showed Trump leading was basically treated as deviant data and didn’t have as much influence on the estimate as other recent polls. But if more polls come in showing Trump ahead or Clinton’s lead narrowing, the trend line will adjust to reflect the GOP nominee closing in.

You can see how this dynamic played out in August: Clinton’s wide lead eroded as many polls found Trump rebounding. Clinton remains ahead by 5 percentage points in early September, but that’s down from an 8-point lead in the middle of August.

The advantage of using a Kalman filter model is that it doesn’t swing wildly in response to outlier polls. The disadvantage is that it’s slower to react to real polling changes than a traditional average. Under normal circumstances, the estimate for each candidate can move up or down about 1.5 percentage points from the previous day’s estimate. But the model allows about a 5 percent chance that a candidate’s numbers could move more than 1.5 percentage points in either direction in a single day. So if four or five polls all showed a dramatic shift in the same direction within a day, the candidate’s estimate could shift by more than 1.5 points.
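One way to read those two numbers together is as a bell-curve assumption about day-to-day movement. The sketch below assumes, purely for illustration, that daily changes are normally distributed around zero; the article does not say which distribution the model uses. Under that assumption, a 5 percent chance of moving more than 1.5 implies a particular daily standard deviation.

```python
from statistics import NormalDist

DAILY_LIMIT = 1.5   # typical maximum daily move, from the article
TAIL_CHANCE = 0.05  # chance of a larger move in either direction, from the article

# Two-sided tail: 2.5 percent above the limit, 2.5 percent below it.
z = NormalDist().inv_cdf(1 - TAIL_CHANCE / 2)

# Standard deviation of the daily change implied by the normal assumption.
daily_sd = DAILY_LIMIT / z

# Sanity check: the chance of exceeding the limit under that standard deviation.
daily_change = NormalDist(mu=0.0, sigma=daily_sd)
chance_big_move = 2 * (1 - daily_change.cdf(DAILY_LIMIT))

print(round(daily_sd, 3), round(chance_big_move, 3))
```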

That means that we currently show a wider race between Trump and Clinton than do simple averages, such as the model used by RealClearPolitics. The HuffPost trend line will favor Clinton when the polls have generally favored Clinton and the polls showing Trump ahead ― or within a couple of points ― appear to be deviations. The model would be similarly sticky on a Trump lead if the polling fairly consistently pointed toward the GOP nominee.  

On a more technical note, because of the calculations the HuffPost model has to make, we don’t run it until we have five or more polls for a particular contest. The model runs 100,000 simulations of the data on each date of polling to find the most likely average, and it needs at least five polls to do that reliably.

The model begins running simulations to calculate a candidate’s estimates on the first date of the first poll. It incorporates the polls available for each subsequent day, pulling in additional surveys as it continues toward the current date — at which time all of the polls (that meet HuffPost’s criteria) are being considered. Newer polls are more influential in a given day’s average than older polls, because older polls are inherently less reliable, more uncertain measures of the current state of the race. But there’s not a specific cutoff date for excluding an older poll. If there aren’t any recent polls, the “error bands” (those shaded regions around the trend lines that show the range within which 95 percent of the simulations fell) will be wider to indicate the uncertainty.
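The widening of the error bands when polls are scarce can be illustrated with a simple random-walk assumption: each day without a new poll adds a fixed amount of variance to the estimate. The function name and all the numbers below are hypothetical, and the actual bands come from the model's simulations rather than from a formula like this one.

```python
import math


def band_half_width(var_at_last_poll, process_var, days_since_poll):
    """Approximate 95 percent band half-width, assuming random-walk drift.

    var_at_last_poll -- variance of the estimate on the day of the last poll
    process_var      -- variance added per day without new data (assumed)
    """
    var = var_at_last_poll + process_var * days_since_poll
    return 1.96 * math.sqrt(var)


# Half-widths on the day of the last poll, one week later, and two weeks later.
widths = [band_half_width(1.5, 0.25, days) for days in (0, 7, 14)]
print([round(w, 2) for w in widths])
```

The longer the gap since the last poll, the wider the band, which matches the qualitative behavior described above.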

The HuffPost model also weights polls by their sample size, because larger samples have smaller margins of error. But any polls with samples above 3,000 are weighted as if they had no more than 3,000 so that the very large surveys that internet pollsters are able to produce don’t dominate the model. 
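The effect of the cap is easy to see in a toy calculation. The sketch below assumes weights directly proportional to the capped sample size, which is a simplification; the article does not spell out the exact weighting formula. The poll figures and function names are made up.

```python
import math

SAMPLE_CAP = 3000  # from the article: larger samples are treated as 3,000


def margin_of_error(n, share=0.5):
    """Textbook 95 percent margin of error, in points, for a simple random sample."""
    return 1.96 * math.sqrt(share * (1 - share) / n) * 100


def weighted_average(polls, cap=SAMPLE_CAP):
    """Average poll results, weighting each by its sample size up to `cap`."""
    weights = [min(n, cap) for _, n in polls]
    total = sum(w * pct for w, (pct, _) in zip(weights, polls))
    return total / sum(weights)


# (candidate's percentage, sample size); the last one mimics a huge internet poll.
polls = [(46.0, 800), (44.0, 1200), (50.0, 20000)]

capped = weighted_average(polls)
uncapped = weighted_average(polls, cap=float("inf"))

print(round(margin_of_error(1000), 1), round(margin_of_error(3000), 1))
print(round(capped, 2), round(uncapped, 2))
```

Without the cap, the 20,000-person survey would drag the average almost all the way to its own result. With the cap, it still counts the most but no longer swamps the other two.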

Once all the simulations have run, the model averages their results to produce an estimate of the percentage support for each candidate (including “other”) on each date, in addition to estimates of the percentage of undecided voters and the margin between the candidates. The probability that the leading candidate actually leads ― which the chart describes as “very likely,” “likely” or “probably” ― is determined based on the margin between the candidates.
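As a rough illustration of that last step, the sketch below summarizes a set of simulated values for one candidate on one date. The simulated values here are fake: they are drawn from a normal distribution with an invented center and spread, standing in for whatever the real model's simulations would produce. Only the summarizing step (an average plus the range containing 95 percent of the simulations) mirrors the description above.

```python
import random
from statistics import fmean, quantiles

N_SIMS = 100_000  # number of simulations per date, from the article

# Stand-in for real simulation output (hypothetical center and spread).
rng = random.Random(0)
draws = [rng.gauss(46.0, 1.5) for _ in range(N_SIMS)]

# The point estimate is the average of the simulated values.
estimate = fmean(draws)

# Cutting the draws into 40 equal groups puts the first and last
# cut points at the 2.5th and 97.5th percentiles.
cuts = quantiles(draws, n=40)
lower, upper = cuts[0], cuts[-1]

print(round(estimate, 1), round(lower, 1), round(upper, 1))
```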

Finally, for our primary election, job approval and other non-election charts, and for the chart customizations on the election charts, we continue to calculate the trend estimates using a loess regression. The principle is similar to the Kalman filter model in that a loess regression uses available polling data to estimate a trend of where the polls are on average, but the loess approach is much simpler and doesn’t rely on thousands of simulations. Instead, it runs a regression analysis on the polls within a certain range of each date to calculate the trend line. More information about that can be found here.
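For readers curious what "a regression on the polls within a certain range of each date" looks like, here is a simplified sketch. It fits a weighted straight line to the polls inside a fixed window of days around the target date, giving nearer polls more weight through the standard tricube function. The window size, function name and data are assumptions for illustration; production loess routines typically choose the window as a fraction of the nearest points and may add robustness steps that are omitted here.

```python
def loess_point(day, poll_days, poll_values, window):
    """Local linear estimate at `day` from polls within `window` days of it."""
    weights = []
    for d in poll_days:
        dist = abs(d - day) / window
        # Tricube weight: 1 at the target day, falling to 0 at the window edge.
        weights.append((1 - dist ** 3) ** 3 if dist < 1 else 0.0)

    total = sum(weights)
    if total == 0:
        raise ValueError("no polls inside the window")

    # Weighted means of the dates and the poll values.
    d_bar = sum(w * d for w, d in zip(weights, poll_days)) / total
    v_bar = sum(w * v for w, v in zip(weights, poll_values)) / total

    # Weighted least-squares slope; fall back to the mean if dates coincide.
    spread = sum(w * (d - d_bar) ** 2 for w, d in zip(weights, poll_days))
    if spread == 0:
        return v_bar
    slope = sum(
        w * (d - d_bar) * (v - v_bar)
        for w, d, v in zip(weights, poll_days, poll_values)
    ) / spread
    return v_bar + slope * (day - d_bar)


# Made-up polls rising steadily by 0.1 point per day from 40 percent.
poll_days = list(range(11))
poll_values = [40 + 0.1 * d for d in poll_days]

trend_line = [loess_point(d, poll_days, poll_values, window=4) for d in poll_days]
print([round(v, 2) for v in trend_line])
```

Because the made-up polls fall exactly on a straight line, the fitted trend reproduces them; with real, noisy polls the same procedure would smooth out the scatter.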