10/15/2013 12:55 pm ET Updated Dec 15, 2013

Flawed Gap Analysis

I have been stewing on this for a long time but the NJ Senate coverage has really solidified it for me.

Almost all of the poll analysis that gets done in the public sphere examines the horse race gap between candidates and nothing else. There is some talk about whether the likely voter models are correct, whether some groups are adequately represented, or whether the overall results are somehow skewed. Far more often than not, the commentary boils down to something like: is Booker up by 20 or 12?

All of that "analysis" perpetuates a fundamental misunderstanding of what a poll might actually say about the expected outcome. The fact that goes unmentioned is that not all undecided voters are created equal. Focusing on the gap assumes that undecideds will either break 50-50 or not vote. Assuming the same outcome for all undecided voters in every poll is superficial and frankly lazy.

Why lazy? Each poll knows things about the undecided voters in it. All you have to do is ask. Who are they voting for up or down ballot? With which party do they affiliate? What kind of opinions do they hold of the candidates in question? Most of these are knowable and yet ignored by those who write about and aggregate polls. This is not a new concept. The DCCC came up with a vote allocator model for the undecideds because they are well aware of the variability that exists among undecideds from cycle to cycle. It was not rocket science but at least it was an attempt to acknowledge the obvious need to take into account the nature of the undecideds.

Maybe taking these facts about undecideds into consideration makes the models for the aggregators break down or requires too much time, but that is their problem to solve. Those are people with far better analytical brains than I have, so I am sure that they could figure it out if they tried. The problem is that no one is urging them to try. Continuing to rate polls and predict winners based on a cursory look at the horse race is everyone's problem. Should poll consumers keep misinterpreting data and drawing erroneous conclusions about the institutions producing those data because our prediction models are careless and perfunctory? That seems wholly unacceptable when the data are there to be had.

I will give you a concrete example. In 2010, I had a dozen polls with the Democratic incumbent up by about 4 points, something like 44-40. In every case, a review of the undecideds (they disliked the incumbent, were voting for the GOP governor, identified as Republican-leaning independents, etc.) told me they were going to break 5-to-1 for the Republican. I knew we would lose all of those races by about seven points, despite the horse race showing a 4-point lead for the Democrat. Gap analysis would conclude those polls had a so-called 11-point bias. Were all the polls that wrong? Did I tell all of the candidates and committees that we had those races in the bag? Of course not, because the data were all there to show me something beyond the toplines. However, if all you saw was the horse race gap, you would walk away not only thinking that the Democrat would win but also concluding, after the election was over, that I had flawed data. Neither was ever the case.
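The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration, not the DCCC allocator or any pollster's actual model. The function name `allocate_undecideds` is made up for this sketch, and it assumes that everyone not choosing the Democrat or the Republican is undecided (16 percent in a 44-40 poll) and that every undecided ends up voting.

```python
def allocate_undecideds(dem, rep, rep_share_of_undecided):
    """Project final vote shares from a poll topline.

    dem, rep: topline percentages for each candidate.
    rep_share_of_undecided: fraction of undecideds expected to break
        Republican (0.5 is the even split that gap analysis implies).
    Assumption: all remaining respondents are undecided and all of them vote.
    """
    undecided = 100.0 - dem - rep
    rep_final = rep + undecided * rep_share_of_undecided
    dem_final = dem + undecided * (1.0 - rep_share_of_undecided)
    return dem_final, rep_final


# Topline from the example: Democrat 44, Republican 40.
dem_poll, rep_poll = 44.0, 40.0

# A 5-to-1 break for the Republican means 5 of every 6 undecideds.
dem_final, rep_final = allocate_undecideds(dem_poll, rep_poll, 5.0 / 6.0)

topline_margin = dem_poll - rep_poll        # Democrat's lead in the poll
projected_margin = dem_final - rep_final    # negative means a Democratic loss
apparent_bias = topline_margin - projected_margin

print(f"Projected result: D {dem_final:.1f}, R {rep_final:.1f}")
print(f"Projected margin: {projected_margin:+.1f}")
print(f"Apparent 'bias' under gap analysis: {apparent_bias:.1f} points")
```

Under those assumptions the Democrat finishes at roughly 46.7 to the Republican's 53.3, a loss of about seven points, and the swing from the 4-point topline lead is the roughly 11 points that gap analysis would label bias. Passing 0.5 instead of 5/6 reproduces the even split that gap analysis takes for granted, which leaves the 4-point lead unchanged.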