The New York 20th congressional district special election is upon us tomorrow. Many commentators, on both sides, have treated it as a referendum of sorts: on Obama, on Michael Steele, on Congress (either side), on the stimulus, the budget, or the war on terror. Take your pick.
In my more parochial world of polling analysis, I tend to see it as a test of how we think about polling aggregation. Should we focus on estimating the current positions of the candidates based on trends, or should we use simple averages of recent polls? RealClearPolitics has always preferred the simple averages, though so far as I can find they haven't posted a NY-20 average. I've always preferred trend estimates because they capture the all-important dynamics of the race. The polling in NY-20 this time presents the difference between these two aggregation approaches rather vividly.
The chart shows the five polls we have for the race along with the trends and the simple averages (as diamonds on the election day line to the right). This is a race in which the trends are very strong and the polls are surprisingly consistent with those trends. The better known political veteran, Jim Tedisco, started out at 50% support to novice Scott Murphy's 29%. That was February 3-4 polling. Since then Tedisco has slipped a little to 43% while Murphy has surged to 47% in the Siena poll done March 25-26. That's a lot of movement for any campaign. This last poll shows Murphy leading for the first time. (The Tedisco campaign claims their internal polls continue to show him in the lead. A Democratic source claims their recent polls show a 2 point Murphy lead. See that part of the story here.)
What I'm interested in is the different conclusions we reach based on simple averages versus trend estimates, and when they may differ from each other.
When a race is stable, with no trend either way, then the trend estimate and the simple average will agree with each other. A "flat" trend is just the average. But when we have sharp trends, as in this race, the two can differ a lot. The simple average of the five available NY-20 polls has Tedisco at 45.6 and Murphy at 37.6. In contrast, the current trend estimate reverses this with Murphy at 47.0 and Tedisco at 42.4. Which should we believe?
What makes the trend persuasive in this case is how consistent the polls have been in showing Murphy's gains and Tedisco's (more moderate) declines. While there are only five polls, the results for both candidates are remarkably close to the trend lines. And the trends aren't affected by dropping any single poll. The NRCC poll done by Public Opinion Strategies in early February is just as close to the trend lines as the DCCC poll done by Benenson in late February; they differ due to the trend, not due to one being out of line with the rest of the data. Likewise the three Siena polls have shown the same close match to the trend line, with a steady gain for Murphy and a slow decline for Tedisco. It is this consistency across polls that makes the trend compelling.
Imagine if we mixed up the order of the polls. Ignoring order we have Tedisco at 45, 50, 43, 46 and 44. An average of 45.6 +/- about 3. Likewise ignoring order we have Murphy at 41, 29, 47, 34 and 37. An average of 37.6 +/- about 9. If this random order were really what we were seeing, we'd be justified in using the simple averages, and would want to comment that while Tedisco's results are within sampling error, Murphy's are considerably more variable than we'd expect. That would be a story of noisy polls, and the best we could do is the simple average (with due note of the noise level.)
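The order-free arithmetic above is easy to check. Here is a minimal sketch in Python using only the five results listed in the paragraph. The "+/-" figure is computed as half the spread between the lowest and highest poll, which is an assumption about how "about 3" and "about 9" were arrived at, not something stated above.

```python
from statistics import mean

# Poll results exactly as listed in the text, with order ignored.
tedisco = [45, 50, 43, 46, 44]
murphy = [41, 29, 47, 34, 37]


def summarize(results):
    """Return (simple average, half of the min-to-max spread).

    Half-spread is an assumed reading of the "+/- about" figures.
    """
    return mean(results), (max(results) - min(results)) / 2


for name, results in (("Tedisco", tedisco), ("Murphy", murphy)):
    avg, half_spread = summarize(results)
    print(f"{name}: {avg:.1f} +/- {half_spread:.1f}")
```

Shuffling either list leaves both numbers unchanged, which is the point: the simple average and the spread around it carry no information about when each poll was taken.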
But these really aren't noisy polls. If we calculate the deviation of polls from trends, we get a surprisingly small range: only 3.4 points for Tedisco and 0.8 points for Murphy. That's another way of saying the points are all really close to the trend lines, especially in Murphy's case. That's very different from the conclusion we'd reach based on the hypothetical unordered sequence of the previous paragraph.
When we have sharp trends like these, the simple average will lag behind the current trend because the average is ignoring the order of polls, while the trend uses that order as a central element for understanding where the race stands. The trend estimate is always an estimate of where the race was when the last poll was taken, while the simple average is an estimate across all polls regardless of date. In the case here, it is clear we get very different conclusions between the two methods of aggregating polls.
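To see the lag concretely, here is a small sketch of a straight-line trend fit compared with a simple average. The numbers are hypothetical, invented only for illustration (they are not the NY-20 polls, whose exact dates aren't all given here), and the helper name `linear_trend` is my own. It uses ordinary least squares, the plain linear fit, not the local regression I normally use.

```python
from statistics import mean


def linear_trend(days, values):
    """Ordinary least-squares line through (day, value); returns (intercept, slope)."""
    d_bar, v_bar = mean(days), mean(values)
    sxx = sum((d - d_bar) ** 2 for d in days)
    sxy = sum((d - d_bar) * (v - v_bar) for d, v in zip(days, values))
    slope = sxy / sxx
    return v_bar - slope * d_bar, slope


# HYPOTHETICAL data: days since the first poll and one candidate's support.
days = [0, 10, 20, 30, 40]
support = [30, 33, 39, 41, 47]

intercept, slope = linear_trend(days, support)

# The trend estimate is the fitted value on the date of the last poll.
trend_now = intercept + slope * days[-1]

# The simple average ignores the dates entirely.
simple_avg = mean(support)

# Deviations of each poll from the trend line.
residuals = [v - (intercept + slope * d) for d, v in zip(days, support)]

print(f"trend at last poll: {trend_now:.1f}")
print(f"simple average:     {simple_avg:.1f}")
print(f"largest deviation from trend: {max(abs(r) for r in residuals):.1f}")
```

With a steadily rising series like this one, the trend estimate at the last poll sits well above the simple average, while every poll stays within about a point of the line. A flat series would give the same answer from both methods.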
With so few polls, the linear fit is more reliable than the local regression trend I normally use. With fewer than about 15 polls the local trend can be affected quite a bit by an outlier or two. But in this case I stretched the point and show the local fits in the chart as well. It is clear that the local trend follows the linear trend quite closely, with a small exception of the Tedisco result for the Benenson/DCCC poll. I'd make nothing of this because there isn't enough data for the local fit to be reliable on its own. The fact that it isn't far from the linear just confirms the obvious: the polls tend to follow a straight line trend quite well.
So this leaves us with two aggregations that predict different results tomorrow night. The trend estimates have the race tighter, and have Murphy ahead. The averages reverse that order with a wider margin for Tedisco. It is certainly a stretch to say that Murphy is a lock. There are just five polls, even if the trend is strong. And my colleague Mark Blumenthal posted here last week on the difficulties of polling in special elections. I tend to side with experienced candidates over novices, so I'd give the benefit of the doubt to Tedisco on that score. But the trends are quite consistent and point to at least a close finish and a modest advantage for Murphy. We'll see tomorrow night.
(You can also check the NY-20 race in our usual interactive chart here.)