06/30/2010 12:41 pm ET Updated Dec 06, 2017

Rivers: Random Samples and Research 2000

Douglas Rivers is president and CEO of YouGov/Polimetrix and a professor of political science and senior fellow at Stanford University's Hoover Institution. Full disclosure: YouGov/Polimetrix is the owner and principal sponsor of

I am, like most in the polling community, shocked by the recent accusations of fraud against Research 2000. Marc Grebner, Michael Weissman, and Jonathan Weissman convincingly demonstrate that something is seriously amiss with the research reported by Research 2000, which may well be due to fraud.

But some of the claims by the critics, such as Nate Silver's post this morning on (as well as part of the Grebner et al. analysis), exhibit a common misunderstanding about survey sampling: "random sampling" does not necessarily mean "simple random sampling." I do not know what Research 2000 did (or claimed to do), but very few surveys actually use simple random sampling.

To recapitulate Nate's argument: if you draw a simple random sample of size 360 from a population of 50% Obama voters and 50% McCain voters, the day-to-day variation in the Obama vote percentage in the sample should be approximately normal, with mean 50% and standard deviation 2.7%. (Nate gets this by simulating 30,000 polls and rounding the results, but most students in introductory statistics would just calculate the square root of 0.5 x 0.5 / 360, which is about 2.6%.) This would give you the blue line in Nate's first graph, reproduced below.
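The textbook calculation is easy to check. The following is a minimal Python sketch of that arithmetic, not Nate's simulation; the function name `srs_standard_error` is hypothetical and chosen only for illustration.

```python
import math


def srs_standard_error(p: float, n: int) -> float:
    """Standard deviation of a sample proportion under simple random sampling.

    p is the population share (0.5 for a 50/50 race) and n is the sample size.
    """
    return math.sqrt(p * (1.0 - p) / n)


# A 50/50 population polled with 360 interviews per day.
srs_se = srs_standard_error(0.5, 360)
print(f"SRS standard deviation: {srs_se:.1%}")
```

The result is roughly 2.6%, in line with the figure quoted above.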

However, what happens if the poll is not a simple random sample? Suppose (and this is entirely hypothetical) that you polled off of a registration list composed of 50% Democrats and 50% Republicans (to keep things simple, let's pretend there are no independents). Further, suppose that 90% of the Democrats support Obama and 90% of the Republicans support McCain, so it's still 50/50 for Obama and McCain in the population. Instead of drawing a simple random sample, we draw a "stratified random sample" with 180 Democrats and 180 Republicans each day. That is, we draw a simple random sample of 180 Democrats and a simple random sample of 180 Republicans and combine them. What should the distribution of daily poll results look like?

I should caution that there is a little math in what follows, but nothing hard. The variance (the square of the standard deviation) of each subsample is 0.90 x 0.10 / 180 = 0.0005. The combined sample mean is just the average of these two independent subsamples. Averaging halves each estimate, which quarters its variance, so the variance of the average is (0.0005 + 0.0005) / 4 = 0.0005 / 2 = 0.00025. The standard deviation is the square root of 0.00025, or approximately 1.6%, not the 2.6% that Nate thought it should be. This distribution is shown in the figure below as a green line, which is a lot closer to the suspicious red line in Nate's graph, showing the Research 2000 results.
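The stratified calculation can also be checked numerically. The sketch below is only an illustration of the hypothetical design described above (180 Democrats and 180 Republicans per day, 90% party loyalty). It assumes each respondent is an independent draw from a very large stratum, and the function names, the seed, and the choice of 5,000 simulated days are all assumptions made for the example.

```python
import math
import random
import statistics


def stratified_se(p_dem: float, p_rep: float, n_dem: int, n_rep: int) -> float:
    """Analytic standard deviation of the pooled Obama share.

    p_dem and p_rep are the Obama shares among Democrats and Republicans.
    The two strata are sampled independently, so their variances add.
    """
    n = n_dem + n_rep
    count_variance = n_dem * p_dem * (1 - p_dem) + n_rep * p_rep * (1 - p_rep)
    return math.sqrt(count_variance) / n


def simulate_daily_shares(p_dem, p_rep, n_dem, n_rep, days, seed=0):
    """Simulate the daily Obama share under the stratified design."""
    rng = random.Random(seed)
    shares = []
    for _ in range(days):
        obama = sum(rng.random() < p_dem for _ in range(n_dem))
        obama += sum(rng.random() < p_rep for _ in range(n_rep))
        shares.append(obama / (n_dem + n_rep))
    return shares


# Hypothetical design from the text: 90% of Democrats and 10% of
# Republicans support Obama, with 180 interviews per party per day.
analytic_sd = stratified_se(0.9, 0.1, 180, 180)
shares = simulate_daily_shares(0.9, 0.1, 180, 180, days=5000)
simulated_sd = statistics.pstdev(shares)

print(f"analytic SD:  {analytic_sd:.1%}")
print(f"simulated SD: {simulated_sd:.1%}")
```

Both numbers should come out near 1.6%, well below the 2.6% expected from a simple random sample of the same size.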

Does this absolve Research 2000 of fraud? Of course not. There are other factors (such as weighting) that usually increase the variability, so Nate is right that the Research 2000 results look suspicious. But we should be a little more cautious before convicting upon the basis of this sort of evidence.