Far more of the scientific papers being published, written about and acted on contain errors than anyone would normally suppose, or like to think.
Studies, data and research are finding their way into our headlines at an ever-increasing rate: Google Trends shows an enormous spike over the past two years in how often the term "new study" is mentioned online in news articles and on the Web.
It's a striking increase, with strong seasonality (during the US summer, it appears, we don't talk about new studies very much) -- the output of hardworking academics, executives and their public relations flacks. What do hardworking PR people do during the summer, you might ask, if they don't have new studies to push? Well, it seems they increasingly enlist graphic designers to recycle all those statistics into (mostly awful) half-page-wide by four-page-high "infographics". Probably not what Edward Tufte had in mind.
Google Trends shows very little corresponding increase in discussion of terms like "statistically significant", "false positive" or "false negative", even though you might expect one given the growth in the underlying output of new statistical information. While far more discussion of studies, data and infographics is flying around, numerate analysis of it all seems to be in short supply.
This data no longer comes just from academia: enterprises that generate a lot of "big data" as a side-effect of their operations are now making a concerted push to publish data and studies too. A problem arises, however, when those companies take the slice of the market they see (e.g. analysis of tweets posted by companies using their publishing platform) and claim it as representative of the whole market.
Last week I wrote an opinion piece in Business Insider about methodology issues with a recent analyst report. An analyst at a leading research firm (who should know better) had his PR team aggressively pitching a story that called into question the efficacy of a marketing channel, based on satisfaction "data" that wasn't statistically significant -- even if you granted that the phrasing of the survey questions was valid (which was questionable).
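To make "not statistically significant" concrete, here is a minimal sketch of the kind of check a skeptical reader can run on a claimed difference between two survey groups. The numbers are hypothetical (not from the report in question): 52% vs. 48% "satisfied" in two groups of 150 respondents each, compared with a standard two-proportion z-test.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1, x2: counts of "successes" (e.g. satisfied respondents)
    n1, n2: group sample sizes
    Returns the z statistic and the two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis of no difference
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical survey: 78/150 (52%) satisfied vs. 72/150 (48%) satisfied
z, p = two_proportion_z_test(78, 150, 72, 150)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With these sample sizes, a four-point gap yields a p-value far above the conventional 0.05 threshold, so the "difference" is well within sampling noise -- exactly the sort of result a press release can still spin into a headline.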
Here are thoughts on why some of this pseudo-data is getting traction and how we can be more critical of it:
Publishing is cheaper than ever. The cost of pushing this information out on a blog or website is close to zero. More people can instantly "publish", often cutting out the disclaimers that researchers DO choose to include in their research. Big publishers are also increasingly asking contributors to blog for them, lending their brands and imprimatur to those contributors.
Viral loops and quick insta-referencing. It's never been easier to prime the echo chamber. Facebook, Twitter, Reddit, Buzzfeed and even LinkedIn are enormous sources of article sharing, and are the way some people get all their news. Journalists tell me they know a lot of people retweet headlines without even reading the articles. The lowest common denominator can win: the most-referenced, not necessarily the best-written or most accurate, articles rise to the top.
More science going into headline-writing than ever. CNN was reportedly looking to acquire Mashable in 2012, and among Mashable's assets, some claimed, was a second-to-none methodology for finding and promoting the catchiest headlines. Techmeme now writes its own headlines because it knows better than the article authors themselves how to generate clicks.
Research details and methodology often hidden from public view. Airbnb trumpeted the highlights of a study it had funded from HR&A Advisors, which estimated $632 million of economic impact from Airbnb activity in New York, but as far as I'm aware (and I asked on Twitter!) neither company has made the full study available to the public.
Similarly, analyst firms often offer to send a full study to the reporters they approach to write about it, while simultaneously selling it to businesses for hundreds or thousands of dollars. Most journalists don't have the time to debunk shoddy research, yet they're often the only ones given the chance. If a report is put into the public eye, especially by public relations teams pushing it to journalists to write about, the whole report should be made public, if not also the underlying datasets.
Not wanting to waste data. This is a subtle but insidious issue, especially among companies that pay for consumer survey research. Firstly, many surveys are poorly written, and subtle wording shifts can entirely alter the results. Secondly, if the data fits our pre-existing hypothesis, confirmation bias suggests we'll go with it rather than "waste" it. It's also another reason we'll continue, unfortunately, to see the same data recycled again and again in infographics.
Numeracy. For whatever reason(s), we adult Americans have our challenges when it comes to math and statistics. The OECD ranked the US third from the bottom on "numeracy" in its 2013 Programme for the International Assessment of Adult Competencies (PIAAC).
There could be a commercial opportunity for third-party verification services that offer laypeople some comfort about the veracity of published statistics. Think of it as a "ratings agency" for statistics, or perhaps more aptly a "warning label" or guide for the impressionable. The trends suggest we need something like this if we're not to drown in bad research, questionable statistics and trumped-up findings.