All Is Not Well (in the World of Big Data Analysis)

07/01/2015 05:41 pm ET | Updated Jul 01, 2016

The "big data" revolution set out to give us the tools needed to exploit the endless growth of data, whether inside companies, government agencies, or in open source. With it came increasing demands for analysis and insights, begging IT departments to make good on the promises of the revolution. If there's one thing the last few years have proven, it's that the challenges we're facing aren't merely technological--in fact, they rarely are these days--but rather reflect issues in how we frame the questions we're asking, how we apply the right techniques and technologies to answer those questions, and, in general, how we fail as humans to correctly interpret and act on results.

As one might expect, governments are at the heart of trying to figure out what to do as things get harder, not easier. Whether it's trying to predict foreign election results, track down terrorists, prosecute insider trading, or understand what's happening on the ground after a natural disaster, the overwhelming amount of data available is paralyzing. Combine this with the data analog of the CSI effect--where policy makers and citizens alike expect brilliant insights and concrete answers, beautifully visualized, in moments--and the endless troves of data are more of a curse than a blessing. Technological innovation has delivered so many amazing capabilities and tools that today's core problems are now about asking the right questions, understanding gaps, and interpreting results. Really effective analysis combines the brilliant technologists and cutting-edge code we all recognize with human understanding, social science research, philosophy, and mission expertise.

Whether you are voraciously reading tweets or sifting through the world's print media, it isn't always apparent which sources are trustworthy and which are unreliable or satirical. It can be embarrassing when unreliable content is taken seriously--think The Onion and Kim Jong Un. When the most likely open sources were the New York Times and academic journals, these problems were present, but far easier to overcome. Keeping track of sources, once a matter of following a handful of publications and their journalists, has gone from manageable to insane. Just this month, a man in London sarcastically tweeted about a battle in Shichwa, Iraq, that never occurred, drawing claims of victory, worried responses, and analysis from ISIS supporters and opponents on the ground in Iraq. Shichwa isn't even a real place: he named his fake battle after the local word for "cheese bladder." Perhaps more interestingly, the battle wasn't just celebrated and lamented, it was enriched by other users with battlefield maps, strategic analysis, and even photos, adding superficial credibility and content to an entirely fictitious skirmish.

CIA Director Brennan identified this as a core concern for the world's greatest consumer of data, the US Intelligence Community (IC), saying, "we have to understand what is the reliability, the integrity of the data." Achieving that understanding in an automated way and managing those assessments over time will be one of those seemingly intractable challenges that the IC must eventually overcome, but no easy answers are available.

Even if the data is present, easily found, and verified in its authenticity, the challenge lies in how to use it properly--and further, what proper use even looks like. Facing a variety of sources, with a multiplicity of critically important civil liberties and privacy protection regimes, it's all too easy to cherry-pick. These kinds of biases can range from the subtle--where the human nature of an employee leads to the selection of the easiest sources, or the ones with the least privacy or classification protections--to the insidious, where analysts select sources that support their view of a situation, ignoring dissenting opinions. It is horrifyingly easy for analysts to wade into big data holdings and find information to support whatever argument they're trying to make.

In the end, high-quality analysis and conclusions come from trusted analysts, employing vetted sources in sound ways. The volume of social media and open source data combined with scary technology terms makes these challenges seem new and hard, but government agencies around the world already deal with aspects of these issues every day. Today's data reliability problems demand that agencies innovate, finding novel ways to pair analyst experience and expertise with automation, overcoming the velocity, volume, and variety of data they see every day.