10/10/2014 10:57 pm ET Updated Dec 09, 2014

Innovations in Science: Managing the 'Billions and Billions' of the Data Deluge

It's pretty safe to say that, unless you've been living in a cave for the past 10 years, your daily actions and behavior have contributed to big data at some point, in some capacity or other.

The term "big data" seems to have evolved over time. The appearance of the term in an October 1997 article by Michael Cox and David Ellsworth, "Application-controlled demand paging for out-of-core visualization," is certainly one of the first.

But big data is by no means new. Science research has been generating and publishing large volumes of data sets for well over 150 years. Noted astrophysicist Carl Sagan alluded to numerous examples of it in his final book, Billions and Billions, published well over 15 years ago. But the development of the Internet and the shift from an offline world to an online world opened the floodgates, creating a data deluge that continues to grow at an astounding rate. In the latest Digital Universe study, technology analyst IDC reveals that the digital universe is expected to double every two years and will multiply 10-fold between now and 2020 -- from 4.4 trillion gigabytes to 44 trillion gigabytes. To give you a sense of scale, if you imagine each byte of data as equal to one inch, it would be like 1 million round trips between Earth and Pluto.
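As a quick sanity check on those two growth figures, the short sketch below works out how long a 10-fold increase takes if the volume doubles every two years. The variable names are my own and the inputs are only the numbers quoted above, so treat it as back-of-the-envelope arithmetic rather than anything taken from the IDC study itself.

```python
import math

# Figures quoted in the paragraph above (trillions of gigabytes).
start_volume = 4.4
end_volume = 44.0
doubling_period_years = 2  # "expected to double every two years"

# How many doublings does a 10-fold increase require?
growth_factor = end_volume / start_volume
doublings_needed = math.log2(growth_factor)

# Time needed at one doubling every two years.
years_needed = doublings_needed * doubling_period_years

print(f"Growth factor: {growth_factor:.0f}x")
print(f"Doublings needed: {doublings_needed:.2f}")
print(f"Years needed: {years_needed:.1f}")
```

Under these assumptions a 10-fold increase takes a little over three doublings, or roughly six and a half years, so the two claims are broadly consistent with each other.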

Big data has certainly arrived in a big way, but big insights on effective management of it all lag behind. In my mind, big data has reached two frontiers. The first is privacy. The recent decision by the European Court of Justice on the "right to be forgotten" set off a firestorm of debate. Does the public really have a right to know everything? Or does the right to privacy trump freedom of information? Scientific research is disputed on a regular basis, and those disputes are considered part of the discovery process at large, but what if a researcher wanted to delete all references to heavily disputed papers in order to protect his or her reputation? Should this be allowed? What policies must be developed to balance big data collection and reputation? Is that even possible?

The second frontier is the analysis and management of research data to unleash the power of the information that it holds. A recent editorial in Big Data Research argues that the true value of big data lies in knowledge derived from analysis. By its very nature, scientific big data is self-perpetuating. Researchers generate multiple data sets as a byproduct of their own research, and those data sets are then used and cited in other research. Digital information solutions providers like Elsevier, my employer, sit on vast databases of high-quality scientific, technical and medical research content that has been collected, curated, aggregated, disseminated and published for more than a century. With so much scientific big data available -- and increasing as we speak -- finding the right information is akin to looking for a needle in a pile of needles. Without sophisticated analytics and a framework to provide meaning or context, a gigantic pile of data is just that: a pile of data, not particularly useful in itself.

Sources of research information are also expanding and include social media streams, images, audio and video files and crowdsourced data. We're also seeing bigger files (seismic scans can be 5 terabytes per file) and massive numbers of smaller files (email, social media, etc.), all potentially valuable data when processed alongside other appropriate and relevant sources. New capture, search, discovery and analysis tools can provide insights from the increasing pools of unstructured data, which account for more than 90 percent of the digital universe. It is crucial, then, for those of us in scholarly publishing to help researchers find relevant data quickly through smart collection tools, recommended reading lists and data banks that offer a variety of sort and search applications.

Intelligence around data has been part of the information and communications technology at the core of business intelligence (BI) and data warehousing applications for over 20 years. However, BI is limited in scope and is typically thought of as retrospective: it deals only with structured data and analyzes what has already happened. Big data, by contrast, can be prospective. Through the use of advanced analytics and predictive tools, it projects potential outcomes by studying structured data as well as an ever-expanding pool of unstructured data from multiple sources, and the results can then be shared across collaborative platforms to find meaningful correlations. It can reveal such insights as:

  • "This happened because of..." (i.e., diagnostic analytics)
  • "What will happen if...?" (i.e., predictive analytics)
  • "This can happen / can be avoided by doing..." (i.e., prescriptive analytics)

For example, a study from the McKinsey Global Institute estimates that if the U.S. healthcare sector can exploit and process the vast ocean of electronic information at its disposal, such as data from clinical trials and research experiments alongside insurance data, it has the potential to improve the efficiency and effectiveness of U.S. health care by more than US$300 billion a year. Strategically administered, then, big data can yield previously unknown but useful information and insights that spark ideas and drive new discoveries, fueling the cycle of academic research.

Recent big data management and processing systems, such as High-Performance Computing Cluster (HPCC) Systems and Apache Hadoop, are able to correlate and analyze large and varied types of data sets. Developed by LexisNexis Risk Solutions, HPCC is used to solve complex data and analytics challenges by combining proven processing methodologies with proprietary linking algorithms that turn data into intelligence that can be applied to improving the quality of research outcomes. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. The harvesting of big data by such methods enables information providers to facilitate information exchanges that create opportunities for serendipitous discoveries by breaking down the discipline silos. In short, information about information is gold.
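To give a feel for what a "simple programming model" can look like, here is a minimal sketch of the map-and-reduce idea commonly associated with Hadoop, written in plain Python on a single machine. It does not use the Hadoop API or any cluster; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample sentences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as a framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single total."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data", "big insights"]  # made-up input for illustration
    print(word_count(sample))
```

The appeal of the model is that each map call looks at one document and each reduce call looks at one key, so in principle a framework can spread that work across many machines without the author of these two small functions having to manage the distribution.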

As with all new technology, we do not and cannot yet know everything that big data solutions will achieve. As the century advances, the "billions and billions" of data will quickly grow into "trillions and trillions." When we talk about big data, then, it is worth saying plainly that big data is itself a journey of discovery.

Carl Sagan loved discoveries. I like to think he would have been pleased.