
The Big Data Nemesis Simplified

10/09/2012 06:25 pm ET | Updated Dec 09, 2012

Big Data is one of the hottest buzzwords in Information Technology, judging by its one-year hockey-stick curve in Google Trends compared with other hot IT buzzwords. In fact, per Google Trends, its interest level is about 40% of that of "Cloud Computing," which is quite impressive since interest in Big Data has really only built over the last year, while "cloud" is another overused buzzword with a five-year-plus trend history. An interesting Google Trends factoid is that India and South Korea show the highest interest in Big Data, with the USA a distant third. So, all of the Big Data vendors should now focus on India and South Korea, and leave my email inbox clean.

I hate to admit that I added to the Big Data hype when I blogged on my Top 3 Technologies to Tame the Big Data Beast. In that blog, I provided several key statistics to show why Big Data had become such a hot term in IT, although interest has really exploded since that time. My blog followers have asked me for two things in any follow-up blog on Big Data: 1) some means or tools to help them better understand their Big Data problem; and 2) more details on how "cloud" helps them solve Big Data problems, since I only provided two or three sentences on that in the first blog. Here is my attempt to address these two items, and help simplify the Big Data Nemesis.

The Five C's of Big Data

There are many "Five C's" tools available to help one better understand complex (and not-so-complex) domains in the world, including credit, diamonds (I see my wife was successful in officially having "Cost" removed as one of the C's), language learning, leadership, marketing, cinematography, and, of course, the five C's from the Arizona secretary of state. So, not to be outdone, here are my "Five C's of Big Data" to help simplify understanding a Big Data problem.

Capacity refers to the total amount of data that needs to be processed or searched. In Big Data environments, capacity is usually measured in hundreds of terabytes, petabytes, and possibly even exabytes or (gasp!) zettabytes in the most extreme Big Data environments (e.g., maybe a Google).

Content refers to the size of individual content or data objects. Sometimes Big Data issues may be caused by a relatively small number of very large data objects. Some domains report individual data objects running to hundreds of gigabytes; a few thousand of those, and you have a real Big Data problem.
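To put a rough number on that, here is a back-of-the-envelope sketch. The figures (3,000 objects at 200 GB each) are my own assumptions for illustration, not measurements from any real domain, and the helper name is made up for this post.

```python
def total_terabytes(object_count, object_size_gb):
    """Total size in decimal terabytes (1 TB = 1,000 GB)."""
    return object_count * object_size_gb / 1000


# Assumed, illustrative figures: 3,000 objects of 200 GB each.
total = total_terabytes(object_count=3000, object_size_gb=200)
print(f"{total:.0f} TB")
```

Under those assumptions, a few thousand large objects alone put you in the hundreds-of-terabytes range described under Capacity.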

Collection refers to the degree of disparity versus similarity of objects in a collection of sources. The more diverse the objects in a collection of sources, the more complex the processing environment needs to be to properly correlate data across the sources.

Celerity is a bit of a stretch for a 'c-word,' but it refers to the rate at which data is updated and/or timeliness requirements of data. Performing near-real-time trending analysis of tweets is a different problem than batch-processing for entity disambiguation, and would demand a different framework.

Capability refers to the data exploitation requirements for a collection of data. What you are trying to get from the data often determines the framework needed.

Unfortunately, five C's just aren't enough to capture everything, but they cover some of the more common decision criteria. As a bonus, here are four more C's for the price of five!

Confidentiality refers to data security requirements. Not all Big Data repositories provide cell-level security.

Confidence refers to being able to track how confident you can be that automated analytics have made proper assertions.

Consanguinity is a stretch in the use of the word, since it typically means blood relationship as in a family, but it is used here to refer to the pedigree and lineage of data sources. In some Big Data problems, it is essential to be able to track the lineage and pedigree of data across the enterprise; this requirement significantly exacerbates a Big Data problem.

Connectivity refers to the degree to which the data sources are available to be accessed. If the connectivity factor of a data source is low, meaning that the source is likely not to be available at all times, then the Big Data framework must be able to handle this additional concern.

Big Data Versus the Data Cloud

There are several "use cases" or types of cloud deployments (I promise a future blog on this topic), but the data cloud is currently the most complex in terms of the number of software components and the decision criteria you must analyze to determine which data cloud framework(s) to deploy. The following paragraphs attempt to demystify the data cloud by describing some of the leading frameworks available and why these frameworks can provide a solution for Big Data problems. It is important to note that, unlike other types of clouds (specifically, utility and storage clouds), data clouds are more often private or community clouds than public clouds.

Data clouds require analytics framework(s) to make sense of the mass of information they hold. Analytics run complicated algorithms for data source correlation, data efficacy, entity disambiguation, relationship identification, trends, etc. However, the framework often decides how an analytic performs its magic.

Apache Hadoop is open-source software for reliable, scalable, distributed computing. Its framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is used to process petabytes of data by some of the best-known web companies, including Yahoo, LinkedIn, Facebook, Twitter, AOL, Adobe, and others. Hadoop is most often used for batch processing of large amounts of data, but the open source project also offers streaming capabilities, and commercial offerings are available that enhance those capabilities.
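For readers wondering what a "simple programming model" looks like, here is a minimal sketch of the map, shuffle, and reduce steps behind a word count. It is plain single-machine Python, not Hadoop's actual API; the function names and sample records are mine, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["big data big hype", "data cloud"])
print(counts)
```

The appeal of the model is that the map and reduce functions know nothing about clusters; a framework can run many copies of each across machines and handle the grouping in between.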

Twitter's Big Data problems were more than just allowing its users to track the whereabouts and happenings of Lady Gaga and Justin Bieber. Twitter also provides Twitter Trends by user location, so users know what is HOT in their area in near real-time. Twitter acquired Storm in its 2011 BackType acquisition and now provides Storm as a free and open-source distributed real-time computation system, in addition to possibly using it to power its own Twitter Trends. Whereas Hadoop is most often used for processing large, complex data sources in batch mode, Storm is most often used for analyzing smaller data streams in real-time.
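To make the batch-versus-stream contrast concrete, here is a minimal sketch of the kind of sliding-window counting a real-time trending feature might do. This is my own illustration in plain Python, not the API of any streaming framework and not how Twitter Trends is actually built; the class name and sample events are hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import Counter, deque


class TrendingWindow:
    """Counts topics seen within the last `window_seconds` of a stream."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, topic), oldest first
        self.counts = Counter()  # topic -> occurrences inside the window

    def add(self, timestamp, topic):
        """Record one event, then drop anything that has aged out."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n):
        """Return the n most frequent topics currently in the window."""
        return self.counts.most_common(n)


window = TrendingWindow(window_seconds=60)
for ts, topic in [(0, "gaga"), (10, "bieber"), (20, "bieber"), (70, "storm")]:
    window.add(ts, topic)
print(window.top(2))
```

Each event updates the answer as it arrives and old events fall away, which is the opposite of a batch job that waits for all the data before computing anything.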

Several data cloud repositories exist and each provides unique benefits; for the sake of space on this blog, I cover three Big Data repositories that are available as open source. Apache Cassandra was created by Facebook and is still used by companies like Netflix. Cassandra is a key-value data repository that provides a rack-aware, highly available service with no single point of failure. Apache HBase is an open-source, distributed, scalable, column-oriented key-value store modeled after Google Bigtable. It provides data versioning and is capable of handling billions of rows by millions of columns. Apache Accumulo is an open-source, sorted, distributed key/value store that provides robust, scalable, high-performance data storage and retrieval. One of its key differentiators is that it offers secure, labeled access at the cell level. Accumulo was developed by a U.S. Intelligence Agency and has been the recent topic of Congressional inquiry on its use within the U.S. Intelligence Community.
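Since cell-level security is the least familiar idea in that list, here is a minimal sketch of what it means in a sorted key/value store. It is a toy in plain Python, not Accumulo's API: the class, rows, and labels are hypothetical, and I simplify by treating a cell's labels as a set the reader must hold in full, whereas Accumulo's visibility labels can be boolean expressions.

```python
from bisect import insort


class LabeledStore:
    """Toy sorted key/value store where every cell carries its own labels."""

    def __init__(self):
        self.cells = []  # kept sorted as (row, column, value, labels)

    def put(self, row, column, value, labels=()):
        """Insert one cell, keeping the list sorted by row, then column."""
        insort(self.cells, (row, column, value, tuple(sorted(labels))))

    def scan(self, authorizations):
        """Yield only the cells whose labels the reader fully holds."""
        auths = set(authorizations)
        for row, column, value, labels in self.cells:
            if set(labels) <= auths:
                yield row, column, value


store = LabeledStore()
store.put("person2", "name", "Bob")
store.put("person1", "salary", "52000", labels={"hr"})
store.put("person1", "name", "Alice")

public_view = list(store.scan(authorizations=[]))
hr_view = list(store.scan(authorizations=["hr"]))
print(public_view)
print(hr_view)
```

The point is that two readers can scan the same row and see different columns, because the access decision is made per cell rather than per table or per row.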