Unlike people, not all data is created equal.
That's one reason why Yahoo! engineers created Hadoop, which the company built out in 2008 as the world's largest Hadoop production application. At the time, Yahoo!'s search webmap ran on "10,000 core Linux clusters."
This year, Facebook, a contributor to the Apache Hadoop open source community, surpassed 100 petabytes of storage, making its cluster the world's largest. But can Facebook leverage its data and monetize it quickly, in real time, while keeping advertisers happy?
We know search (mobile) and communication (text) habits are changing at a breakneck pace. See Facebook's missteps in mobile. Yet the disruption isn't being carried out by startups, but by consumers who are rapidly embracing their newfound freedom: light, on-the-go mobile technologies.
But making data fertile, accessible, and actionable means making it useful, seamless, and visible in real time. That's no easy task. It's harder still in the ecommerce space of small-to-medium-sized businesses (SMBs). Pulling insight from unstructured data, with big data's three Vs -- volume, velocity, and variety -- is a steep challenge for new and legacy enterprises alike.
It is also the reason why Hadoop has become an integral part of an ecosystem of new databases and languages. And why Infochimps, Inc., an emerging leader in big data, has placed it in the center layer of its Platform-as-a-Service (PaaS) solution.
Infochimps' Unique Platform in the Cloud
When I sat down with Infochimps' CEO Jim Kaskade and CTO/cofounder Philip "Flip" Kromer at the Strata Big Data Conference in New York City -- a week before Hurricane Sandy cut power to several downtown datacenters -- I saw the yin and yang of a rising technology company. Three subtle driving forces are at work: business development, market opportunity, and engineering know-how. Together they align the two executives' different personalities and bridge the "gaps" between technology and business, empowering Infochimps to grow inside previously empty verticals.
Hadoop, for all the praise it earns within the tech industry, has a complex developer experience and a massive operational burden. Its batch-oriented nature keeps it from spitting out key data droplets from an ocean of information in real time.
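The gap between batch and real-time processing is easier to see in a toy example. The sketch below is plain Python, not Hadoop code or any Infochimps tooling, and the event data and function names are invented for illustration. The batch function waits for the whole data set before answering, while the streaming function updates a running answer as each event arrives.

```python
from collections import Counter

# Hypothetical event stream: (timestamp_seconds, hashtag). Illustrative data only.
EVENTS = [(1, "#sandy"), (2, "#strata"), (3, "#sandy"), (4, "#sandy"), (5, "#strata")]

def batch_counts(events):
    """Batch style: wait for the full data set, then compute the answer once."""
    return Counter(tag for _, tag in events)

def streaming_counts(events):
    """Streaming style: update running totals and emit a snapshot per event."""
    totals = Counter()
    for ts, tag in events:
        totals[tag] += 1
        yield ts, dict(totals)

snapshots = list(streaming_counts(EVENTS))
print(snapshots[0])                                      # answer after the first event
print(snapshots[-1][1] == dict(batch_counts(EVENTS)))    # final answers agree
```

Both approaches reach the same totals in the end. The difference is that the streaming version has a usable answer after the first event, while the batch version has nothing to report until the job finishes.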
Recognizing the opportunity in the coming tidal wave of data created by mobile applications, Infochimps created a comprehensive big data toolset to manage it quickly and efficiently, since it had a small team and a massive amount of data under management.
"We had to solve expert problems, but the only folks we could afford to hire back in 2009 were sophomore CS students. So we built a platform that turned sophomore CS students into expert data engineers. As we've learned, it can also turn PhD statisticians and app developers into expert data engineers," said Kromer.
He architected the system around a cloud orchestration layer, a data infrastructure layer, and an application infrastructure layer. Infochimps' cloud services include Dashpot™, an application intelligence solution for BI data visualized on dashboards; Ironfan™, the resource orchestration solution based on Chef; and Data Delivery Service™ (DDS), for data integration and real-time analytics.
Hadoop and NoSQL databases sit in the center tier of Infochimps' new PaaS system, and are managed by Ironfan.
"Where did the name Infochimps come from?" I asked.
Flip Kromer deferred to CEO Jim Kaskade, who came on board only 60 days ago to take Infochimps from a post-market startup and build it into a high-growth big data cloud company. Having served as an Entrepreneur-in-Residence at PARC, a Xerox company, where he led the big data program, Mr. Kaskade is no ex-marketing guru from a Fortune 500 company. That's a bonus, as he can grasp the science behind Infochimps and position it for market growth.
"Infochimps' name came from Aristotle's metaphor -- that if you had an infinite number of chimps typing on a typewriter, one day they would surely reproduce Hamlet," Kaskade explained. "This is a metaphor for an infinite amount of compute resource (a.k.a. the cloud) analyzing your data to produce statistically significant insights. We operate under this thesis to turn any data into insights for any user or any enterprise on their own terms."
"We're big data for chimps," Kromer added.
As the CTO explained his background in studying computer science and physics at Cornell University and UT-Austin, Mr. Kaskade sketched a series of flow charts and schematic diagrams, showing where Infochimps falls in the big data competitive matrix, its paths to market, and the disruption between old versus new domains.
On the latter, the CEO described the old database model as "'schema on write,' which can take up to two years to launch as part of an Enterprise Data Warehouse effort involving logical and physical data modeling, heavy ETL (Extract, Transform, Load) compared to 'schema on read' with Hadoop, which can take as little as one month to deploy using the Infochimps Big Data Cloud." It's the classic tortoise versus hare race. Such gains in productivity are only one benefit to the user.
This example shows why legacy architecture is giving way to the new, elastic, pay-as-you-go infrastructure of the cloud.
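For readers who want to see the "schema on write" versus "schema on read" contrast in miniature, here is a rough sketch in plain Python. It assumes a handful of made-up JSON events and a hypothetical three-field schema. None of it comes from Infochimps or Hadoop. It only illustrates the idea that one approach enforces structure before storing data, while the other stores raw data and applies structure at query time.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (illustrative only).
RAW_EVENTS = [
    '{"user": "ann", "action": "click", "ms": 120}',
    '{"user": "bob", "action": "view"}',
    '{"user": "cy", "action": "click", "ms": "95", "device": "mobile"}',
]

# Schema on write: a record must match this fixed (assumed) schema to be stored.
WRITE_SCHEMA = {"user": str, "action": str, "ms": int}

def load_schema_on_write(lines):
    """Validate every record up front; anything that does not fit is rejected."""
    stored, rejected = [], []
    for line in lines:
        record = json.loads(line)
        fits = set(record) == set(WRITE_SCHEMA) and all(
            isinstance(record[key], kind) for key, kind in WRITE_SCHEMA.items()
        )
        (stored if fits else rejected).append(record)
    return stored, rejected

def click_latencies_on_read(lines):
    """Schema on read: keep raw lines, impose structure only when asking a question."""
    latencies = []
    for line in lines:
        record = json.loads(line)
        if record.get("action") == "click" and "ms" in record:
            latencies.append(int(record["ms"]))  # coerce the type at read time
    return latencies

stored, rejected = load_schema_on_write(RAW_EVENTS)
print(len(stored), "stored,", len(rejected), "rejected")
print(click_latencies_on_read(RAW_EVENTS))
```

In this toy case the strict loader accepts only one of three records, and the other two would need modeling and ETL work before they could be stored. The read-time query answers its question from all the raw lines that carry the fields it needs.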
"We're enabling enterprises to leverage multiple data sources with our cloud services," Kaskade said. "For example, one is sourced from API-driven social media, while the other is sourced from existing databases within the enterprise. We're seeing many use-cases where structured and unstructured data are combined to improve insight."
Harnessing the Power of Twitter and LinkedIn
In a recent TechCrunch blog post, Infochimps' product manager Tim Gasper tells us that Infochimps will be the first to offer Storm and Kafka, the best-in-class streaming analytics platforms, as an enterprise service. Storm and Kafka are open-source projects from Twitter and LinkedIn, respectively.
"Ultimately, Storm and Kafka form the best enterprise-grade, real-time ETL and streaming analytics solution on the market today," Gasper wrote. "Our goal is to put the same technology that Twitter uses to process over 400 million tweets per day -- in your hands. Other companies that have adopted Storm in production include Groupon, Alibaba, the Weather Channel, FullContact, and many others.
"Storm and Kafka are also great at in-memory analytics, and real-time decision support. Companies are quickly realizing that batch processing in Hadoop does not support real-time business needs."
Big Data's Post Media Platform
In another schematic, Mr. Kaskade drew a column running from the old media of TV and radio down to the new media of Twitter, Facebook, and blogs. It showed how the mash of complex data is processed by activities, authors, topics, clients, and campaigns, then run through Hadoop and NoSQL databases and APIs into a developer-friendly environment that can scale.
With real-time data and never-before-seen trends uncovered, companies large and small will save money by a factor of 10 and have data that's actionable in an evolving stream of information, whether it's project-, brand-, or consumer-based.
On a final graph, Mr. Kaskade showed where Infochimps sits in the current landscape of big data/predictive analytics companies, with a continuum from real-time to batch on the vertical axis, and enterprise to SMB on the horizontal axis. He said, "In five years, Infochimps will grow to address both batch and real-time, and expand from addressing the needs of Enterprise Fortune 1000 into the SMB space using our cloud services. We believe that companies like Oracle and Teradata will find it difficult to compete due to time-to-market and price pressures associated with the cloud."
Database technology is going through a revolution. To that end, Flip Kromer, who will soon release his book, Big Data for Chimps, said, "People are important, robots are cheap."