A: There was a lot going on in 2008:
- Microsoft tried to buy Yahoo! in February 2008. This offer catalyzed discussions between Google (Christophe Bisciglia) and Facebook (me) about how to ensure Hadoop development would continue if Microsoft were to buy Yahoo! and end their investment in Hadoop. The first version of Cloudera was discussed with Christophe, Mike Abbott, and me as the founding team and Accel as the VC. Eventually Mike decided to stay at Microsoft and I decided to stay at Facebook. But we kept talking...
- The first Hadoop Summit was held in March 2008. I had been attending Hadoop meetups around the Bay Area for a little more than a year and it seemed that the community was only a few dozen people. The Hadoop Summit had over 400 attendees, though, and drew people from around the country. I was impressed by the scale of interest in the technology.
- Teradata released the 2500 series of "low cost" data warehouse appliances in April 2008. Their "low cost" appliance was still $125k/TB! I figured my costs for a Hadoop cluster were easily 1/10th that and could probably be squeezed down to 1/100th that in a year or two.
- We contributed Hive to the Hadoop project in June 2008. Hive was a proof-of-concept that you could build a data warehouse on top of HDFS and MapReduce. It was ugly but it worked, and for clusters bigger than around 30 nodes it was actually better than anything else we piloted.
- Microsoft acquired DATAllegro in July 2008. A number of shared-nothing distributed database vendors focused on the data warehouse market got going between 1999 and 2005, including Netezza, Greenplum, Aster Data, and Vertica. DATAllegro was the first to exit, and the price (rumored to be $275M) was higher than most expected. I ran a pilot with every one of these vendors and realized they were immature technologies that couldn't scale. The reference provided for me by one vendor had never installed their software; another vendor's product corrupted data in the middle of a very simple benchmark workload; and a third crashed on a table name longer than 256 characters. And none were thinking about programmability and non-tabular data.
- Oracle released Exadata in September 2008. I piloted this product when it was called "Sage". Oracle, the largest database vendor in the market, had focused for years on a shared-disk approach to scale-out with Oracle RAC. The release of Exadata was a sign that shared-nothing was the right approach for the future.
The confluence of all of these signals made me believe that there was an opportunity to build a low-cost data management vendor that could handle more kinds of data, a higher volume of data, and a more complex workload than just SQL queries. The vision is just starting to reach the market with and putting an expressive Python interface on top of , a high-performance distributed query engine, and
- Open source software. We will spend over one hundred million dollars this year on compensation for developers who write Apache-licensed (and often Apache Software Foundation-governed) open source software. If you believe that software is eating the world, you may also believe that open source software is a significant public good. If you're concerned about corporations owning strong AI, you should probably also be concerned with corporations owning the means to store and analyze large volumes of data. By making our core platform open source under a non-copyleft license, Cloudera ensures that any entity in the world can have access to the most powerful tools for data management and analysis at scale.
- : we provide course material and software licenses for free to universities that would like to use our software in their classroom activities.