Hadoop is an open-source software framework that facilitates the storage and analysis of large volumes of data. Inspired by Google papers on MapReduce and the Google File System and managed by the Apache Software Foundation, Hadoop boasts a list of contributors that reads like a who’s who of high-tech luminaries. Facebook, Yahoo, Amazon, Adobe, Twitter, IBM and Baidu are some of its pioneering users.
So what’s all the fuss about? For one thing, it operates on commodity hardware. Instead of having to buy pricey storage arrays from the likes of EMC, those with large quantities of data can deploy it on cheap x86 servers.
“In striving to store petabytes of data, Hadoop favors storage vendors that produce relatively small, cheap storage devices, not the large-scale data storage vendors,” said James Dixon, Chief Geek at Pentaho, an open-source BI vendor that supports Hadoop.
If eating into EMC’s hardware business isn’t enough, it might also pull the rug out from under Oracle’s dominance of the relational database management system (RDBMS) market. Those who use Oracle or other large database and data warehousing solutions from the likes of IBM, Teradata and SAP face licensing schemes that add costs based on the volume of data being stored and analyzed. As data generation continues to explode, the expense of these solutions is forcing users to look at alternative approaches.
Take the case of a gaming company that had been using Oracle for many years. Once traffic reached the 100 million to 1 billion impressions per day range, Oracle hit the wall. Even with licensing costs spiraling higher, the RDBMS could only analyze four days of information at a time.
“Given Oracle is trying to corner the market, if Hadoop can get some attention and articulate its story and where they fit, the sky is the limit,” said Greg Schulz, an analyst with StorageIO Group.
Part of the reason for the excitement is the massive expansion of unstructured data in recent years: blogs, Web pages, email, Word documents, audio, video and text messages. Databases like Oracle are designed for structured data and work well for online transactional processing (OLTP) and online analytical processing (OLAP). Hadoop essentially fills in the gaps left by traditional database and business intelligence (BI) tools by enabling rapid consolidation and analysis of both structured and unstructured data.
Hadoop’s File System
Hadoop itself has two main elements. The Hadoop Distributed File System (HDFS) handles distribution and redundancy of files and enables logical files that far exceed the size of any one data storage device. So HDFS allows you to store a file of many terabytes on a collection of commodity drives. Servers can be added or subtracted rapidly as loads dictate. And the file system is happy to take a mix of servers from different vendors.
“HDFS is designed for commodity hardware, anywhere from five to 10,000 nodes,” said Dixon. “It is written in Java, so there are multiple levels of abstraction above the physical storage layer.”
This means Hadoop sits above the storage networking hardware and software, the operating system and traditional file systems. Dixon said that if there is support in Java for any given file system, Hadoop will support it.
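To make that picture concrete, the short Java sketch below writes and reads a file through the Hadoop FileSystem API. It is only an illustration of the client-side view; the namenode address and file path are hypothetical placeholders, not details of any deployment discussed here.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; "hdfs://namenode:9000" is a
        // placeholder address, not a real deployment.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS splits it into blocks and replicates them
        // across data nodes, so the logical file can be far larger than
        // any single drive.
        Path file = new Path("/data/events/log.txt");
        try (OutputStream out = fs.create(file)) {
            out.write("one line of event data\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the client sees one continuous file regardless of
        // how many nodes the blocks actually live on.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}

From the application’s point of view the file is a single continuous stream, even though HDFS has scattered its blocks across many commodity machines.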
The other element of Hadoop, MapReduce, handles the parallel data processing. This is what differentiates Hadoop from other distributed file systems: it isn’t just a passive storage system; it can both store and process data.
“MapReduce is the Hadoop job execution system,” said Amr Awadallah, co-founder and CTO of Cloudera, a provider of Hadoop software and services. “It can run on other distributed file systems, but it works best with HDFS, as it is more tightly integrated with it — there are a lot of cross-optimizations for achieving data locality and fault-tolerance.”
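To illustrate the programming model itself, rather than any specific workload mentioned in this article, the canonical word-count job below shows the two phases at work: a map function that runs in parallel on the nodes holding the input blocks, and a reduce function that sums the intermediate counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on the nodes that hold the input blocks,
    // emitting a (word, 1) pair for every word it sees.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives every count emitted for a given word and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // Input and output paths come from the command line; both would
        // normally live on HDFS.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The framework schedules the map tasks next to the data, shuffles the intermediate pairs to the reducers and retries failed tasks, which is where the data locality and fault tolerance Awadallah describes come from.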
Cloudera and Pentaho Build on Hadoop
Mike Karp, an analyst with Ptak, Noel & Associates, cautions that open-source software is by its very nature a double-edged sword: cheap to implement, but adequate support can be hard to find, especially in the early stages of adoption.
“Most of where the support would come from, after all, is a group of volunteers; as a result, companies are often nervous about doing open source code with business-critical applications,” said Karp. “The good news, of course, is that these volunteers are often inspired to write great code, and there’s plenty of evidence in the past that open-source projects have achieved great success.”
That’s where companies like Cloudera and Pentaho come in. Their business model is built around taking top-notch open source software and supplying the bells and whistles that make businesses trust it in an enterprise environment. Cloudera provides a Hadoop-based data management platform for the enterprise. Its founding team came from Web companies such as Facebook, Google and Yahoo. It offers services, support and training, and its largest production deployment is more than 4.5 petabytes running on more than 500 servers.
“Cloudera is interesting because it intermediates between the open source world and the users in much the same way that Red Hat or SUSE do in the Linux world, which is to say it provides much greater assurance in terms of support for Hadoop in critical environments,” said Karp. “This will be particularly important as cloud and locally virtualized environments make flexible data processing more important.”
Business Intelligence
One of Hadoop’s primary value propositions is the customer value it can unlock. Early adopters like Google, Amazon and Facebook used it to tap the enormous value buried in the massive amounts of data they collected. By using analytical techniques to comb through data at volume, these companies deliver a better customer experience: on-target search results, more interesting products, better content and more precisely targeted ads.
“We are seeing a lot of traction in financial services, government, telecom, research institutions and other markets where a lot of data is involved,” said Awadallah. “Credit card companies, for instance, are using it for things like fraud detection.”
Top management, too, is beginning to realize the potential locked in data that sits stored and unused within the enterprise. Hadoop appears to be the right tool at the right time, allowing organizations to triangulate what people are doing on their sites so they can do a better job of turning prospects into customers, offering them what they want in a timely manner, spotting trends and reacting to them in real time.
While Cloudera offers an enterprise-ready distribution of Hadoop, Pentaho supplies integrated BI tools based on the Lucene/Solr open source search technology. How deep users can drill down depends entirely on whether the lowest levels of data are stored in a format that can be queried. Drilling down is typically not hard with a BI application based on transactional data. Digging into systems based on blog, social media or telco data is a different story, however. Some of these data sets run to billions of records per day; they are too large for most relational databases and too expensive to store at that scale. The traditional workaround has been to aggregate the data as it comes in and, unable to store it all, throw the raw data away.
“Hadoop provides the ability to store and process volumes of data, but lacks graphical interfaces for loading, transforming, modeling or visualizing this data,” said Dixon. “Pentaho provides these other capabilities and will enable more companies to use Hadoop to solve large-scale data problems.”
He sees this as good for the storage networking industry as a whole: as more companies use Hadoop, the amount of data stored will increase. However, it might eventually hurt sales at the high end of the array market.