Download the authoritative guide: Enterprise Data Storage 2018: Optimizing Your Storage Infrastructure
Hadoop is an open-source software framework that facilitates the storage and analysis of large volumes of data. Inspired by Google papers on MapReduce and the Google File System and managed by the Apache Software Foundation, Hadoop boasts a list of contributors that reads like a who's who of high-tech luminaries. Facebook, Yahoo, Amazon, Adobe, Twitter, IBM and Baidu are some of its pioneering users.
So what's all the fuss about? For one thing, it operates on commodity hardware. Instead of having to buy pricey storage arrays from the likes of EMC, those with large quantities of data can deploy it on cheap x86 servers.
"In striving to store petabytes of data, Hadoop favors storage vendors that produce relatively small, cheap storage devices, not the large-scale data storage vendors," said James Dixon, Chief Geek at Pentaho, an open-source BI vendor that supports Hadoop.
If eating into EMC's hardware business isn't enough, it might also pull the rug out from Oracle's dominance of the relational database management (RDBMS) market. Those who use Oracle or other large database and data warehousing solutions from the likes of IBM, Teradata and SAP face a licensing scheme that adds costs based upon the volume of data being stored and analyzed. The continuing explosion of data generation means that the expense of using these solutions is forcing users to look at alternative approaches.
Take the case of a gaming company that had been using Oracle for many years. Once traffic reached the 100 million to 1 billion impressions per day range, Oracle hit the wall. Even with licensing costs spiraling higher, the RDBMS could only analyze four days of information at a time.
"Given Oracle is trying to corner the market, if Hadoop can get some attention and articulate its story and where they fit, the sky is the limit," said Greg Schulz, an analyst with StorageIO Group.
Part of the reason for the excitement is the massive expansion of unstructured data in recent years, such as blogs, Web pages, email, Word documents, audio, video and texting. Databases like Oracle are designed for structured data. They work well for online transactional processing (OLTP) and online analytical processing (OLAP), which address large quantities of structured data. The design of Hadoop essentially fills in the gaps left by traditional database and business intelligence (BI) tools by enabling rapid consolidation and analysis of structured and unstructured data.
Hadoop's File SystemHadoop itself has two main elements. The Hadoop Distributed File System (HDFS) handles distribution and redundancy of files and enables logical files that far exceed the size of any one data storage device. So HDFS allows you to store a file of many terabytes on a collection of commodity drives. Servers can be added or subtracted rapidly as loads dictate. And the file system is happy to take a mix of servers from different vendors.
"HDFS is designed for commodity hardware, anywhere from five to 10,000 nodes," said Dixon. "It is written in Java, so there are multiple levels of abstraction above the physical storage layer."
This means Hadoop sits above the storage networking hardware and software, the operating system and traditional file systems. Dixon said that if there is support in Java for any given file system, Hadoop will support it.
The other element of Hadoop, MapReduce, takes care of the parallel data processing functions. This element differentiates Hadoop from other distributed file systems in that it isn't just a passive storage system, it is active in the sense that it can both store and process data.
"MapReduce is the Hadoop job execution system," said Amr Awadallah, co-founder and CTO of Cloudera, a provider of Hadoop software and services. "It can run on other distributed file systems, but it works best with HDFS, as it is more tightly integrated with it — there are a lot of cross-optimizations for achieving data locality and fault-tolerance."