Gartner lists Big Data (BD) as one of its “Top 10 Strategic Technologies.”
“Big data is a topic of growing interest for many business and IT leaders, and there is little doubt that it creates business value by enabling organizations to uncover previously unseen patterns and develop sharper insights about their businesses and environments,” says David Newman, research vice president at Gartner.
According to Deloitte, “more than 90 percent of the Fortune 500 will likely have at least some BD initiatives under way” by the end of this year, at a cost of $1.3 billion to $1.5 billion. But those figures just bring up the billion-dollar question: what exactly is “Big Data”?
“If you think the term ‘Big Data’ is wishy-washy waste, then you are not alone,” says Forrester analyst Mike Gualtieri. “Many struggle to find a definition of Big Data that is anything more than awe-inspiring hugeness.”
He says that Big Data can be considered in terms of the volume, velocity and variety of data, but that “there is no specific volume, velocity, or variety of data that constitutes big.”
“One organization’s Big Data is another organization’s peanut,” says Gualtieri. “It all comes down to how well you can handle these three Big Data activities”: the storage, processing (cleansing, enriching, calculating, analyzing) and querying of the data.
Effectively managing Big Data to obtain the desired business results, therefore, requires a rethinking of every aspect of the way that data is stored and used. “Big data disrupts traditional information architectures — from a focus on data warehousing (data storage and compression) toward data pooling (flows, links, and information shareability),” says Newman.
Here are some of the approaches storage vendors are taking to create Network Attached Storage (NAS) products that meet the needs of Big Data.
While most NAS vendors use disks for storage, Crossroads Systems, Inc. of Austin, TX, also uses tape as an online storage mechanism in its StrongBox systems.
“Crossroads StrongBox is an online all-the-time, fully portable data vault for long-term data retention. It leverages Linear Tape File System (LTFS) technology, providing the first-ever enterprise-level, file-based tape archive solution with built-in data protection and self-monitoring for optimized performance at a significantly reduced cost,” says Senior Product Manager, Debasmita Roychowdhury. “It also incorporates disk for fast file storage and retrieval, and physical tape for cost effective, long-term, reliable capacity storage.”
Crossroads has two rackmount versions. The 1U T1 box supports a file transfer rate of 160 MB/s and the 3U T3 has a file transfer rate of 600 MB/s. Both models use a mix of SATA disks and LTO5 tape drives. But unlike tape archive systems that require IT involvement to restore files to disk when users need access, the StrongBox system includes the tape files as part of the same file systems as those on disk, and end users can access them directly. The files on tape just take a bit longer to access than those stored on disk.
“Tape is transformed into an online all-the-time, easily accessible file system,” says Roychowdhury. “Multiple access points can engage StrongBox simultaneously and will be presented with a unified, persistent view of the data vault. This minimizes IT dependency and allows users to experience real-time, online access.”
This system uses policies and self-monitoring to manage the storage tiering and disk caching to speed file transfer to and from tape. By utilizing a mix of tape and disk, it can achieve up to an 80 percent cost reduction over a similar capacity disk-only NAS. As storage needs grow, additional StrongBoxes can be installed, auto-discovered and added to the file system. “Data grows exponentially – budgets don’t,” says Roychowdhury. “Organizations can now invest in a cost-effective solution that seamlessly scales as their archive grows from 500,000 files to 5 billion.”
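A policy of the kind described above can be pictured as a rule that decides, file by file, whether data stays on the disk cache or lives on tape only. The following is a minimal sketch, not Crossroads' actual policy engine: the `FileRecord` type, the `choose_tier` and `plan_tiering` functions, and the 30-day threshold are all hypothetical, chosen only to illustrate simple age-based tiering.

```python
from dataclasses import dataclass

# Hypothetical threshold: files untouched for this many days are kept on tape only.
COLD_AFTER_DAYS = 30


@dataclass
class FileRecord:
    path: str
    days_since_access: int


def choose_tier(record, cold_after_days=COLD_AFTER_DAYS):
    """Return "disk" for recently used files and "tape" for cold ones."""
    if record.days_since_access < cold_after_days:
        return "disk"
    return "tape"


def plan_tiering(records):
    """Map each file path to the tier this illustrative policy would pick."""
    return {r.path: choose_tier(r) for r in records}
```

In a system like the one described, both tiers would sit behind the same file system, so a user would see the same path either way; only the access time would differ.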
In November 2010, EMC spent $2.25 billion to acquire scale-out NAS vendor Isilon Systems of Santa Clara, CA. Isilon’s storage clusters use OneFS, a fully-symmetric file system that has no single point of failure and allows from 18 TB to more than 15 PB of data and up to a trillion files to be managed in a single namespace.
“OneFS allows a storage system to grow symmetrically or independently as more space or processing power is required—providing a grow-as-you-go approach and the ability to scale-out as your business needs dictate,” says Brian Cox, Sr. Director Product Marketing. “Nodes can be added to the file system and be ready to use in minutes—versus a traditional file system which can take hours to install, configure and provision.”
Cox says that while Big Data was initially limited to specific industries such as life sciences, media and entertainment, or Web 2.0, it is now finding broader application in traditional business computing.
“Today, the clear delineations that have existed between Big Data vertical industry requirements and enterprise IT requirements have now blurred to the point that they are no longer distinguishable,” he says. “The simple fact is that these two worlds are rapidly converging, creating a need for a fundamentally different way to meet the storage needs that enterprises will have going forward.”
But that convergence doesn’t mean that the exact same systems would be used. Cox explains that it is critical to match the storage system to the business and storage needs. For example, if an organization’s file and unstructured data needs are growing slowly and will stay under 50 TB for the foreseeable future, he recommends going with a scale-up design such as EMC’s VNX unified storage. But if file and unstructured data needs are over 50 TB and growing fast, then the organization should choose a scale-out architecture such as EMC’s Isilon storage.
“Customers need to select the right tool for the right job and thus need to understand the profile of their workload over time,” says Cox.
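Cox's rule of thumb can be written down as a small decision function. This is only a sketch of the guidance as quoted, not an EMC sizing tool: the function name `recommend_architecture` is hypothetical, and treating 50 TB as the dividing line is an assumption exposed as a parameter. Cases the guidance does not cover, such as a large but slow-growing data set, are deliberately left undecided.

```python
def recommend_architecture(projected_tb, growing_fast, threshold_tb=50.0):
    """Apply the quoted rule of thumb to a projected capacity in terabytes."""
    # Small and slow-growing: a scale-up design is the suggested fit.
    if projected_tb < threshold_tb and not growing_fast:
        return "scale-up"
    # Large and fast-growing: a scale-out design is the suggested fit.
    if projected_tb > threshold_tb and growing_fast:
        return "scale-out"
    # Mixed signals: the guidance gives no answer, so profile the workload.
    return "profile workload further"
```

For example, `recommend_architecture(20, growing_fast=False)` yields `"scale-up"`, while `recommend_architecture(200, growing_fast=True)` yields `"scale-out"`.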
HP’s IBRIX X9000 Storage product family uses a “pay-as-you-grow” modular architecture that allows customers to gradually purchase and centrally manage storage — up to 16 PB in a single namespace — as their needs grow.
“This highly scalable and economical file storage infrastructure serves as an effective archive for the HP ecosystem of Big Data solutions that includes Apache Hadoop for batch analytics on unstructured data, HP Vertica for structured real time analytics and HP Autonomy for meaning-based computing,” says Stephen Bacon, Senior Manager of NAS Product Management in Fort Collins, Colorado. “It is complemented by HP’s Information Management and Analytics consulting practice plus technology implementation and support services.”
When selecting Big Data file storage, Bacon says that HP recommends customers look at (1) the requirements of their workloads, (2) the roadmap and pace of innovation for each vendor’s offering, (3) the economics of each vendor’s offering including whether they are modular with all-inclusive features and (4) the ecosystem of solutions and services that each vendor enables.
“‘Pay-as-you-grow’ modular architecture enables customers to avoid storage over-provisioning and manage costs,” he says. “All inclusive features for data protection, data retention, and data mobility including tiering ensure no hidden expensive add-ons.”
Hitachi Data Systems
Fred Oh, Sr. Product Marketing Manager, NAS Product Line for Hitachi Data Systems (HDS), says that while the term Big Data may be new, it is an outgrowth of developments dating back a decade.
“Pharmaceutical companies, eDiscovery events, Web scale architectures and other industries/activities have long deployed HPC-like infrastructures affording users the ability to answer big questions from Big Data,” says Oh. “What is different now? Simply, what is possible now comes from infrastructure convergence (e.g. petabyte-scale storage, 4+ node clustered systems, and analytics) that is more efficient and cost effective, meaning more companies can take advantage of Big Data architectures and tools.”
The Hitachi NAS Platform (HNAS) is designed to integrate seamlessly with Hitachi SAN storage through the Hitachi Command Suite for unified management. Rather than an appliance, HNAS is a hardware-accelerated NAS solution based on Field Programmable Gate Arrays (FPGAs). The hardware acceleration makes possible a 1.6 Gbps throughput for sequential applications and up to 200,000 IOPS per node. The system can scale up to 16 PB of usable capacity, with a file system size of 256 TB and 16 million file system objects.
In evaluating a Big Data NAS, Oh recommends looking at certain criteria, including:
- Enterprise-class performance and scalability
- Clustering beyond 2 nodes with a Single Namespace
- Hardware acceleration for network and file sharing protocols
- Large volumes and file system support
- Dynamic provisioning (aka Thin provisioning)
- Object-based file system supporting fast metadata searches
- Policy-based intelligent file tiering and automated migration (internal and external)
- Capacity efficient snapshots and file clones
- Virtual servers to simplify transition and migration from other filers
- Content-aware and integrated with backup and active archiving solutions
- Symmetric active-active storage controllers
- Unified virtualized storage pool supporting block, file and object data types
- Page-based dynamic tiering
But getting the right hardware in place is just one small piece of achieving the promise of Big Data.
“HDS believes Big Data to not be about technologists salivating at a new gold rush,” says Oh, “but about the promise of everyday people interacting with confidence in technologies to answer questions that may require analyzing enormous quantities of data to make their work and, ultimately, society a better place.”
Panasas, Inc. of Sunnyvale, CA, makes two 4U rackmount appliances, ActiveStor 11 and ActiveStor 12, each containing 20 SATA drives of 2 TB or 3 TB. They can scale up to 6.6 PB and 4 billion objects per file system.
“Big Data requires a new approach to software and storage in order to derive value from highly unstructured data sets,” says Panasas Chief Marketing Officer, Barbara Murphy. “Buyers who consider integrated platforms with software, storage and networking provided within a single, appliance-type system will find them exponentially easier to deploy and manage than DAS or do-it-yourself clusters, with considerable benefits in terms of reliability at big-data scale.”
The ActiveStor appliances use a parallel storage architecture and the Panasas PanFS parallel file system, and can achieve 1.5 GB/s from a single ActiveStor 12, or up to 150 GB/s per file system. Rather than using traditional hardware RAID controllers, PanFS performs object-level RAID on the individual files.
“Unlike other storage systems that loosely couple parallel file system software like Lustre with legacy block storage arrays, PanFS combines the functions of a parallel file system, volume manager and RAID engine into one holistic platform, capable of serving petabytes of data at incredible speeds from a single file system,” says Murphy. “This is accomplished while avoiding the management burden and reliability problems inherent in most competitive network attached storage systems, making Panasas storage ideal for private cloud computing and dedicated storage environments alike.”
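To make the idea of per-file RAID concrete, the sketch below stripes a single file's bytes into chunks, computes one parity chunk, and rebuilds a lost chunk from the survivors. It is an illustration only: the article does not say which RAID levels or layouts PanFS uses, so a simple single-parity XOR scheme is swapped in here, and the function names are hypothetical.

```python
def split_into_chunks(data, n_data):
    """Split one file's bytes into n_data equal chunks, zero-padding the last."""
    size = -(-len(data) // n_data)  # ceiling division
    return [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(n_data)]


def xor_parity(chunks):
    """Compute the byte-wise XOR of equally sized chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)


def rebuild_missing(surviving_chunks, parity):
    """Recover the single missing chunk from the survivors plus parity."""
    return xor_parity(list(surviving_chunks) + [parity])
```

Because the parity is computed per file rather than per disk block, a rebuild in a scheme like this only has to touch the files that were actually affected by a failure.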
The SGI NAS is a unified storage solution composed of 2U and 4U building blocks containing SSD, SAS and/or SATA drives in sizes and speeds to meet customers’ needs. It uses a 128-bit file system, so it can hold up to 2^48 entries. The file system is unlimited in terms of storage capacity, but the largest single namespace file system shipped to date is 85 PB.
“It can address 1.84 x 10^19 times more data than 64-bit systems such as NTFS,” says Floyd Christofferson, Director of Storage Product Marketing for SGI in Fremont, California. “The limitations of ZFS are designed to be so large that they would never be encountered.”
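The 1.84 x 10^19 figure follows directly from the difference in address width: a 128-bit address space is 2^(128-64) = 2^64 times larger than a 64-bit one. The short check below reproduces the number; the function name `address_space_ratio` is just an illustrative helper.

```python
def address_space_ratio(wide_bits, narrow_bits):
    """Return how many times larger the wider address space is."""
    return 2 ** (wide_bits - narrow_bits)


ratio = address_space_ratio(128, 64)
print(ratio)           # 18446744073709551616
print(f"{ratio:.2e}")  # 1.84e+19
```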
SGI NAS includes full VM integration, even in mixed vendor environments. With support for multiple NAS and SAN protocols, standard features include inline de-duplication and native compression, unlimited snapshots and cloning, unlimited file size, and high-availability support. The NAS can be administered through a browser-based GUI from any desktop or tablet.
“You want to select a NAS solution that enables straight-forward integration into legacy storage environments and ensures that data is not trapped within expensive siloed arrays,” says Christofferson. “The solution must be flexible enough to not only manage continued data expansion but also not lock their organization into one type of storage architecture. Data will only continue to grow, so it’s important to select products that have a strong potential for boosting ROI.”