Big data tools collect, store, organize, and analyze large amounts of data to extract useful information. The sheer volume of data stored by enterprises has mushroomed since organizations began to see value in unstructured data.
Previously, key organizational data was gathered within highly structured databases.
But the rise of virtualization, social media, streaming data, object storage, and other innovations resulted in far more data being available in unstructured repositories than previously existed within relational databases.
Platforms such as open source Hadoop burst onto the scene as a way to capture and organize all this data, enabling enterprises to mine it for insight and subject it to types of analysis that had never before been possible.
However, Hadoop is far from the only platform used for large quantities of unstructured data. Vendors across the storage ecosystem evolved ways for their existing systems to accommodate this much capacity.
As the extent of unstructured data grew, the term “big data” was coined to differentiate it from earlier storage concepts. Initially, startups ruled the unstructured space, but acquisitions and in-house development have led to a few providers dominating the big data arena.
Also read: Top Data Management Platforms & Systems 2021
Key Features of Big Data Tools
What are the minimum features for a big data storage platform?
- Performance, in terms of fast application response times and reduced run times.
- Availability and superior application resiliency.
- Simplicity of deployment and operation.
- Affordability and flexibility to grow or contract with operational needs.
- Built-in analytics or the ability to easily feed data sets to analytics engines.
Top Big Data Tool Vendors
Enterprise Storage Forum evaluated various vendors in the big data tools and software space. Here are our top picks, in no particular order:
StorCentric Violin QV-Series
StorCentric Violin QV-Series is a simple, fast, and affordable high-performance NVMe storage platform for big data storage and analytics, scaling to hundreds of terabytes or even petabytes. For those using Hadoop in a batch process to create reports, the iterative nature of that process means faster I/O allows more iterations per day, yielding useful business information sooner.
- Accelerated log file analysis for faster analytics in Splunk.
- Real-time I/O for NoSQL databases such as Mongo or Cassandra.
- I/O performance for multiple Hadoop iterations.
- Data reduction by volume.
- Synchronous replication.
- vCenter plug-in.
- Active/active controllers.
- Matrixed RAID data allocation with up to 24 hot-swappable NVMe SSDs.
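To make the iteration argument concrete, here is a rough back-of-envelope model (all numbers illustrative, not Violin specifications) of how read throughput changes the number of full batch passes that fit in a working day:

```python
def iterations_per_day(data_tb: float, read_gbps: float, compute_hours: float) -> int:
    """Rough count of full batch iterations that fit in 24 hours,
    modeling each iteration as one complete read of the dataset
    plus a fixed compute phase. Purely illustrative arithmetic."""
    # GB / (GB/s) -> seconds -> hours for one full pass over the data
    io_hours = (data_tb * 1024) / read_gbps / 3600
    return int(24 // (io_hours + compute_hours))

# Illustrative: a 100 TB dataset with 1 hour of compute per pass
slow = iterations_per_day(100, read_gbps=2.0, compute_hours=1.0)   # disk-era throughput
fast = iterations_per_day(100, read_gbps=20.0, compute_hours=1.0)  # NVMe-class throughput
print(slow, fast)  # the faster array fits many more passes into a day
```

The point is simply that when I/O time dominates each pass, multiplying read throughput multiplies the number of report iterations an analyst can run per day.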
Oracle Big Data Service
Oracle Big Data Service is a Hadoop-based data lake to store and analyze large amounts of raw customer data. A managed service, Oracle Big Data Service comes with a fully integrated stack that includes both open source and Oracle tools that simplify IT operations. It makes it easier for enterprises to manage, structure, and extract value from organization-wide data.
- Oracle offers big data services as part of Lake House to help data professionals manage, catalog, and process raw data.
- Object storage and Hadoop-based data lakes for persistence, Spark for processing, and analysis through Oracle Cloud SQL or the analytical tool of choice.
- Oracle Cloud Infrastructure Data Flow is a managed Apache Spark service with no infrastructure for IT teams to deploy or manage. It lets developers deliver applications faster.
- Oracle Autonomous Data Warehouse is a cloud data warehouse service that eliminates the complexities of operating a data warehouse, securing data, and developing data-driven applications.
- Oracle Cloud Infrastructure Object Storage enables storage of any type of data in its native format.
- Oracle Cloud Infrastructure Data Catalog helps data professionals search, explore, and govern data using an inventory of enterprise-wide data assets.
- Oracle Cloud Infrastructure Data Integration extracts, transforms, and loads (ETL) data for data science and analytics.
Pure Storage FlashArray
Pure Storage offers two arrays suitable for big data use cases. The FlashArray//X is aimed at high performance while the FlashArray//C is the high capacity version. It’s a case of which attribute is favored in the enterprise, or required more by the application and environment. These arrays serve needs ranging from departmental to large-scale enterprise deployments. They provide performance, reliability, and availability for both block and file.
- Evergreen Storage eliminates upgrade cycles, downtime and rebuys of TBs already owned. The Evergreen Storage subscription model offers rapid upgrades and expansion without disruption.
- FlashArray//X is an all-flash, end-to-end NVMe and NVMe-oF array.
- FlashArray//C is more about consolidating workloads with consistent all-flash NVMe performance and data protection.
- Always on deduplication, compression, and thin provisioning.
- Low-latency performance for all workloads, including sub-1 ms latency for mission-critical workloads, latency as low as 150 µs for extreme database workloads with storage-class memory, and consistent 2-4 ms latency for capacity-optimized workloads.
- 99.9999% availability as well as business continuity and global disaster recovery.
- Consistent data portability between on-premises and public cloud storage and applications to minimize complexity and simplify interoperability.
Dell EMC PowerMax
Dell EMC PowerMax is the high-end storage offering from the massive Dell storage portfolio. The Dell EMC PowerMax family offers high levels of performance and scale using next-generation Storage Class Memory (SCM) and high-speed SAN infrastructure. It offers the feature set required for demanding big data applications.
- Consolidate block, file, and mainframe workloads on one array.
- Inline deduplication and compression for guaranteed data reduction of almost four to one.
- End-to-end NVMe, real-time machine learning, and a wealth of data services.
- SAN provisioning in less than 30 seconds.
- Six-nines (99.9999%) availability and replication for business continuity and disaster recovery.
- Automates data placement for optimal performance with no overhead.
- Secure, end-to-end encryption.
- Cloud mobility moves data from PowerMax to AWS, Azure, and Dell EMC ECS for long-term retention on lower-cost object storage.
- Multi-controller scale-up, scale-out architecture.
- Performance optimized – up to 15M IOPS, 350GB/s sustained bandwidth, under 100µs read latency.
Amazon io2 Block Express
Amazon io2 Block Express is a SAN built for the cloud. It offers customers high-performance block storage. Amazon promotes it as being available for as little as half the cost of a typical on-premises SAN. io2 Block Express volumes are aimed at the largest, most I/O-intensive, mission-critical deployments of Oracle databases, SAP HANA, Microsoft SQL Server, InterSystems database, and SAS Analytics.
- Sub-millisecond latency.
- Pay-as-you-go pricing.
- Scale capacity by petabytes in minutes.
- Using Amazon Elastic Compute Cloud (Amazon EC2) R5b instances and io2 Block Express, SQL Server runs up to 3x faster on AWS than the next-fastest cloud provider.
- Up to 256,000 IOPS, 4,000 MB/second throughput, and 64 TB of capacity.
- Ability to stripe multiple io2 volumes together.
- By decoupling the compute from the storage at the hardware layer and rewriting the software to take advantage of this decoupling, Block Express enables high performance.
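As a sketch of how such a volume might be requested programmatically, the helper below builds the parameters for an EBS create-volume call and checks them against the documented io2 Block Express limits; the availability zone, sizes, and IOPS values are illustrative:

```python
def io2_volume_params(az: str, size_gib: int, iops: int) -> dict:
    """Build parameters for an EBS io2 volume request, validating them
    against the documented io2 Block Express limits (64 TiB, 256,000
    IOPS, 1,000:1 IOPS-to-GiB ratio) before any API call is made."""
    if size_gib > 64 * 1024:
        raise ValueError("io2 Block Express volumes max out at 64 TiB")
    if iops > 256_000:
        raise ValueError("io2 Block Express volumes max out at 256,000 IOPS")
    if iops > 1_000 * size_gib:
        raise ValueError("io2 supports at most 1,000 IOPS per GiB")
    return {"AvailabilityZone": az, "Size": size_gib,
            "VolumeType": "io2", "Iops": iops}

params = io2_volume_params("us-east-1a", size_gib=4096, iops=64_000)
# With boto3 installed and AWS credentials configured, the actual
# request would then be:
#   import boto3
#   boto3.client("ec2").create_volume(**params)
```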
FalconStor StorSafe
FalconStor made a name for itself in data protection. It provides the breadth of storage and data protection services that big data applications require. The company takes advantage of object storage in its on-premises and cloud archival offerings, with secure data containers that can exploit the capabilities of the major object storage offerings, both on-premises and in the cloud.
- StorSafe seamlessly adds object storage to some of its data protection products and solutions.
- It uses the metadata management capabilities of object storage to access the most applicable data.
- It harnesses the immutable storage of WORM-compliant offerings to provide a perpetual, always available archive.
- By breaking data into fragments and dispersing them throughout the cluster, availability goes up, while a data center breach resulting in a stolen machine yields no data loss – no complete dataset can be mounted.
- FalconStor has more than an exabyte of data under management for long-term archives.
Cloudian HyperStore
Cloudian HyperStore offers limitless, non-disruptive scalability, mixed-configuration flexibility, consolidated file and object data, and geo-distribution, and is hybrid cloud and multi-cloud ready. It integrates with all major public cloud providers and provides a software-defined storage platform for big data applications, scaling as needed to support more workloads, more users, and more data across all locations.
- Ransomware protection and data security, including on-prem S3 Object Lock, secure shell, an integrated firewall, RBAC/IAM access controls, AES-256 server-side encryption for data at rest, and SSL for data in transit.
- Cloudian HyperStore integrates with VMware’s vSAN Data Persistence platform. This provides a single shared storage environment for both cloud-native and traditional applications, all managed in VMware Cloud Foundation with VMware Tanzu.
- For non-VMware environments, Cloudian offers Kubernetes S3 Operator, a plug-in that enables developers to provision and manage HyperStore object storage from within their container-based applications and with no gateways or translation layers.
- Cloudian’s HyperIQ is a monitoring, observability and analytics solution for managing storage and related infrastructure across on-premise and hybrid cloud environments.
- Management features include bucket-level policy management, eliminating cluster-wide storage policy lock-in that limits flexibility.
- Multi-tenancy, QoS, and billing for shared storage or service provider deployments.
- Scales to hundreds of petabytes via the S3 RESTful API, with data integrity and protection.
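Because HyperStore exposes the standard S3 API, its Object Lock support can be exercised with ordinary S3 tooling. A minimal sketch of building a WORM-protected write (the endpoint URL and bucket name are hypothetical, not Cloudian-documented values):

```python
from datetime import datetime, timedelta, timezone

def locked_put_kwargs(bucket: str, key: str, body: bytes, retain_days: int) -> dict:
    """Keyword arguments for a standard S3 put_object call that writes
    a WORM-protected object using S3 Object Lock in compliance mode."""
    retain_until = datetime.now(timezone.utc) + timedelta(days=retain_days)
    return {
        "Bucket": bucket, "Key": key, "Body": body,
        "ObjectLockMode": "COMPLIANCE",          # cannot be shortened or removed
        "ObjectLockRetainUntilDate": retain_until,
    }

kwargs = locked_put_kwargs("backups", "db-dump.gz", b"...", retain_days=365)
# With boto3, pointed at an on-prem HyperStore endpoint (hypothetical URL):
#   s3 = boto3.client("s3", endpoint_url="https://hyperstore.example.com")
#   s3.put_object(**kwargs)
```

The same call works unchanged against AWS S3, which is the point of an S3-compatible store.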
Hitachi Vantara Content Platform
The Hitachi Content Platform (HCP) provides secure software-defined object storage at exabyte scale that optimizes big data platforms, like Hadoop. It harnesses various standard APIs to offer multi-cloud support, policy-based governance, and compliance and metadata management. Users can take advantage of a large partner ecosystem. Content intelligence features provide discovery and fast exploration of business data and storage operations whether on premises, off premises, in the cloud, structured or unstructured.
- Can scale from 4 nodes to 80 nodes.
- Supports S3, NFS, CIFS, REST, HTTP, HTTPS, WebDAV, SMTP, and NDMP.
- Provides storage for DAS, SAN, and object.
- A cost-optimized option is available for deep data storage at massive scale.
- Gateway tools extend file services to the cloud.
- HCP Anywhere offers file sync and share, remote file services and data protection for a secure workplace.
- Partnership with Alluxio, a virtual data layer that lies between compute and storage resources, which unifies data access at memory speed and bridges big data frameworks with multiple storage platforms.
- Applications connect with Alluxio to access data stored in any underlying storage system.
- Data analytics applications such as Apache Hadoop MapReduce, Apache Spark, and Apache Presto can continue running on Alluxio with standard interfaces and a global namespace.
Scality Ring
Scality Ring provides a scalable, high-performance online data lake for big data that applications such as Hadoop and Spark can access over S3A. Scality Ring storage runs on-premises and extends into the public cloud. It integrates file and object storage for workloads focused on high-capacity unstructured data, and encompasses multi-cloud namespaces, native Azure object storage support, and bidirectional S3 compatibility.
- Runs on commodity hardware.
- Provides integrated file and object storage in one solution rather than via a gateway.
- Scale-out, peer-to-peer architecture.
- Geo-replication facilitates high availability and disaster recovery.
- Amazon S3 and IAM APIs for object storage access.
- POSIX compatible file system, with standard NFS v4/v3 and SMB 3.0 file interfaces.
- Policy-based data replication and erasure-coding for up to eleven 9s data durability.
- Integrated hybrid-cloud data management: smart lifecycle, replication, global metadata search across Ring and AWS, Azure and Google cloud storage.
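Since Ring speaks the S3 API, pointing Hadoop or Spark at it is largely a matter of S3A configuration. A minimal sketch, with a hypothetical endpoint and placeholder credentials:

```python
def s3a_ring_conf(endpoint: str, access_key: str, secret_key: str) -> dict:
    """Hadoop configuration keys that point the S3A connector at an
    on-prem S3-compatible endpoint instead of AWS. Path-style access
    is typically required for S3-compatible object stores."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": "true",
    }

conf = s3a_ring_conf("https://ring.example.internal", "AKIA...", "secret")
# In a PySpark session these would typically be applied as:
#   for k, v in conf.items():
#       spark.sparkContext._jsc.hadoopConfiguration().set(k, v)
# after which spark.read.parquet("s3a://bucket/path") reads from Ring.
```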
Read next: 5 Storage Needs of Modern Data Centers