Big data definitions shift over time. When the term first came into vogue, one definition was simply a lot of data residing on primary systems. An updated definition is large data stores and/or datasets that are subject to active business processes such as analytics.
Now a growing subclass of big data is shifting the definition again: machine data. Yale Professor Daniel Abadi defines machine data as “data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action.” In other words, humans invent the machines (this is not Skynet), but the machines generate data without human input.
Machine data is massive and never-ending, flowing from an estimated 1 trillion embedded sensors throughout the world as well as from utilities exploration, connected cars and houses, microscopy imaging, weather tracking, and much more.
The first and most obvious challenge is that there is a lot of machine data, and the amount is growing. Massively scalable storage is a base requirement for storing unstructured machine data. On top of that, machine data ranges in type and size from tiny log files to GB-sized or larger image files. This incredible scale requires a specialized storage system with high performance, massive scalability, and the ability to efficiently manage vastly different file sizes and access patterns.
Beyond Analytics
The widest application for big data in general is analytics. It is not possible for humans to immediately grasp the meaning of raw unstructured machine data; we need software to derive structure from the raw data points and to analyze and present results in an understandable format. Analytics programs are necessary to identify patterns and unusual events, assign structure to unstructured data, and self-learn from incoming data.
A few examples include analyzing traffic or user patterns in Web server and network event logs, trending in automated financial trades, or categorizing customer phone calls from a busy call center.
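As a minimal illustration of what assigning structure to raw machine data can look like in practice, the sketch below parses raw Web server log lines into structured records and flags clients with an unusually high error rate. The log format, field names, and thresholds are illustrative assumptions, not any particular vendor's analytics engine.

```python
import re
from collections import defaultdict

# Hypothetical example: derive structure from raw Web server log lines
# (a common Apache-style format is assumed) and flag clients with an
# unusually high rate of server errors. All thresholds are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Turn one raw log line into a structured record, or None if unparseable."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    record = m.groupdict()
    record["status"] = int(record["status"])
    return record

def flag_unusual_clients(lines, min_requests=100, error_rate_threshold=0.2):
    """Return client IPs whose share of 5xx responses exceeds the threshold."""
    totals, errors = defaultdict(int), defaultdict(int)
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        totals[record["ip"]] += 1
        if record["status"] >= 500:
            errors[record["ip"]] += 1
    return {
        ip: errors[ip] / totals[ip]
        for ip in totals
        if totals[ip] >= min_requests and errors[ip] / totals[ip] > error_rate_threshold
    }

if __name__ == "__main__":
    with open("access.log") as f:  # placeholder path for a raw log file
        print(flag_unusual_clients(f))
```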
Distributed cloud-based architectures are especially popular given the cloud’s massive scalability. However, these architectures run into a problem when companies 1) require fast processing of massive machine data and 2) need to manipulate that data beyond analytics. Cloud latency becomes a serious issue in these cases.
Enter Massively Scalable NAS
This is where on-premise specialized NAS storage comes into play: massively scalable, high performance file storage systems that support intensive processing from a variety of applications. Data may replicate or archive to the cloud, but the cloud is not the primary processing environment.
For example, one automotive services provider ingests several million image files every day. Streaming this amount of data to the cloud is a non-starter because of cloud latency. The provider uses massively scaled NAS to store the images and cloud caching to maintain pointers to the on-premise files.
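A rough back-of-the-envelope calculation shows why streaming that volume to the cloud is a non-starter. The file count and average image size below are assumed purely for illustration; the provider's actual figures are not stated here.

```python
# Back-of-the-envelope sketch: the figures below (file count, average image
# size) are illustrative assumptions, not the provider's actual numbers.
FILES_PER_DAY = 3_000_000        # "several million" image files per day
AVG_FILE_SIZE_MB = 5             # assumed average image size
SECONDS_PER_DAY = 24 * 60 * 60

daily_volume_tb = FILES_PER_DAY * AVG_FILE_SIZE_MB / 1_000_000
required_gbps = (FILES_PER_DAY * AVG_FILE_SIZE_MB * 8) / SECONDS_PER_DAY / 1_000

print(f"Daily ingest: ~{daily_volume_tb:.1f} TB")
print(f"Sustained WAN throughput needed: ~{required_gbps:.2f} Gbps")
# Roughly 15 TB/day and ~1.4 Gbps sustained -- before retries, metadata
# operations, or per-file round-trip latency is even considered.
```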
In another example, a university digital library ingests huge volumes of raw data from oceanic, climate, and genomic studies all over the world. A high performance NAS system stores the data in a single namespace for analytics and scientific processing.
These specialized single-namespace NAS systems are custom-built using commodity hardware to store and process huge quantities of raw machine data. They can store billions to trillions of files of vastly different sizes and scale capacity and performance linearly. They easily support analytics applications as well as high-end applications such as rendering, microscopy, image processing, and 3-D animation.
There are very few vendors in the world today who offer this specialized solution. Isilon was first out of the gate with the capability of storing and processing millions of files. Isilon’s original team built the OneFS file system on FreeBSD. EMC acquired Isilon and continues to develop it. Its high-performing S-Series and X-Series offer fast performance and massive scalability for intensive computing applications.
Following the EMC acquisition, Isilon’s original team members founded Qumulo to provide high-volume storage and processing for intensive machine data computing. The flash-based clustered system comes with native data-aware analytics for real-time visibility into massively scaled stores; the capability is built into the Qumulo Scalable File System (QSFS). Qumulo Core software is built on Linux and can run on commodity hardware, VMs, or dedicated appliances.
What to Look For
When researching massively scalable storage systems for machine data, look for the following characteristics:
Massive scalability. Scaling to billions of files is a given. Look for linear performance and capacity scaling. A flexible node and cluster architecture plus short rebuild times should allow you to add or remove nodes painlessly. Look for systems with a combination of flash drives for high performance and dense multi-TB drives for storing massive volumes of machine data and large active working sets.
High performance architecture. Storage systems need intensive performance to ingest and process huge amounts of machine data. Don’t look only at flash capabilities: also ask the vendor how it handles system overhead. For example, Qumulo dispenses with the overhead of hardware RAID and uses efficient erasure coding in place of mirroring (see the sketch after this list).
Efficient handling of different file sizes and access types. Machine data is a wild world of transactional and sequential access patterns and wildly different file sizes. Machine data storage must be able to handle all of it effectively – large and small files, sequential and transactional access patterns.
Data analytics. A massively scalable, high performance storage system will be a bear to manage without real-time analytics. Native data-aware functions should offer real-time visibility into capacity trending, current activity and hot files, workload performance, and cluster configurations. If the system can feed results into other management systems, that is a big plus for storage management.
Vendor support. Some high performance storage systems have solid features but score low in system support. When you have a legacy general purpose server with a problem, maybe you can wait on support. When you have a mission-critical, massively scalable primary NAS, you can’t. Look for support that is fast, expert, and multi-channel – and that won’t bust the budget on second-year maintenance fees.
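On the erasure coding point above, a quick sketch makes the capacity overhead difference concrete. The stripe geometries shown are common illustrative examples, not any specific vendor's actual data protection layout.

```python
# Illustrative overhead comparison: mirroring vs. erasure coding.
# Stripe geometries below are common examples, not any vendor's actual layout.
def mirroring_overhead(copies):
    """Raw capacity consumed per unit of usable data with N-way mirroring."""
    return copies

def erasure_coding_overhead(data_blocks, parity_blocks):
    """Raw capacity consumed per unit of usable data with k+m erasure coding."""
    return (data_blocks + parity_blocks) / data_blocks

print(f"3-way mirroring:     {mirroring_overhead(3):.2f}x raw per usable TB")
print(f"4+2 erasure coding:  {erasure_coding_overhead(4, 2):.2f}x raw per usable TB")
print(f"8+2 erasure coding:  {erasure_coding_overhead(8, 2):.2f}x raw per usable TB")
# 4+2 and 8+2 stripes each survive two simultaneous failures at 1.5x and
# 1.25x overhead respectively, versus 3.0x for triple mirroring.
```

The takeaway: wide erasure-coded stripes deliver comparable failure tolerance at a fraction of the raw capacity that mirroring consumes, which is why it matters how a vendor handles protection overhead at this scale.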