Building a Storage Architecture for E-Discovery
In the last couple of years, the whole concept of e-discovery has changed the way organizations view information and given rise to a myriad of solutions aimed at meeting e-discovery mandates. Since this column is about storage, we'll focus on the storage implications of those e-discovery solutions, which pose problems that storage architects need to consider if they want to keep up with data growth, performance and backup and recovery demands.
Today's e-discovery systems essentially have two ways of organizing data. Most systems are blade-oriented, so adding blades increases system performance. Blade-based systems are the most common approach to e-discovery; the exceptions are some specialized systems that have a huge amount of memory and are often used in the mainframe world.
The two main ways these systems organize data are:
- Each of the blades has local storage associated with it, and the local storage is used for the data that's being analyzed, or
- Each of the blades is connected to a network and the storage is shared across the e-discovery system.
The basic difference between these two methods is that in the first case, the application links the various nodes and spreads the data across the nodes. The application communicates to the various nodes and understands which nodes have which information. In the second case, requests are made to the nodes, and the nodes make I/O requests to the shared file system.
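As a rough illustration of that difference, here is a minimal Python sketch. Everything in it is an assumption made for the example: the class names, the round-robin placement and the in-memory dictionaries standing in for file systems do not come from any particular e-discovery product.

```python
class PartitionedCluster:
    """Method 1: each blade owns a slice of the data on local storage."""

    def __init__(self, node_count):
        # One private store per blade (stands in for a local file system).
        self.local_stores = [dict() for _ in range(node_count)]
        # Application-level map of which blade holds which document.
        self.placement = {}

    def ingest(self, doc_id, text):
        # Round-robin placement is an assumption; real products may differ.
        node = len(self.placement) % len(self.local_stores)
        self.local_stores[node][doc_id] = text
        self.placement[doc_id] = node

    def search(self, term):
        # The application fans the query out; each blade scans only its own data.
        hits = []
        for store in self.local_stores:
            hits.extend(d for d, text in store.items() if term in text)
        return sorted(hits)


class SharedPoolCluster:
    """Method 2: every blade issues I/O against one shared file system."""

    def __init__(self, node_count):
        self.node_count = node_count
        self.shared_store = {}  # stands in for the shared file system
        self.io_requests = 0    # count of reads sent to the shared pool

    def ingest(self, doc_id, text):
        self.shared_store[doc_id] = text

    def search(self, term):
        doc_ids = sorted(self.shared_store)
        hits = []
        for node in range(self.node_count):
            # Each blade takes every node_count-th document and reads it
            # from the shared pool rather than from local disk.
            for doc_id in doc_ids[node::self.node_count]:
                self.io_requests += 1
                if term in self.shared_store[doc_id]:
                    hits.append(doc_id)
        return sorted(hits)
```

Both classes return the same hits for the same documents. What changes is where the knowledge of data placement lives: in the application for the first method, in the shared file system for the second.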
Which method works best depends largely on the data itself, the types of search, the data layout (file system and storage connectivity) and the application. Each method has advantages and disadvantages from a storage perspective, and that is what we need to explore. Of course, the e-discovery application has a great deal to do with the efficient use of storage, but understanding the application's underlying requirements will likely give you a reasonable idea of how well the storage scales, and therefore how well the application itself will scale.
Method 1: Divide Storage and Conquer
In this method, each of the blades has its own local storage and file system. Since the local file system is relatively small, file system performance for data allocation and data fragmentation is not likely to be a concern. Also, for most e-discovery systems, data is not deleted but just added, so in the divide and conquer method, additional nodes and additional file systems are added in a consistent ratio as data is increased. Sometimes there is a head node or nodes containing index data and information about what is stored on each node. Scaling for this type of system is accomplished by adding equal parts blades and storage. The application might or might not benefit from adding CPUs with storage, but that is how scaling is accomplished.
Some of the storage challenges of this type of system include:
- Backup and restore
- Blade updates to software and hardware
- Disk failure
Backup and restore
If you have lots of nodes, managing backup and restoration often becomes a complex task. Often you cannot back up directly to tape, given the performance of disk-to-tape (D2T) backup, so you end up doing disk-to-disk (D2D) or disk-to-virtual-tape backup. The problem is compounded by the time it takes to restore the data on the local disk, so many applications mirror the drives and sometimes mirror the whole node. All of this costs money for hardware, software, power and cooling.
Blade updates to software and hardware
The question I always ask software vendors is: How do I upgrade and update the nodes? If the answer is through downtime, you need to ask yourself if you can afford the downtime. Some systems cannot afford significant downtime for upgrades to hardware or software. If you are using software that divides the problem across many nodes, software updates must roll into the system without significant downtime, and hardware upgrades must let you mirror nodes and then bring them into the system. You need to plan for this as part of the initial installation, not three years into the project.
Disk failure
Even though disk drives have gotten more reliable over the last few years, they still fail. You need to find out what happens if a disk drive fails within the system. Does the node just get replaced, is the node mirrored, or what happens?
Method 2: Global Storage Pool
If you are using a global storage pool, all of the processing nodes are connected through a shared file system or a shared mount point via NAS. Because each node sees all of the data, scaling is limited by the file system. Parallel searches require that many I/O requests be sent to the storage pool, and the latency to access storage is higher with this method given the large storage network. That latency might, however, be offset by the latency between interconnects in the first method.
Some of the storage challenges are:
- Backup and restore
- Fragmentation and file system scaling
- Storage configuration complexity
Backup and restore
In the first method, you had many nodes to manage for backup and restoration; with this method, you have one huge file system that must be backed up and restored. This becomes a big problem in the event of a failure or data corruption. Backing up many nodes is a complex proposition, but restoring a large file system will require significant downtime.
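To get a feel for the downtime involved, a back-of-the-envelope estimate can help. The Python sketch below simply divides capacity by sustained restore throughput. The capacities and throughputs are hypothetical figures picked for illustration, not measurements from any real system.

```python
def restore_hours(capacity_tb, throughput_mb_per_s):
    """Hours needed to stream capacity_tb back at a sustained rate (decimal units)."""
    seconds = capacity_tb * 1_000_000 / throughput_mb_per_s
    return seconds / 3600


# Hypothetical figures: one blade's local disk versus the whole shared pool.
node_hours = restore_hours(capacity_tb=2, throughput_mb_per_s=200)
pool_hours = restore_hours(capacity_tb=200, throughput_mb_per_s=2000)

print(f"single node: {node_hours:.1f} hours")
print(f"shared pool: {pool_hours:.1f} hours")
```

Under these assumed numbers, the shared pool takes ten times as long to restore even though its restore path is ten times faster, and the entire system is unavailable for that time rather than one node's slice of the data.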
Fragmentation and file system scaling
With a large file system ingesting lots of data, the chances of the file system becoming fragmented are higher. Some will say that fragmentation is not an issue because all of the I/O requests will be random, but from my experience, many e-discovery applications have some sequential I/O because that is how the data is written to the file system. What I have seen from traces of searches is that sometimes the requests are sequential, then have a skip increment and then are sequential again. Today, standard RAID systems cannot see this sequentiality, especially if the file system does not write the data in order because of fragmentation. In the future this problem might be alleviated with object storage, as the access pattern might be recognized by the OSD target (see Let's Bid Adieu to Block Devices and SCSI).
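The sequential, skip, sequential pattern is easy to spot in a trace once you look for it. The Python sketch below is one way to do that, assuming a trace is just a list of (offset, length) pairs in bytes. The trace values and function names are made up for the example and do not reflect any particular tracing tool.

```python
def split_sequential_runs(requests):
    """Group (offset, length) requests into runs of back-to-back reads."""
    runs = []
    for offset, length in requests:
        if runs and runs[-1]["end"] == offset:
            # This request starts exactly where the previous one ended.
            runs[-1]["end"] = offset + length
            runs[-1]["count"] += 1
        else:
            # A gap or a backward seek starts a new run.
            runs.append({"start": offset, "end": offset + length, "count": 1})
    return runs


def skip_increments(runs):
    """Bytes skipped between the end of one run and the start of the next."""
    return [nxt["start"] - cur["end"] for cur, nxt in zip(runs, runs[1:])]


# Hypothetical trace: three sequential reads, a skip, three more sequential reads.
trace = [(0, 64), (64, 64), (128, 64), (1024, 64), (1088, 64), (1152, 64)]
runs = split_sequential_runs(trace)
print(runs)
print(skip_increments(runs))
```

If fragmentation scatters a file's blocks, the offsets the storage device sees no longer line up end to end, so the same logical pattern would break into many short runs.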
Storage configuration complexity
Like it or not, large shared file systems and even large NAS systems are complex to manage and maintain. This is not to say that a large cluster of nodes isn't difficult to manage, but you are moving the management problem from the clusters to the storage. Storage seems to me to be harder to manage than clusters, at least with the tools that are available today.
No Right Answer
From my perspective, adding equal amounts of storage and computation assumes that all things are equal: that for every given number of bytes of storage you need X amount of computation. This model seems broken to me. Storage densities are not increasing at the same rate as CPU performance, and improvements in storage performance, seek time and latency are abysmal compared with CPU performance increases. If you add blades and storage in equal amounts and the ratios are not balanced, you often need to add more than you think you need to achieve balance, with increased costs for power and cooling.
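The cost of an unbalanced ratio can be put in numbers. The Python sketch below assumes every blade brings a fixed amount of storage and a fixed amount of compute. The "compute units" and all of the figures are hypothetical and exist only to show the arithmetic.

```python
import math


def blades_needed(data_tb, compute_units, tb_per_blade, units_per_blade):
    """Blades required when storage and compute can only be bought together."""
    for_storage = math.ceil(data_tb / tb_per_blade)
    for_compute = math.ceil(compute_units / units_per_blade)
    # Whichever resource runs out first dictates the blade count.
    blades = max(for_storage, for_compute)
    return {
        "blades": blades,
        "idle_storage_tb": blades * tb_per_blade - data_tb,
        "idle_compute_units": blades * units_per_blade - compute_units,
    }


# Hypothetical workload: 100 TB of data that needs only 20 units of compute.
plan = blades_needed(data_tb=100, compute_units=20, tb_per_blade=2, units_per_blade=1)
print(plan)
```

In this made-up case, storage forces 50 blades even though compute needs only 20, so 30 units of compute are bought, powered and cooled without being needed.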
On the flip side, shared file systems often have scaling problems and require a significant amount of attentiveness to architectural planning.
There is no right answer as to which approach is best. What is really required to get the best result is to understand the types of information you will be looking for and to get a clear understanding of how the e-discovery applications work. If it were simple, everyone would be doing it.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 28 years of experience in high-performance computing and storage.