I have been saying this for a long time, and the trend is clear: whatever your application, be it cloud, Hadoop or file system, appliances are in your future. If you have a storage problem, at least one vendor has a solution to your problem that plugs in and works.
Data center consolidation, either within a corporation or to a public cloud, is very much a part of today's IT landscape. So what should you be doing to ensure that you have a job in the future, with your current employer or a future one?
My advice: Get on the appliance bandwagon and get ahead of the curve.
When companies outsource all or part of their IT infrastructure, it is because someone else can make a profit doing it. The margins I have seen and heard on IT outsourcing are up to 25 percent. Ask yourself why some other company or some cloud provider can buy all the hardware and software needed and still make a profit doing the work the company's internal IT department does. From what I see, part of that is due to internal politics that often prevent efficiencies in the data center. Each department wants things done its own way.
But the appliance model is going to change the way that people think about IT, and it will change how organizations are structured.
Since this is a self-help article, I want to cover a number of different appliances that you should be studying up on so that you are ready for the future. If your IT infrastructure is stovepiped, without integrated divisions for storage, virtualization and computation, the environment is going to need to change quickly over the next few years. Otherwise, you might be looking for a new job, as someone is going to come in and modernize your environment, either by outsourcing it to a vendor or IT contractor or by moving it to a cloud provider.
My view is that you need to get with the plan and you need to prepare, as the light at the end of the tunnel is a train coming at you. Let's talk about some of the various appliances that you will need to become familiar with.
Hadoop appliances are divided into three camps today.
- Standard Hadoop
- Shared file system Hadoop
- Fast storage appliance Hadoop
With a standard appliance, you buy nodes that come preloaded and configured, with hardware optimized for Hadoop.
You can buy this type of hardware and software from many vendors. In some cases you are just buying the software for your own cluster, and in other cases you are buying the hardware and software from a single integrator that has optimized both. Either way this is standard Hadoop with three-way replication and hardware and software configured to run Hadoop—and not much else.
Shared file system Hadoop
A shared file system appliance generally uses either the Lustre or GPFS file system, which optimizes the shuffle phase in Hadoop. This works because the data can be read globally from the nodes and does not have to be read and distributed across the network. All of the nodes are attached to the shared file system and can read the data directly from the storage without having to go from server to network to server to storage.
This has been shown to be significantly faster for some problems than the standard configuration method for Hadoop. In addition, you have the reliability of RAID and failover (if designed into the architecture). Vendors have reliability studies showing that triple replication is not needed when the storage is protected by RAID.
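To see why dropping triple replication matters, it can help to compare the raw capacity each approach needs for the same amount of user data. The sketch below is only arithmetic under stated assumptions: the 1 PB data set and the RAID 6 group of 8 data drives plus 2 parity drives are hypothetical figures chosen for illustration, not numbers from any vendor. It says nothing about reliability, which is what the vendor studies address.

```python
def raw_capacity_needed(usable_tb: float, data_units: int, redundancy_units: int) -> float:
    """Raw capacity needed to hold usable_tb of user data when the layout
    writes redundancy_units extra units for every data_units units of data."""
    return usable_tb * (data_units + redundancy_units) / data_units


usable_tb = 1000.0  # hypothetical: 1 PB of user data

# Three-way replication: one copy of the data plus two extra copies.
replicated = raw_capacity_needed(usable_tb, data_units=1, redundancy_units=2)

# Assumed RAID 6 group: 8 data drives plus 2 parity drives.
raid6 = raw_capacity_needed(usable_tb, data_units=8, redundancy_units=2)

print(f"3x replication: {replicated:.0f} TB raw")
print(f"RAID 6 (8+2):   {raid6:.0f} TB raw")
```

Under these assumptions the replicated cluster needs 3,000 TB of raw disk and the RAID 6 layout needs 1,250 TB. Plug in your own data size and RAID geometry before drawing conclusions.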
Fast storage appliances
A number of vendors have developed or are developing SSD appliances for Hadoop. There are lots of them and more on the way. These are optimized for Hadoop and are easy to manage.
Which is the best?
Of course, the answer depends on the amount and type of data, how much is coming in and how many queries are going on. This is an area where you can help yourself by understanding the issues and asking the right questions.
Big File System Appliances
For now there are two different large shared file systems used for large storage appliances—GPFS and Lustre. Multiple vendors make these appliances. While Lustre is an open source project, GPFS is a product from IBM.
These file systems scale far beyond anything a current NAS vendor offers. Both scale to thousands of clients and deliver performance of many hundreds of GiB/sec. What NAS vendor has 30+ PB in a single namespace with scalable performance?
The problem is that, for the most part, both file systems have been designed around the requirement for large block, sequential I/O for user applications. This is not to say that the hardware and software cannot be configured to support smaller block sizes. I am not saying that small block performance will be better with a NAS box, but here are some questions you might want to ask to show your shared file system prowess to your management.
- Understand your workload in terms of:
- How many I/O requests are being done at the same time?
- What are the read/write ratio and read and write request sizes?
- How many opens/creates are being done at one time?
- How much storage is needed?
- Is ANSI T10 DIF/PI used?
- Is some other method used?
- Does it use checksums or error correction code?
- How does the vendor tell you which disk drive is causing the errors?
- Is the RAID declustered?
- How long does a rebuild take?
- What is the performance hit during rebuild?
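The workload questions in the list above can be answered from an I/O trace if you have one. Here is a minimal sketch, assuming a trace is available as a list of records with an operation type, a request size, and start and end times. The `IORequest` record and the `summarize` function are hypothetical names made up for this example, not part of any vendor tool, and the four-request trace is invented sample data.

```python
from dataclasses import dataclass


@dataclass
class IORequest:
    op: str       # "read" or "write"
    size: int     # request size in bytes
    start: float  # time the request was issued, in seconds
    end: float    # time the request completed, in seconds


def summarize(requests):
    """Return read fraction, average request sizes and peak concurrent requests."""
    reads = [r.size for r in requests if r.op == "read"]
    writes = [r.size for r in requests if r.op == "write"]

    # Sweep over start/end events to find the peak number of requests in flight.
    # Completions sort before starts at the same timestamp, so back-to-back
    # requests are not counted as overlapping.
    events = []
    for r in requests:
        events.append((r.start, 1))
        events.append((r.end, -1))
    events.sort(key=lambda e: (e[0], e[1]))

    in_flight = peak = 0
    for _, delta in events:
        in_flight += delta
        peak = max(peak, in_flight)

    return {
        "read_fraction": len(reads) / len(requests) if requests else 0.0,
        "avg_read_bytes": sum(reads) / len(reads) if reads else 0.0,
        "avg_write_bytes": sum(writes) / len(writes) if writes else 0.0,
        "peak_in_flight": peak,
    }


# Invented sample trace: three 1 MiB reads and one 4 KiB write.
trace = [
    IORequest("read", 1 << 20, 0.0, 0.4),
    IORequest("read", 1 << 20, 0.1, 0.5),
    IORequest("write", 4096, 0.2, 0.3),
    IORequest("read", 1 << 20, 0.5, 0.9),
]
profile = summarize(trace)
print(profile)
```

A profile like this gives you concrete numbers to put in front of a vendor: a workload dominated by large reads is a natural fit for these file systems, while one dominated by small writes deserves harder questions.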
These are just some of the important questions that need to be asked for these types of appliances. As you scale up, you need to ensure that nothing else breaks and that you can meet your mission.
There are lots of other data analysis and database products here today, and even more coming, that might meet your organization's requirements. These new appliances might correlate information, use graph analysis to look for relationships, or apply some other method.
The issues are going to be the same: if you can't do it faster and cheaper locally with the technology you have, then your work might be outsourced to the cloud.
Our Jobs Are at Stake!
If CIOs and their staffs do not get with the plan, someone else will. And then someone else will do your job for you.
My best friend once said to me when we were in Japan testing a new file system we had designed, "We need to eat sushi or we will be sushi." The point was you either go with the flow of what is happening or you become a statistic.
I think the way things are going in the market, we are all going to have to learn some new skills. Management has to be included in this process, because the organization itself will have to be restructured to run efficiently.
If not, be prepared to have your work outsourced either to the cloud or to another organization.
There are a bunch of new technologies that are going to solve new problems and old problems. We all need to become familiar with these technologies to survive.
I think that much of what we see today for storage will become specialized appliances. While 90 percent of the data accessed today is likely accessed via a POSIX file system interface and 10 percent via an object interface, that is going to change over the rest of the decade.
Get ready: this will be similar to the microprocessor ride in the 1990s, when we changed from proprietary large processors to microprocessors from DEC, MIPS, Intel and others.