Top of the Big Data Stack: The Importance of the Software Stack - Page 3
Big Data OS
Underlying the application layer, the application languages, and parallel data access lies something equally important: the OS (Operating System). When a data request hits the OS, a number of things happen to it.
Operating systems all differ, but in general there is an IO scheduler that schedules and retires IO requests according to some algorithm. The scheduler will also try to combine neighboring data requests so that a single read or write can satisfy all of them. This can improve throughput, but it also increases latency, reducing the apparent IOPS (Input/Output Operations Per Second) performance.
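The merging step can be sketched in a few lines. This is a hypothetical simplification, not the algorithm any particular scheduler uses: it sorts pending (offset, length) requests and coalesces any that touch or overlap, so one larger read can replace several small ones.

```python
def merge_requests(requests):
    """Coalesce adjacent or overlapping (offset, length) IO requests.

    A toy model of what an IO scheduler's merge pass does: after
    sorting by offset, any request that begins at or before the end
    of the previous merged request is folded into it.
    """
    merged = []
    for offset, length in sorted(requests):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            prev_off, prev_len = merged[-1]
            new_end = max(prev_off + prev_len, offset + length)
            merged[-1] = (prev_off, new_end - prev_off)
        else:
            merged.append((offset, length))
    return merged

# Two adjacent 4 KiB requests collapse into one 8 KiB request;
# the distant third request stays separate.
print(merge_requests([(0, 4096), (4096, 4096), (16384, 4096)]))
# → [(0, 8192), (16384, 4096)]
```

The trade-off the article describes is visible here: the device services fewer, larger operations (better throughput), but an individual request may wait while the scheduler looks for merge candidates (worse latency).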
However, IOPS can be very important to Big Data applications. Remember that data is stored in some fashion, such as key-value pairs, so retrieving a particular piece of information is likely to force a seek to somewhere else on the storage media. This increases latency and the total time needed to access a very small bit of data. This access pattern can occur so often in Big Data that the random IOPS performance of the underlying storage becomes the bottleneck for application performance.
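The seek penalty is easy to observe with a small micro-benchmark. The sketch below (file size, block size, and timing method are my own choices, not from the article) reads the same 4 KiB blocks of a scratch file first in order and then in shuffled order. On spinning media the random pass is dramatically slower; on an SSD, or when the file fits in the page cache, the gap shrinks, which is exactly why the article singles out random IOPS as the metric that matters.

```python
import os
import random
import tempfile
import time

BLOCK = 4096        # read size, matching a typical small key-value fetch
BLOCKS = 2048       # 8 MiB scratch file

def time_reads(path, offsets):
    """Read one BLOCK-sized chunk at each offset; return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - start

# Build a scratch file of random bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(BLOCK * BLOCKS))
    path = tmp.name

sequential = [i * BLOCK for i in range(BLOCKS)]
shuffled = sequential[:]
random.shuffle(shuffled)

t_seq = time_reads(path, sequential)
t_rand = time_reads(path, shuffled)
print(f"sequential: {t_seq:.4f}s  random: {t_rand:.4f}s")
os.remove(path)
```

Note that this measures through the OS page cache; a rigorous test would use direct IO and a file much larger than RAM, but the sketch shows the access-pattern difference the text is describing.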
There are approaches that can help with these access patterns, such as wide column store databases, but as the database scales in either size (total capacity) or amount of data (number of records), IOPS becomes an increasingly important aspect of performance.
As mentioned earlier, this is driven by various pieces of the OS, including context switching, which Henry Newman has written about. The need for more IOPS puts tremendous pressure on the OS. Coupled with this, SSD (Solid State Drive) manufacturers are claiming close to one million IOPS from a single device.
How does an OS handle this many IO requests? It will force a reexamination of how an OS copes with very large IOPS, and it means that we as a community will need to push OS developers to rethink or adapt their designs to the needs of Big Data.
Device drivers are one of those wondrous things that just seem to happen. New hardware comes out and – bingo! – there are device drivers that allow the OS to communicate with it. But it's really a "duck" situation: on the surface the duck (the device drivers) looks calm and serene, while underneath the water its legs (the device driver writers) are paddling furiously.
New hardware comes out all the time and requires someone to write device drivers; otherwise the hardware is worthless. Perhaps worse, hardware vendors love to tweak their hardware ever so slightly, which may or may not break existing device drivers, forcing the drivers to be modified or even rewritten.
In my opinion, the unsung heroes of an OS are the device driver authors. It is definitely not an easy task, and it requires a great amount of coordination and testing.
How do device drivers impact Big Data? Beyond the obvious issue of making sure you have the correct, up-to-date device drivers in place, in my opinion there is one key problem the Big Data community has not come to grips with: the operation and administration of large distributed systems.