Resource Management and Control: A Proposal for Linux - Page 2
So what are the resources that need to be managed and why?
Volatile page memory. Management of this type of memory should include management of pools of various page sizes. Newer versions of the Linux kernel support multiple page pools, with pages of different sizes available for allocation. Just being first in line should not mean that an application can access and use these pools, and you might also want to limit how much of them it can use. Management of page pools should be part of any resource management and control framework.
One other thing that should be considered is changing allocations for some high-priority action during a long-running job. That means there must be a way to de-allocate space from a running thread or process that is currently using the resource.
Volatile shared memory. This type of memory should be managed similarly to large pages. There must be management of allocations and a method to de-allocate shared memory.
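One way to picture the two requirements above (limits on who can take from a pool, and a way to take space back from a running job) is the toy model below. It is only a sketch under stated assumptions: `PagePool`, its methods, and the owner names are hypothetical and do not correspond to any existing Linux kernel interface.

```python
class PagePool:
    """Toy model of one fixed-size pool (for example, a pool of large pages).

    Hypothetical illustration only; not a real kernel or library API.
    """

    def __init__(self, total_pages, limits):
        self.total_pages = total_pages
        self.limits = limits   # owner -> maximum pages that owner may hold
        self.held = {}         # owner -> pages currently held

    def free_pages(self):
        return self.total_pages - sum(self.held.values())

    def allocate(self, owner, pages):
        """Grant pages only if the owner's limit and the pool both allow it."""
        current = self.held.get(owner, 0)
        if current + pages > self.limits.get(owner, 0):
            return False   # being first in line is not enough: limit exceeded
        if pages > self.free_pages():
            return False   # pool exhausted
        self.held[owner] = current + pages
        return True

    def reclaim(self, owner, pages):
        """De-allocate up to `pages` from a running owner; return pages freed."""
        freed = min(pages, self.held.get(owner, 0))
        if freed:
            self.held[owner] -= freed
        return freed


pool = PagePool(total_pages=10, limits={"batch": 8, "urgent": 6})
pool.allocate("batch", 8)    # long-running job takes most of the pool
pool.reclaim("batch", 4)     # high-priority work arrives: take space back
pool.allocate("urgent", 6)   # now the urgent job fits
```

A real implementation would also have to decide what happens to the data in the reclaimed pages, which is the hard part this sketch leaves out.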
Non-volatile memory. In my opinion, this is one of the major reasons this new framework needs to be researched and implemented. It is a very high-cost resource. Let’s say 6 TB of disk with a controller and RAID storage is about $900, or $150 per TB, and let’s say NAND SSD storage is $1200 per TB, which is the going rate today for enterprise SSDs. The cost per GB of non-volatile memory is going to be another 6x to 10x (TBD) according to everything I have seen written, but this is still a far lower cost than DDR memory. Therefore this is a very expensive resource, and expensive resources need to be managed.
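The arithmetic behind those figures can be worked through in a few lines. This is a back-of-the-envelope sketch using only the numbers quoted above. It assumes the "6x to 10x" multiplier is applied to the SSD price, which the text does not state explicitly, and the variable and function names are illustrative.

```python
# Figures quoted in the text (USD).
DISK_PER_TB = 900 / 6   # 6 TB of RAID disk with controller for about $900
SSD_PER_TB = 1200       # enterprise NAND SSD

# Assumption: the "6x to 10x" multiplier is relative to the SSD price.
NVM_MULTIPLIER_LOW, NVM_MULTIPLIER_HIGH = 6, 10


def nvm_cost_range_per_tb(ssd_per_tb=SSD_PER_TB):
    """Return the (low, high) estimated non-volatile memory cost per TB."""
    return ssd_per_tb * NVM_MULTIPLIER_LOW, ssd_per_tb * NVM_MULTIPLIER_HIGH


low, high = nvm_cost_range_per_tb()
print(f"disk: ${DISK_PER_TB:,.0f} per TB")
print(f"SSD:  ${SSD_PER_TB:,.0f} per TB")
print(f"NVM:  ${low:,.0f} to ${high:,.0f} per TB (estimate)")
```

Under that assumption, non-volatile memory would land somewhere between $7,200 and $12,000 per TB, roughly 48 to 80 times the per-TB cost of the disk configuration.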
There was a great presentation at the IEEE Mass Storage Conference by Intel’s Andy Rudoff, who discussed what Intel is doing for management of this type of emerging memory. I think what Andy presented is a reasonable first step, but far more needs to be done to really move forward to a management framework.
Storage. We all know about storage quotas, but these need to be extended to various tiers of storage. I know that some vendors have done this, but there needs to be a common framework. Quotas as we know them today likely cover the POSIX arena, but what about object storage? There needs to be some research and some work to come up with a consistent management framework that all can agree upon (AWS, OpenStack, Google, etc.) for file counts, tiers and data space.
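To make the idea of quotas that span tiers, file counts and data space concrete, here is a minimal sketch. Everything in it is hypothetical: the class names, the tier names and the limits are invented for illustration, and no vendor's or cloud provider's quota API is being described.

```python
from dataclasses import dataclass, field


@dataclass
class TierQuota:
    """Limits and usage for one storage tier (hypothetical model)."""
    max_bytes: int
    max_files: int
    used_bytes: int = 0
    used_files: int = 0

    def can_store(self, size_bytes, file_count=1):
        return (self.used_bytes + size_bytes <= self.max_bytes
                and self.used_files + file_count <= self.max_files)


@dataclass
class UserQuota:
    """Per-user quotas keyed by tier name, e.g. 'ssd', 'disk', 'object'."""
    tiers: dict = field(default_factory=dict)

    def charge(self, tier, size_bytes, file_count=1):
        """Record a write if the tier exists and both limits allow it."""
        quota = self.tiers.get(tier)
        if quota is None or not quota.can_store(size_bytes, file_count):
            return False
        quota.used_bytes += size_bytes
        quota.used_files += file_count
        return True


user = UserQuota(tiers={
    "ssd": TierQuota(max_bytes=100, max_files=2),
    "object": TierQuota(max_bytes=10_000, max_files=1_000),
})
user.charge("ssd", 60)      # accepted
user.charge("ssd", 60)      # rejected: would exceed the SSD byte limit
user.charge("object", 60)   # accepted on the cheaper tier instead
```

The same record covers a POSIX tier and an object tier, which is the kind of consistency a common framework would need to provide.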
Storage bandwidth. This is the holy grail of system management. People have tried to address this for decades at every level of the stack. I have seen implementations in the OS, in the file system, in the storage network, and even in storage controllers. The mainframe solved this problem a long time ago, and to a great degree, parallel file systems such as GPFS and Lustre, which support many hundreds and even thousands of clients writing to the same file systems, have done the same. Coordinating and managing bandwidth is very complex, and doing it while still maintaining file system performance will be very difficult.
How Can We Do All This?
Almost three decades ago, the developers of the Cray UNIX operating system created something called a user database (UDB), which listed the resources that could be used and managed. For example, one of the resources that could be managed was the CPU scheduler, and it controlled which users could use the real-time scheduler. This was cool stuff for UNIX in 1988, but was of course something that IBM's MVS had solved long before, as had a few other operating systems.
I think there are going to have to be levels of management and control. Fine-grained management is going to have to be well defined, such that scheduling might have to kick in slowly when you are talking about managing workflow across thousands or even tens of thousands of nodes. I think administrators are going to have to define the levels of granularity they need to schedule against. For example, you might start at ten-minute intervals and then after five minutes check the measurement and then drop it to two minutes. This, of course, will not work for short-running processes that need lots of different resources but are very high priority.
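The stepping between granularity levels in that example could look something like the sketch below. It is a simplification under stated assumptions: the level table, the `over_budget` flag (standing in for whatever the measurement reports), and the function name are all hypothetical, and the sketch only tightens the interval, never relaxes it.

```python
# Hypothetical administrator-defined granularity levels in seconds,
# coarsest first: ten minutes, then two minutes (from the example above).
LEVELS = (600, 120)


def next_interval(current, over_budget, levels=LEVELS):
    """Return the interval to use after a measurement.

    Steps one level finer when the measurement shows the job is over its
    resource budget; otherwise keeps the current interval.
    """
    idx = levels.index(current)
    if over_budget and idx + 1 < len(levels):
        return levels[idx + 1]
    return current


interval = LEVELS[0]                                   # start coarse
interval = next_interval(interval, over_budget=True)   # drop to the finer level
```

Because the finest level is still measured in minutes, a scheme like this cannot react to a short-running, high-priority process, which is the limitation noted above.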
There is clearly a need for some basic research in this area to address scheduling and management requirements. Algorithms for scheduling, collection methods, distribution of scheduling and policy frameworks are just a few of the areas that need to be researched.
We can learn some lessons from the past, from mainframe operating systems and maybe from what Cray did in UNICOS back in the 1980s. But with 100,000+ core systems here today and 1,000,000+ core systems on the short-term horizon, there needs to be a new paradigm and a better understanding of what can and cannot be done. New technologies like non-volatile memory and PCIe SSDs, which are expensive but can really speed up certain workloads, are likely going to drive these requirements.
The real question is whether the community can get its act together and do something. Or are we going to continue with the status quo and just keep throwing money at the problem by purchasing more hardware to run high-priority applications on dedicated resources?