Resource management and controlling the allocation of resources for complex workloads has always been a topic for discussion in open systems, but no one has ever followed through on making open systems look and behave like an IBM mainframe. On IBM’s MVS and later OSes, resources can be allocated and managed in such a way as to execute policy, whether that policy be to prioritize credit card approval codes at Christmas time or to prioritize stock purchases from a specific broker.
The commonly used approach today is to buy more hardware so that workloads are not prioritized but everyone is served equally. This works, of course, but distributing the workload evenly is costly and complex. With next-generation technology like non-volatile memories and PCIe SSDs, there are going to be more resources in addition to the CPU that need to be scheduled to make sure everything fits in memory and does not overflow.
I think the time has come for Linux—and likely other operating systems—to develop a more robust framework for resource management that can address the needs of future hardware and meet the requirements for scheduling resources. This framework is not going to be easy to develop, but it is needed by everything from databases and MapReduce to simple web queries.
Where to Begin?
You cannot schedule resources if you are not monitoring resources. Lots of things are going to need to be monitored so they can be managed. Additionally, the user context must be passed along so that if a user makes an SQL query or request to a daemon, the information about the user that made the request is passed along in the thread.
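To make the idea of passing user context along with a request concrete, here is a minimal sketch using Python's standard `contextvars` module. The `RequestContext` type, the `current_request` variable, and the ledger are hypothetical names chosen for illustration. The sketch assumes a daemon that installs the caller's identity before doing work on the caller's behalf.

```python
import contextvars
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    # Hypothetical identity record carried with each request.
    user: str
    group: str


# Context variable holding the identity of the user whose request is active.
current_request = contextvars.ContextVar("current_request", default=None)


def charge_io(ledger, nbytes):
    """Attribute an I/O to whichever user's request is active, if any."""
    ctx = current_request.get()
    owner = ctx.user if ctx is not None else "unattributed"
    ledger[owner] = ledger.get(owner, 0) + nbytes


def handle_query(ctx, ledger, nbytes):
    """Daemon-side handler: install the caller's context, then do the work."""
    token = current_request.set(ctx)
    try:
        charge_io(ledger, nbytes)
    finally:
        # Restore the previous context so identities never leak between requests.
        current_request.reset(token)


ledger = {}
handle_query(RequestContext("alice", "hr"), ledger, 4096)
handle_query(RequestContext("bob", "dev"), ledger, 8192)
charge_io(ledger, 512)  # work done outside any request is not attributed to a user
```

Without the context, all three I/Os would be charged to the daemon itself, which is exactly the information loss that makes later management impossible.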
The following resources need to be monitored so they can be managed in the future:
- Storage queues from the operating system and all the way down to, but not including, the disk drive
- Memory allocations for all types of memory and page sizes
- Network communication queues and resources
Monitoring the queued I/O requests to a file system on a single server might not be that hard, but monitoring the queues to a NAS device or parallel file system is a whole lot more difficult. However, if you do not monitor the file system, you will never be able to control it, and a single user or process or even thread could use a significant portion of the bandwidth, starving higher-priority work.
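For the simple single-server case, per-user queue monitoring could look something like the sketch below. The `QueueMonitor` class and its method names are hypothetical, and the sketch assumes every request arrives already tagged with the user who issued it. It says nothing about the much harder NAS or parallel file system case.

```python
from collections import defaultdict


class QueueMonitor:
    """Tracks queued requests and completed bytes per user (illustrative only)."""

    def __init__(self):
        self.in_flight = defaultdict(int)   # requests queued but not yet completed
        self.bytes_done = defaultdict(int)  # bytes transferred so far

    def submit(self, user):
        self.in_flight[user] += 1

    def complete(self, user, nbytes):
        self.in_flight[user] -= 1
        self.bytes_done[user] += nbytes

    def bandwidth_share(self, user):
        """Fraction of all completed bytes that belong to this user."""
        total = sum(self.bytes_done.values())
        return self.bytes_done.get(user, 0) / total if total else 0.0


MIB = 1024 * 1024
mon = QueueMonitor()
for _ in range(4):
    mon.submit("batch")
mon.submit("trader")
for _ in range(3):
    mon.complete("batch", MIB)
mon.complete("trader", MIB)
```

Here the low-priority `batch` user has taken three quarters of the bytes moved so far and still has a request queued. A policy engine can only notice that, and act on it, if the counters exist in the first place.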
I am not suggesting monitoring the disk queues, as this would be very difficult with modern disk drives for two reasons. First, the concept of a user or process is not known at the disk drive or storage device level, and moving that framework into the SAS standard would be next to impossible. Second, if the requests sent to the disk drive are the higher-priority requests, then the file system will be controlling the disk drive, so monitoring and therefore management should be done at a higher level.
The need to manage memory is a big concern. We now have various page pools for various sizes of large pages. We also have shared memory, and soon we will have non-volatile memories to deal with. Who should be able to allocate from a page pool and who should not? What happens if there is not enough of a memory resource?
Being able to monitor the various memory resources is going to be important, given that this is likely the critical resource for many environments.
Today the CPUs are pretty well monitored and scheduled as a resource. Though, as you will see, I would change the scheduling algorithm a bit for some workloads.
I believe that the network is similar to storage and that the user context needs to be moved down so that users get the allocations of bandwidth that they are allowed rather than a free-for-all. A good example would be an eight-core CPU utilized by a high-priority user and a low-priority user. If the high-priority user is utilizing four cores and four cores are free, the low-priority user could be making requests to use the network, and since the cores are free, that could impact the high-priority user’s network usage.
My Straw Proposal
I will make the assumption that everything at every level is monitored and tracked by user and, likely, by group and, for SELinux, potentially by security level and compartment. In addition, I propose adding something like a project identifier. Users and groups are used for permissions, while projects are often used for accounting in some operating systems. For example, you might be in the HR group, but for accounting purposes you might want to be able to charge your time to the Seattle HR group or maybe even to the development group if you are doing recruiting for them. So let’s assume that you have resource management by:
- User
- Group
- Project
- Security level and compartment
So what are the resources that need to be managed and why?
Volatile page memory. Management of this type of memory should include the management of pools of various page sizes. Newer versions of the Linux kernel support page pools with pages of different sizes available for allocation. Just being first in line should not mean that an application can access and use these pools, and you might also want to limit how much of a pool an application can use. Management of page pools should be part of any resource management framework.
One other thing that should be considered is changing allocations for some high-priority action during a long-running job. So there must be a way to de-allocate the space that is being used by a thread/process that is running and using a resource.
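A minimal sketch of what per-user limits on a page pool, plus a way to take pages back from a running job, might look like. The `PagePool` class, its limits table, and the user names are all hypothetical. This is a bookkeeping model only and does not touch real kernel page pools.

```python
class PagePool:
    """Bookkeeping model of one large-page pool with per-user limits."""

    def __init__(self, page_size, total_pages, limits):
        self.page_size = page_size
        self.free = total_pages
        self.limits = limits  # user -> maximum pages that user may hold
        self.held = {}        # user -> pages currently held

    def allocate(self, user, pages):
        """Grant pages only if the user is within its limit and the pool has room."""
        used = self.held.get(user, 0)
        if used + pages > self.limits.get(user, 0) or pages > self.free:
            return False
        self.held[user] = used + pages
        self.free -= pages
        return True

    def reclaim(self, user, pages):
        """Take pages back from a running user; returns how many were recovered."""
        taken = min(pages, self.held.get(user, 0))
        self.held[user] = self.held.get(user, 0) - taken
        self.free += taken
        return taken


pool = PagePool(page_size=2 * 1024 * 1024, total_pages=100,
                limits={"batch": 80, "urgent": 60})
first = pool.allocate("batch", 80)     # long-running job arrives first
blocked = pool.allocate("urgent", 40)  # only 20 pages remain free
recovered = pool.reclaim("batch", 20)  # policy takes pages back from the job
second = pool.allocate("urgent", 40)   # the high-priority request now fits
```

The hard part, which this sketch skips entirely, is what the long-running job does when its pages are taken away. That is where the real design work lies.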
Volatile shared memory. This type of memory should be managed similarly to large pages. There must be management of allocations and a method to de-allocate shared memory.
Non-volatile memory. In my opinion, this is one of the major reasons this new framework needs to be researched and implemented. It is a very high-cost resource. Let’s say 6 TB of disk is about $900 with a controller and RAID storage, or $150 per TB, and let’s say NAND SSD storage is $1200 per TB, which is the going rate today for enterprise SSDs. The cost per GB of non-volatile memory is going to be another 6x to 10x (TBD) according to everything I have seen written, but this is still a far lower cost than DDR memory. Therefore this is a very expensive resource, and expensive resources need to be managed.
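The arithmetic behind those figures can be laid out in a few lines. One assumption is made here: the 6x to 10x multiplier is read as applying to the NAND SSD price, which is how the paragraph above appears to use it. The variable names are for illustration only.

```python
# Figures taken from the text above (US dollars).
disk_cost = 900    # 6 TB of disk with controller and RAID
disk_tb = 6
ssd_per_tb = 1200  # enterprise NAND SSD

disk_per_tb = disk_cost / disk_tb

# Assumption: the 6x to 10x premium is relative to the SSD price.
nvm_low_per_tb = ssd_per_tb * 6
nvm_high_per_tb = ssd_per_tb * 10

ssd_vs_disk = ssd_per_tb / disk_per_tb
nvm_vs_disk_low = nvm_low_per_tb / disk_per_tb
nvm_vs_disk_high = nvm_high_per_tb / disk_per_tb

print(f"disk: ${disk_per_tb:.0f}/TB, SSD: ${ssd_per_tb}/TB")
print(f"NVM:  ${nvm_low_per_tb} to ${nvm_high_per_tb} per TB")
print(f"NVM is {nvm_vs_disk_low:.0f}x to {nvm_vs_disk_high:.0f}x the cost of disk")
```

Under that reading, non-volatile memory lands somewhere between $7,200 and $12,000 per TB, or roughly 48 to 80 times the cost of disk.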
There was a great presentation at the IEEE Mass Storage Conference by Intel’s Andy Rudoff, who discussed what Intel is doing for management of this type of emerging memory. I think what Andy presented is a reasonable first step, but far more needs to be done to really move forward to a management framework.
Storage. We all know about storage quotas, but these need to be extended to various tiers of storage. I know that some vendors have done this, but there needs to be a common framework. Quotas as we know them today likely cover the POSIX arena, but what about object storage? There needs to be some research and some work to come up with a consistent management framework that all can agree upon (AWS, OpenStack, Google, etc.) for file counts, tiers and data space.
Storage bandwidth. This is the holy grail of system management. People have tried to address this for decades using everything in the stack. I have seen implementations in the OS, the file system, the storage network, and even storage controllers. The mainframe solved this problem a long time ago, and to a great degree, parallel file systems such as GPFS and Lustre, which support many hundreds and even thousands of clients writing to the same file systems, have done the same. Coordination and management of bandwidth is very complex and will be very difficult while still maintaining file system performance.
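One well-known building block for bandwidth limits is the token bucket. The sketch below shows it for a single user on a single node. It is not how the mainframe, GPFS, or Lustre do it, and the class and parameter names are hypothetical. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Single-node token bucket: tokens are bytes the user may still send."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)
        self.last = 0.0  # timestamp of the previous call, in seconds

    def try_send(self, nbytes, now):
        """Refill for the elapsed time, then admit the I/O if enough tokens remain."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# A user allowed 100 bytes/s with bursts of up to 200 bytes.
bucket = TokenBucket(rate_bytes_per_s=100, burst_bytes=200)
burst = bucket.try_send(200, now=0.0)      # uses the whole burst allowance
too_soon = bucket.try_send(100, now=0.5)   # only 50 bytes have been refilled
later = bucket.try_send(100, now=1.0)      # 100 bytes are available again
```

The mechanism itself is simple. Coordinating thousands of such buckets across clients that write to the same file system, without wrecking performance, is the difficult part.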
How Can We Do All This?
Almost three decades ago, the developers of the Cray UNIX operating system created something called a user database (UDB), which listed the resources that could be used and managed. For example, one of the resources that could be managed was the CPU scheduler, and it controlled which users could use the real-time scheduler. This was cool stuff for UNIX in 1988, but was of course something that IBM’s MVS had solved long before, as had a few other operating systems.
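To give a feel for the idea, here is a toy user database in the same spirit. The field names and entries are invented for illustration and do not reflect the actual layout of the Cray UDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UdbEntry:
    # Hypothetical fields; the real UDB format is not reproduced here.
    user: str
    realtime_scheduler: bool  # may this user request the real-time scheduler?
    max_memory_mb: int


UDB = {
    entry.user: entry
    for entry in [
        UdbEntry("trader", realtime_scheduler=True, max_memory_mb=65536),
        UdbEntry("batch", realtime_scheduler=False, max_memory_mb=8192),
    ]
}


def may_use_realtime(user):
    """Users without an entry get no access to managed resources."""
    entry = UDB.get(user)
    return entry is not None and entry.realtime_scheduler
```

The point is that policy lives in one table, keyed by identity, and every resource manager consults it rather than inventing its own rules.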
I think there are going to have to be levels of management and control. Fine-grained management is going to have to be well defined such that scheduling might have to kick in slowly when you are talking about managing workflow across thousands or even tens of thousands of nodes. I think administrators are going to have to define the levels of granularity they are going to need to schedule against. For example, you might start at ten-minute intervals and then, after five minutes, check the measurement and drop the interval to two minutes. This, of course, will not work for short-running processes that need lots of different resources but are very high priority.
There is clearly a need for some basic research in this area to address scheduling and management requirements. Algorithms for scheduling, collection methods, distribution of scheduling and policy frameworks are just a few of the areas that need to be researched.
We can learn some lessons from the past, from mainframe operating systems and maybe from what Cray did in UNICOS back in the 1980s. But with 100,000+ core systems here today and 1,000,000+ core systems on the short-term horizon, there needs to be a new paradigm and a better understanding of what can and cannot be done. New technologies like non-volatile memory and PCIe SSDs that are expensive but can really speed up certain workloads are likely going to drive these requirements.
The real question is whether the community can get its act together and do something. Or are we going to continue with the status quo and just continue to throw money at the problem by purchasing more hardware to run high-priority applications on dedicated resources?
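The interval example above can be sketched as a small ladder of administrator-defined granularities. The `LADDER` values come from the text. The rule for stepping back to the coarser interval once usage is within policy is my own assumption, as are the names.

```python
# Administrator-defined granularities in seconds: ten minutes, then two minutes.
LADDER = (600, 120)


def next_interval(current, out_of_policy):
    """Move to a finer interval when usage is out of policy, else back to coarser."""
    i = LADDER.index(current)
    if out_of_policy:
        i = min(i + 1, len(LADDER) - 1)  # already at the finest level: stay there
    else:
        i = max(i - 1, 0)                # assumption: relax when usage is in policy
    return LADDER[i]


tightened = next_interval(600, out_of_policy=True)
held = next_interval(120, out_of_policy=True)
relaxed = next_interval(120, out_of_policy=False)
```

A real system would need more levels and some damping so that it does not flip between intervals on every measurement.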