Resource Management and Control: A Straw Proposal for Linux
Managing resources and controlling their allocation for complex workloads have long been topics of discussion in open systems, but no one has followed through on making open systems look and behave like an IBM mainframe. On IBM's MVS and later operating systems, resources can be allocated and managed so as to execute policy, whether that policy is to prioritize credit card approval codes at Christmas time or to prioritize stock purchases from a specific broker.
The commonly used approach today is to buy more hardware so that workloads are not prioritized but everyone is served equally. This works, of course, but distributing the workload evenly is costly and complex. With next-generation technology like non-volatile memories and PCIe SSDs, there are going to be more resources in addition to the CPU that need to be scheduled to make sure everything fits in memory and does not overflow.
I think the time has come for Linux—and likely other operating systems—to develop a more robust framework for resource management that can address the needs of future hardware and meet the requirements for scheduling resources. This framework is not going to be easy to develop, but it is needed by everything from databases and MapReduce to simple web queries.
Where to Begin?
You cannot schedule resources if you are not monitoring resources. Many things will need to be monitored so they can be managed. Additionally, the user context must be passed along so that if a user makes an SQL query or a request to a daemon, the information about the user who made the request travels with the thread that services it.
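As a minimal sketch of what passing the user context along with a request could look like, the Python below attaches the identity of the requesting user to the current execution context so that code deeper in the call chain can see who the work is for. The names RequestContext, current_request, handle_request and run_query are hypothetical and exist only for this illustration; a real implementation would live in the kernel and in the daemons themselves.

```python
import contextvars
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    """Identity of the user on whose behalf work is being done (hypothetical)."""
    user: str
    group: str


# Holds the originating user's context for the currently running request.
current_request = contextvars.ContextVar("current_request", default=None)


def run_query(sql):
    """Stand-in for a database daemon servicing a query."""
    ctx = current_request.get()
    owner = ctx.user if ctx else "unknown"
    # A resource manager would charge the I/O, memory and CPU used here to `owner`.
    return f"{owner}: {sql}"


def handle_request(ctx, sql):
    """Attach the caller's context, do the work, then restore the previous context."""
    token = current_request.set(ctx)
    try:
        return run_query(sql)
    finally:
        current_request.reset(token)
```

Calling handle_request(RequestContext("alice", "hr"), "SELECT 1") makes the query visible to run_query as alice's work even though run_query never receives the user as an argument.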
The following resources need to be monitored so they can be managed in the future:
- Storage queues from the operating system all the way down to, but not including, the disk drive
- Memory allocations for all types of memory and page sizes
- Network communication queues and resources
Monitoring the queued I/O requests to a file system on a single server might not be that hard, but monitoring the queues to a NAS device or parallel file system is a whole lot more difficult. However, if you do not monitor the file system, you will never be able to control it, and a single user or process or even thread could use a significant portion of the bandwidth, starving higher-priority work.
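To make the monitoring idea concrete, here is a small sketch, in Python, of tracking queued I/O per user and flagging anyone holding more than a given share of the queue. The QueueMonitor class, its method names and the 50 percent threshold are assumptions made for this illustration, not an existing Linux interface.

```python
from collections import Counter


class QueueMonitor:
    """Tracks bytes of queued I/O per user (hypothetical illustration)."""

    def __init__(self):
        self.queued_bytes = Counter()

    def enqueue(self, user, nbytes):
        # Called when a request from `user` enters the file system queue.
        self.queued_bytes[user] += nbytes

    def complete(self, user, nbytes):
        # Called when a previously queued request finishes.
        self.queued_bytes[user] -= nbytes

    def share(self, user):
        # Fraction of all currently queued bytes that belong to `user`.
        total = sum(self.queued_bytes.values())
        return self.queued_bytes[user] / total if total else 0.0

    def heavy_users(self, threshold=0.5):
        # Users whose share of the queue exceeds the threshold.
        return sorted(u for u in self.queued_bytes if self.share(u) > threshold)
```

With this kind of accounting in place, a scheduler has the data it needs to notice that one user is crowding out higher-priority work; deciding what to do about it is a separate policy question.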
I am not suggesting monitoring the disk queues, as this would be very difficult with modern disk drives for two reasons. First, the concept of a user or process is not known at the disk drive or storage device level, and moving that framework into the SAS standard would be next to impossible. Second, if the requests sent to the disk drive are already the higher-priority requests, then the file system is effectively controlling the disk drive, so monitoring, and therefore management, should be done at a higher level.
The need to manage memory is a big concern. We now have various page pools for various sizes of large pages. We also have shared memory, and soon we will have non-volatile memories to deal with. Who should be able to allocate from a page pool and who should not? What happens if there is not enough of a memory resource?
Being able to monitor the various memory resources is going to be important, given that this is likely the critical resource for many environments.
Today the CPUs are pretty well monitored and scheduled as a resource, though, as you will see, I would change the scheduling algorithm a bit for some workloads.
I believe that networking is similar to storage and that the user context needs to be moved down so that users get the allocations of bandwidth that they are allowed rather than a free-for-all. A good example would be an eight-core CPU utilized by a high-priority user and a low-priority user. If the high-priority user is utilizing four cores and four cores are free, the low-priority user could be making requests to use the network, and since the cores are free, those requests could impact the high-priority user’s network usage.
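One way to picture users getting the bandwidth they are allowed is a weighted split of the link that never gives a user more than they ask for and hands any unused share to the others. The Python below is only a sketch under those assumptions; the function allocate_bandwidth, the weights and the link speed are made up for illustration and are not part of any existing network stack.

```python
def allocate_bandwidth(link_mbps, demands, weights):
    """Split link_mbps among users by weight, capped at each user's demand."""
    alloc = {user: 0.0 for user in demands}
    active = {user for user, demand in demands.items() if demand > 0}
    remaining = float(link_mbps)
    while active and remaining > 1e-9:
        total_weight = sum(weights[user] for user in active)
        satisfied = set()
        spent = 0.0
        for user in active:
            # Weighted fair share of whatever bandwidth is still unassigned.
            fair = remaining * weights[user] / total_weight
            give = min(fair, demands[user] - alloc[user])
            alloc[user] += give
            spent += give
            if demands[user] - alloc[user] <= 1e-9:
                satisfied.add(user)
        remaining -= spent
        if not satisfied:
            break  # Everyone left wants more than their share; link is full.
        # Bandwidth left over by satisfied users is redistributed next round.
        active -= satisfied
    return alloc
```

With a 1000 Mb/s link, weights of 3 for the high-priority user and 1 for the low-priority user, and both asking for 900 Mb/s, the split is 750 and 250. If the high-priority user only needs 100 Mb/s, the low-priority user can have the other 900, so idle bandwidth is not wasted.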
My Straw Proposal
I will assume that everything at every level is monitored and tracked by user and, likely, by group and, for SELinux, potentially by security level and compartment. In addition, I propose adding something like a project identifier. Users and groups are used for permissions, while projects are often used for accounting in some operating systems. For example, you might be in the HR group, but for accounting purposes you might want to be able to charge your time to the Seattle HR group or maybe even to the development group if you are doing recruiting for them. So let’s assume that you have resource management by:
- Security level and compartment