Kubernetes Scheduler Algorithm Combinatorial Modeling
Most people familiar with the Kubernetes scheduler algorithm are aware of the basic mathematical concepts around its logic. Anyone who has also pursued an interest in learning about it at a low level is aware of its relation to the Knapsack and Bin Packing Problem. In this article, we will explore the introductory basics of the Kubernetes scheduler algorithm, its combinatorial model, and conceptual optimization approaches.
A good first stop for a mathematical resource is, as always, Wolfram. Wolfram explains the Knapsack Problem below:
This is great if you have a quantitative background and were frequently subjected to this type of terminology in the past. However, this is likely mostly gibberish for anyone else. For some reason, a variety of mathematical problems have fun nicknames that motivate amusingly simplified explanations. Two examples that instantly come to mind are the Pigeonhole Principle for surjections and the Monty Hall problem (requires knowledge of famous older game shows with donkeys) for Bayesian Probability. Thankfully this is no different for the Knapsack problem.
The Wikipedia article introduces the problem with a qualitative translation:
The knapsack problem is a problem in combinatorial optimization: Given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible. It derives its name from the problem faced by someone who is constrained by a fixed-size knapsack and must fill it with the most valuable items. The problem often arises in resource allocation where the decision makers have to choose from a set of non-divisible projects or tasks under a fixed budget or time constraint, respectively.
In this situation, one can quickly intuit that the knapsacks are the worker nodes, and the items are the workloads. The scheduler must place the workloads on the worker nodes to ensure they all fit within the resource limits (weight is less than or equal to limit), while ensuring the resource utilization is maximized (total value is as large as possible). Of course, on Kubernetes, the items and knapsacks are non-uniformly described and subject to arbitrary constraints (e.g., node selectors, taints, affinities, etc.), but the problem still describes the situation to completeness.
Bin Packing Problem
Well actually, not to full completeness. Unless you are using single node Kubernetes, then you have n-numbered multiple knapsacks. The multiple knapsack variation then becomes similar to the Bin Packing Problem. The same Wikipedia article explains the differences between the two:
It differs from the Bin Packing Problem in that a subset of items can be selected, whereas, in the Bin Packing Problem, all items have to be packed to certain bins. The concept is that there are multiple knapsacks. This may seem like a trivial change, but it is not equivalent to adding to the capacity of the initial knapsack. This variation is used in many loading and scheduling problems in Operations Research and has a Polynomial-time approximation scheme.
We will go ahead and approximate the optimal Kubernetes Scheduler with the Bin Packing Problem. This is particularly reasonable given that the Kubernetes documentation even includes an article specifically explaining Resource Bin Packing for scheduler customization. Wolfram also explains the Bin Packing Problem:
Once again, the relevant Wikipedia article provides a helpful qualitative explanation:
The bin packing problem is an optimization problem, in which items of different sizes must be packed into a finite number of bins or containers, each of a fixed given capacity, in a way that minimizes the number of bins used.
Also, once again, we can intuit the clear connection with workloads (packs) and worker nodes (bins). Sometimes a picture can also assist with literally illustrating the connection. A Kubernetes release presentation at CERN for version 1.19 can be found here. Pages fourteen through twenty-two demonstrate the bin packing scheduler versus the default scheduler with worker node “bins” and workload “packs”. Note that the illustrations do assume uniformity across workloads and worker nodes, and no external constraints. While clearly unrealistic, it does still convey the point accurately.
Note that the Bin Packing Problem is both NP-hard and NP-complete. Therefore, we will avoid directly solving for an optimization and discuss optimization techniques at a high level later.
The article also presents a formal statement for the problem, assuming a discrete representation of the space. This is valid given that both packs and bins (likewise workloads and worker nodes) have discrete limits and measurements. Therefore, we will move forward with the integer linear programming formulation given in that section.
Note that slightly farther down in the article, there are approximation ratios for measuring the performance of the approximation algorithms attempting to optimize the Bin Packing Problem. We are only discussing this topic at a high level in this article and presenting an introduction to the problem. Therefore we are not interested in competing with current publications and the Kubernetes teams for improving the scheduler algorithm, and will avoid presenting our own concrete solutions with their own scores.
Now that we have presented the basics of the problem, we can move on to discussing attempted and hypothetical solutions, and how these relate to the Kubernetes scheduler. Note that we will only discuss solutions to the online version of the Bin Packing Problem. This is because the scheduler does not have perfect information about the workloads for all states for all time. Workloads may be added to and/or removed from the cluster, and thus not all potential states of packs (workloads) at any given moment in time are known.
There are several simple approximation solutions to the bin packing problem. Each will be explained within the context of the Kubernetes scheduler and a Kubernetes cluster.
The first is the Next Fit algorithm. It always maintains the availability of a single worker node capable of executing workloads. When a workload does not fit on the worker node, then the current worker node is removed from consideration, and the workload is placed onto a new worker node. This is obviously rather inefficient with respect to resource allocation, but very computationally efficient. A variation known as Next-k-Fit maintains the availability of
k number of worker nodes in the pool. This improves resource allocation, but delivers diminishing returns as
Another option is First Fit. It keeps all worker nodes under consideration, and attempts to place a workload onto the first worker node where it fits. An even better algorithm that requires more computational power is the Best Fit. It also maintains all worker nodes for scheduling consideration, but attempts to place each new item into the bin with the maximum load into which it fits.
The final class of solutions are the Worst Fit and Almost Worst Fit. Worst Fit will place each workload onto the worker node with minimum load. This algorithm is superior to basically only Next Fit. Almost Worst Fit places each new workload onto the second most empty worker node. If the workload does not fit, then it will place the workload onto the most empty worker node.
There is a potential opportunity here for optimization of the Bin Packing Problem algorithm through genetic algorithms. One could initialize a gene pool of various scheduling algorithms, and then evaluate fitness with a set of workloads to schedule on a set of worker nodes. The problem with this approach would be the differing performance of individual scheduling algorithms with respect to the differences in the various evaluation matrices for determining fitness. Some algorithms may perform much better for some matrices, and then much poorer for others due to specialization.
One could mitigate the high standard deviation in fitness evaluation by evaluating across multiple matrices of workloads and worker nodes before beginning crossover and mutation. Fitness could be more comprehensively determined in this manner. It would certainly require more computational power, but at this time historically CPU and GPU are relatively inexpensive for this type of optimization. This is especially true if the genetic algorithm is well developed.
With this approach, the crossover process would likely need to maintain or discard selection criteria for the worker nodes to schedule workloads. The mutation would likely need to vary on the weight for each selection criterion against the other criteria. In this manner, one could perhaps have a fairly well optimized algorithm for scheduling workloads after six to eight generations. However, the algorithm may still be computationally inefficient.
There is also a potential opportunity here for optimization of the Bin Packing Algorithm through machine learning. In this situation, the optimization would be focused on training a single algorithm. This is in direct contrast to optimization through genetic algorithm where an entire gene pool comprised of the set of possible factors are downselected and iteratively optimized. The problem with this approach is similar to one sometimes seen in genetic algorithms where the gene pool becomes stale or evolves in a suboptimal trajectory. The algorithm could potentially approach a suboptimal solution, and would need to be restarted at least a few times to observe the resulting effects.
One could easily generate training and evaluation data matrices for the algorithm with the machine learning approach. Dynamic sets of worker nodes and workloads could easily be created very quickly, and a human could easily determine the optimal solution for small numbers of worker nodes and workloads. These smaller data sets could then be extrapolated to large data sets inductively, and therefore the optimal solution would still be known. One could also re-use the same data sets by reordering workload introduction for scheduling decisions.
Eventually the machine learning approach would produce an optimized (to some degree) algorithm for scheduling. Both approaches (genetic algorithm and machine learning) would use algorithms derived from the mathematical models. The parameters would then be tuned accordingly. The optimization could then be deconstructed to retrofit the model. However, this numerical approach does mean we are avoiding solving the problem empirically, from first principles, or perhaps even heuristically. It is reasonable to assume we have demonstrated sufficiently that we do not want to attempt that path.
In this article, we explored modeling and optimizing the Kubernetes scheduler algorithm. Thankfully teams of smart people on behalf of various companies are regularly working on this effort. As they continue to improve the Kubernetes scheduler algorithm, we can be certain that Kubernetes will continue to be the best solution for optimized resource allocation. Kubernetes continues to robustly satisfy even the most intensive requirements for application deployments.
If your organization is interested in optimizing Kubernetes deployment architectures to more efficiently execute your workloads, contact Shadow-Soft.