Sunday, July 16, 2017

Kubernetes Notes (2): Node Prioritization on Resources


When multiple nodes have enough resources available to run a pod, the Kubernetes scheduler selects the node with the highest score. Let's discuss how Kubernetes prioritizes nodes based on resources.

The Kubernetes scheduler has three scoring algorithms related to resources. The first is the least_requested algorithm, with which the scheduler tends to spread pods out and keep the resource utilization rate on every node low. The algorithm looks like this:

// The unused capacity is calculated on a scale of 0-10
// 0 being the lowest priority and 10 being the highest.
// The more unused resources the higher the score is.
func calculateUnusedScore(requested int64, capacity int64, node string) int64 {
    if capacity == 0 {
        return 0
    }
    if requested > capacity {
        glog.V(10).Infof("Combined requested resources %d from existing pods exceeds capacity %d on node %s",
            requested, capacity, node)
        return 0
    }
    return ((capacity - requested) * 10) / capacity
}

allocatableResources := nodeInfo.AllocatableResource()
totalResources := *podRequests
totalResources.MilliCPU += nodeInfo.NonZeroRequest().MilliCPU
totalResources.Memory += nodeInfo.NonZeroRequest().Memory
cpuScore := calculateUnusedScore(totalResources.MilliCPU, allocatableResources.MilliCPU, node.Name)
memoryScore := calculateUnusedScore(totalResources.Memory, allocatableResources.Memory, node.Name)

The final score is the average of cpuScore and memoryScore. The code above shows that nodes with a lower resource utilization rate get a higher score, and therefore a higher priority for new pods. If two nodes (with 2 CPU and 4 CPU respectively) are available when scheduling a pod requesting 1 CPU, the least_requested algorithm tends to select the node with 4 CPU, as the worked example below shows.
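
To make this concrete, here is a minimal, self-contained sketch (not the scheduler's actual code) that reruns calculateUnusedScore for the two-node example; the node names and the assumption that both nodes are otherwise idle are made up, and the glog call is dropped for brevity:

package main

import "fmt"

// calculateUnusedScore is the least_requested scoring function quoted above,
// with the logging call omitted.
func calculateUnusedScore(requested int64, capacity int64, node string) int64 {
    if capacity == 0 {
        return 0
    }
    if requested > capacity {
        return 0
    }
    return ((capacity - requested) * 10) / capacity
}

func main() {
    requested := int64(1000) // the pod asks for 1 CPU (1000 milli-CPU)
    fmt.Println(calculateUnusedScore(requested, 2000, "node-2cpu")) // (2000-1000)*10/2000 = 5
    fmt.Println(calculateUnusedScore(requested, 4000, "node-4cpu")) // (4000-1000)*10/4000 = 7
}

The 4-CPU node scores 7 versus 5 for the 2-CPU node, so it wins under least_requested.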

The most_requested algorithm behaves in the opposite way: with it, the Kubernetes scheduler tends to deploy pods onto the nodes with the highest resource utilization rate. The code to score nodes is below:

// The used capacity is calculated on a scale of 0-10
// 0 being the lowest priority and 10 being the highest.
// The more resources are used the higher the score is. This function
// is almost a reversed version of least_requested_priority.calculateUnusedScore
// (10 - calculateUnusedScore). The main difference is in rounding. It was added to
// keep the final formula clean and not to modify the widely used (by users
// in their default scheduling policies) calculateUnusedScore.
func calculateUsedScore(requested int64, capacity int64, node string) int64 {
    if capacity == 0 {
        return 0
    }
    if requested > capacity {
        glog.V(10).Infof("Combined requested resources %d from existing pods exceeds capacity %d on node %s",
            requested, capacity, node)
        return 0
    }
    return (requested * 10) / capacity
}

If two nodes (with 2 CPU and 4 CPU respectively) are available when scheduling a pod requesting 1 CPU, the most_requested algorithm tends to select the node with 2 CPU, as the sketch below shows.
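
A matching sketch for calculateUsedScore (same hypothetical idle nodes and 1-CPU pod, logging omitted) shows how the scores flip:

package main

import "fmt"

// calculateUsedScore is the most_requested scoring function quoted above,
// with the logging call omitted.
func calculateUsedScore(requested int64, capacity int64, node string) int64 {
    if capacity == 0 || requested > capacity {
        return 0
    }
    return (requested * 10) / capacity
}

func main() {
    requested := int64(1000) // the pod asks for 1 CPU (1000 milli-CPU)
    fmt.Println(calculateUsedScore(requested, 2000, "node-2cpu")) // (1000*10)/2000 = 5
    fmt.Println(calculateUsedScore(requested, 4000, "node-4cpu")) // (1000*10)/4000 = 2
}

Here the 2-CPU node scores 5 versus 2 for the 4-CPU node, so most_requested packs the pod onto the smaller, more utilized node.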

The third algorithm is balanced_resource_allocation, with which the Kubernetes scheduler tries to balance the utilization rates of CPU and memory. The related code looks like this:

allocatableResources := nodeInfo.AllocatableResource()
totalResources := *podRequests
totalResources.MilliCPU += nodeInfo.NonZeroRequest().MilliCPU
totalResources.Memory += nodeInfo.NonZeroRequest().Memory

cpuFraction := fractionOfCapacity(totalResources.MilliCPU, allocatableResources.MilliCPU)
memoryFraction := fractionOfCapacity(totalResources.Memory, allocatableResources.Memory)
score := int(0)
if cpuFraction >= 1 || memoryFraction >= 1 {
    // if requested >= capacity, the corresponding host should never be preferred.
    score = 0
} else {
    // Upper and lower boundary of difference between cpuFraction and memoryFraction are -1 and 1
    // respectively. Multiplying the absolute value of the difference by 10 scales the value to
    // 0-10 with 0 representing well balanced allocation and 10 poorly balanced. Subtracting it from
    // 10 leads to the score which also scales from 0 to 10 while 10 representing well balanced.
    diff := math.Abs(cpuFraction - memoryFraction)
    score = int(10 - diff*10)
}

The code above first calculates the CPU and memory utilization rates and then their difference. The node with the largest difference between the two rates gets the lowest priority, as the worked example below illustrates.
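
Below is a small, self-contained sketch of that calculation. fractionOfCapacity is not shown in the snippet above; the version here simply divides requested by capacity (treating a zero-capacity resource as fully used), and the node sizes in main are hypothetical:

package main

import (
    "fmt"
    "math"
)

// fractionOfCapacity returns requested/capacity, treating zero capacity as fully used.
func fractionOfCapacity(requested, capacity int64) float64 {
    if capacity == 0 {
        return 1
    }
    return float64(requested) / float64(capacity)
}

// balancedScore mirrors the scoring logic quoted above for a single node.
func balancedScore(cpuRequested, cpuCapacity, memRequested, memCapacity int64) int {
    cpuFraction := fractionOfCapacity(cpuRequested, cpuCapacity)
    memoryFraction := fractionOfCapacity(memRequested, memCapacity)
    if cpuFraction >= 1 || memoryFraction >= 1 {
        // if requested >= capacity, the node should never be preferred
        return 0
    }
    diff := math.Abs(cpuFraction - memoryFraction)
    return int(10 - diff*10)
}

func main() {
    // Hypothetical 4-CPU / 8 GiB node (values in milli-CPU and bytes).
    fmt.Println(balancedScore(2000, 4000, 4<<30, 8<<30)) // 50% CPU, 50% memory -> diff 0.0 -> score 10
    fmt.Println(balancedScore(3000, 4000, 2<<30, 8<<30)) // 75% CPU, 25% memory -> diff 0.5 -> score 5
}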

The Kubernetes scheduler doesn't prefer nodes with 100% CPU or memory utilization. When a node's CPU or memory utilization reaches 100%, its score is 0 and it has the lowest priority for new pods.

When pods don't request resources explicitly (in Resources.Requests of the deployment config), the Kubernetes scheduler treats them as requesting 0.1 CPU and 200 MB of memory by default when scoring nodes (non-zero.go):

// For each of these resources, a pod that doesn't request the resource explicitly
// will be treated as having requested the amount indicated below, for the purpose
// of computing priority only. This ensures that when scheduling zero-request pods, such
// pods will not all be scheduled to the machine with the smallest in-use request,
// and that when scheduling regular pods, such pods will not see zero-request pods as
// consuming no resources whatsoever. We chose these values to be similar to the
// resources that we give to cluster addon pods (#10653). But they are pretty arbitrary.
// As described in #11713, we use request instead of limit to deal with resource requirements.
const DefaultMilliCpuRequest int64 = 100             // 0.1 core
const DefaultMemoryRequest int64 = 200 * 1024 * 1024 // 200 MB
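
As an illustration of how these defaults come into play, here is a short sketch; nonzeroRequest is a made-up helper, not the scheduler's own function, but it mirrors the substitution the scheduler performs when computing the non-zero request of a pod:

package main

import "fmt"

const DefaultMilliCpuRequest int64 = 100             // 0.1 core
const DefaultMemoryRequest int64 = 200 * 1024 * 1024 // 200 MB

// nonzeroRequest substitutes the defaults above when a pod has no explicit request.
func nonzeroRequest(milliCPU, memory int64) (int64, int64) {
    if milliCPU == 0 {
        milliCPU = DefaultMilliCpuRequest
    }
    if memory == 0 {
        memory = DefaultMemoryRequest
    }
    return milliCPU, memory
}

func main() {
    fmt.Println(nonzeroRequest(0, 0))       // pod with no requests -> 100 209715200
    fmt.Println(nonzeroRequest(500, 1<<30)) // explicit requests are used as-is
}

So a pod that declares nothing still counts as 0.1 CPU and 200 MB when nodes are scored.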

The Kubernetes scheduler has an --algorithm-provider flag to configure which algorithms prioritize nodes; it has two options, DefaultProvider and ClusterAutoscalerProvider. Both include the balanced_resource_allocation algorithm. The difference between them is that DefaultProvider uses the least_requested algorithm, while ClusterAutoscalerProvider uses the most_requested algorithm.
