Sunday, July 16, 2017

Kubernetes Notes (2): Node Prioritization on Resources


When multiple nodes have enough resources available to run a pod, the Kubernetes scheduler selects the node with the highest score. Let's discuss how Kubernetes prioritizes nodes based on resources.

The Kubernetes scheduler has three resource-related scoring algorithms. The first one is the least_requested algorithm, with which the scheduler tends to spread pods out and keep the resource utilization rate on every node low. The algorithm looks like this:

// The unused capacity is calculated on a scale of 0-10
// 0 being the lowest priority and 10 being the highest.
// The more unused resources the higher the score is.
func calculateUnusedScore(requested int64, capacity int64, node string) int64 {
    if capacity == 0 {
        return 0
    }
    if requested > capacity {
        glog.V(10).Infof("Combined requested resources %d from existing pods exceeds capacity %d on node %s",
            requested, capacity, node)
        return 0
    }
    return ((capacity - requested) * 10) / capacity
}

allocatableResources := nodeInfo.AllocatableResource()
totalResources := *podRequests
totalResources.MilliCPU += nodeInfo.NonZeroRequest().MilliCPU
totalResources.Memory += nodeInfo.NonZeroRequest().Memory
cpuScore := calculateUnusedScore(totalResources.MilliCPU, allocatableResources.MilliCPU, node.Name)
memoryScore := calculateUnusedScore(totalResources.Memory, allocatableResources.Memory, node.Name)

The final score is the average of cpuScore and memoryScore. The code above shows that nodes with a lower resource utilization rate get a higher score, and therefore a higher priority when placing pods. If two nodes (with 2 CPUs and 4 CPUs respectively) are available when scheduling a pod that requests 1 CPU, the least_requested algorithm tends to select the node with 4 CPUs.
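To make this concrete, here is a minimal, runnable sketch of that calculation (the node-name parameter and logging are dropped, and the function is renamed so it doesn't clash with the excerpt above):

package main

import "fmt"

// unusedScore is a simplified version of calculateUnusedScore:
// more unused capacity yields a higher score on the 0-10 scale.
func unusedScore(requested, capacity int64) int64 {
    if capacity == 0 || requested > capacity {
        return 0
    }
    return ((capacity - requested) * 10) / capacity
}

func main() {
    fmt.Println(unusedScore(1000, 2000)) // 2-CPU node: (2000-1000)*10/2000 = 5
    fmt.Println(unusedScore(1000, 4000)) // 4-CPU node: (4000-1000)*10/4000 = 7
}

Since 7 > 5, the idle 4-CPU node wins.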

The most_requested algorithm behaves in the opposite way: with it, the Kubernetes scheduler tends to pack pods onto the nodes with the highest resource utilization rate. The scoring code is below:

// The used capacity is calculated on a scale of 0-10
// 0 being the lowest priority and 10 being the highest.
// The more resources are used the higher the score is. This function
// is almost a reversed version of least_requested_priority.calculateUnusedScore
// (10 - calculateUnusedScore). The main difference is in rounding. It was added to
// keep the final formula clean and not to modify the widely used (by users
// in their default scheduling policies) calculateUnusedScore.
func calculateUsedScore(requested int64, capacity int64, node string) int64 {
    if capacity == 0 {
        return 0
    }
    if requested > capacity {
        glog.V(10).Infof("Combined requested resources %d from existing pods exceeds capacity %d on node %s",
            requested, capacity, node)
        return 0
    }
    return (requested * 10) / capacity
}

If two nodes (with 2 CPUs and 4 CPUs respectively) are available when scheduling a pod that requests 1 CPU, the most_requested algorithm tends to select the node with 2 CPUs.
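Running the same 1-CPU pod through a simplified sketch of this scoring (again with logging and the node-name parameter dropped) shows why the smaller node wins:

package main

import "fmt"

// usedScore is a simplified version of calculateUsedScore:
// more used capacity yields a higher score on the 0-10 scale.
func usedScore(requested, capacity int64) int64 {
    if capacity == 0 || requested > capacity {
        return 0
    }
    return (requested * 10) / capacity
}

func main() {
    fmt.Println(usedScore(1000, 2000)) // 2-CPU node: 1000*10/2000 = 5
    fmt.Println(usedScore(1000, 4000)) // 4-CPU node: 1000*10/4000 = 2
}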

The third algorithm is balanced_resource_allocation, with which the Kubernetes scheduler tries to balance the utilization rates of CPU and memory. The related code looks like this:

allocatableResources := nodeInfo.AllocatableResource()
totalResources := *podRequests
totalResources.MilliCPU += nodeInfo.NonZeroRequest().MilliCPU
totalResources.Memory += nodeInfo.NonZeroRequest().Memory

cpuFraction := fractionOfCapacity(totalResources.MilliCPU, allocatableResources.MilliCPU)
memoryFraction := fractionOfCapacity(totalResources.Memory, allocatableResources.Memory)
score := int(0)
if cpuFraction >= 1 || memoryFraction >= 1 {
    // if requested >= capacity, the corresponding host should never be preferred.
    score = 0
} else {
    // Upper and lower boundary of difference between cpuFraction and memoryFraction are -1 and 1
    // respectively. Multiplying the absolute value of the difference by 10 scales the value to
    // 0-10 with 0 representing well balanced allocation and 10 poorly balanced. Subtracting it from
    // 10 leads to the score which also scales from 0 to 10 while 10 representing well balanced.
    diff := math.Abs(cpuFraction - memoryFraction)
    score = int(10 - diff*10)
}

The code above calculates the CPU and memory utilization rates first, and then their difference. The node with the largest difference between the two utilization rates gets the lowest priority.
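A small sketch makes the behavior visible (the utilization fractions below are made-up example values):

package main

import (
    "fmt"
    "math"
)

// balancedScore is a simplified version of the scoring branch above:
// the closer the CPU and memory utilization fractions are to each
// other, the higher the score.
func balancedScore(cpuFraction, memoryFraction float64) int {
    if cpuFraction >= 1 || memoryFraction >= 1 {
        return 0
    }
    diff := math.Abs(cpuFraction - memoryFraction)
    return int(10 - diff*10)
}

func main() {
    fmt.Println(balancedScore(0.5, 0.5))   // evenly used node: score 10
    fmt.Println(balancedScore(0.75, 0.25)) // lopsided node: score 5
}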

Note that none of these algorithms prefers fully loaded nodes. When a node reaches 100% CPU or memory utilization, its score is 0 and it gets the lowest priority for new pods.

When pods don't request resources explicitly (in the resources.requests field of the container spec), the Kubernetes scheduler treats them as requesting 0.1 CPU and 200MB of memory by default when scoring nodes (non-zero.go):

// For each of these resources, a pod that doesn't request the resource explicitly
// will be treated as having requested the amount indicated below, for the purpose
// of computing priority only. This ensures that when scheduling zero-request pods, such
// pods will not all be scheduled to the machine with the smallest in-use request,
// and that when scheduling regular pods, such pods will not see zero-request pods as
// consuming no resources whatsoever. We chose these values to be similar to the
// resources that we give to cluster addon pods (#10653). But they are pretty arbitrary.
// As described in #11713, we use request instead of limit to deal with resource requirements.
const DefaultMilliCpuRequest int64 = 100             // 0.1 core
const DefaultMemoryRequest int64 = 200 * 1024 * 1024 // 200 MB
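The substitution itself amounts to a few lines; the helper below is illustrative, not the actual function in the scheduler source:

// nonzeroRequest replaces an unspecified (zero) request with the
// default above before the scores are computed, e.g.
// nonzeroRequest(0, DefaultMilliCpuRequest) == 100 (0.1 core).
func nonzeroRequest(requested, defaultRequest int64) int64 {
    if requested == 0 {
        return defaultRequest
    }
    return requested
}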

The Kubernetes scheduler has an --algorithm-provider flag to configure which prioritization algorithms are used; it accepts two values, DefaultProvider and ClusterAutoscalerProvider (e.g. kube-scheduler --algorithm-provider=ClusterAutoscalerProvider). Both providers include the balanced_resource_allocation algorithm. The only difference between them is that DefaultProvider uses the least_requested algorithm, while ClusterAutoscalerProvider uses the most_requested algorithm.

Tuesday, July 4, 2017

Kubernetes Notes (1): Allocatable Resources

1. What is allocatable resource?

The capacity property in Kubernetes node status represents the total amount of resources (such as CPU and memory) on a node. Some node resources may be reserved for Kubernetes components (Kube-Reserved), and more may be reserved for other system components (System-Reserved). The allocatable resources are what remains after excluding the reserved resources from the capacity, and they are reported in the allocatable property of node status (https://github.com/kubernetes/kubernetes/blob/release-1.2/docs/proposals/node-allocatable.md).
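The relationship can be summarized in a few lines; the reservation numbers below are made-up examples, not defaults:

// allocatable = capacity - kubeReserved - systemReserved
capacityMilliCPU := int64(4000)      // a 4-core node
kubeReservedMilliCPU := int64(500)   // reserved for Kubernetes daemons (example value)
systemReservedMilliCPU := int64(500) // reserved for OS services (example value)
allocatableMilliCPU := capacityMilliCPU - kubeReservedMilliCPU - systemReservedMilliCPU // 3000m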

The more pods deployed onto a node, the fewer resources remain available to accommodate new containers. However, the allocatable resources reported by the command "kubectl describe node" stay the same no matter how many pods have been deployed onto the node, because allocatable describes the node's schedulable capacity rather than what is currently free. It's counterintuitive at first glance.

2. An issue about allocatable resource

The command "kubectl describe node" always reports 0 for the capacity and allocatable CPU and memory of Windows nodes in Kubernetes 1.6.

(Screenshot: output of "kubectl describe node" showing 0 for CPU and memory capacity and allocatable on a Windows node)

If Kubernetes fails to deploy Windows containers, it might be caused by this allocatable issue. When the Kubernetes scheduler selects nodes to deploy containers, it checks whether the nodes have enough resources available, as shown in the following code (https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/scheduler/algorithm/predicates/predicates.go):

allocatable := nodeInfo.AllocatableResource()
if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
    predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
}
if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
    predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
}

In the code above, the variable allocatable is the allocatable property of node status, podRequest is the requested resource of the pod to be deployed (the aggregated requests of all containers in the pod), and nodeInfo.RequestedResource() is the total requested resource of all pods already deployed onto the node. The node becomes a candidate for the new pod only when there is enough resource available.

Since the allocatable property of Windows nodes is always 0, the if conditions in the code above evaluate to true for any pod with non-zero requests, an InsufficientResourceError is recorded, and Kubernetes fails to find a node for the pod whenever its containers request resources explicitly.
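A toy walk-through of the CPU check with made-up numbers shows both the normal case and the Windows-node failure:

// insufficientCPU mirrors the predicate above: true means the pod
// does not fit on the node.
func insufficientCPU(allocatable, podRequest, alreadyRequested int64) bool {
    return allocatable < podRequest+alreadyRequested
}

// Healthy 2-CPU node, 500m already requested, new pod asks for 1 CPU:
// insufficientCPU(2000, 1000, 500) == false, so the pod fits.
// Windows node reporting allocatable 0, same pod:
// insufficientCPU(0, 1000, 0) == true, so scheduling always fails.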
