Friday, December 20, 2019

AKS (1) - Five-second latency when resolving DNS


We intermittently see 5s latencies when resolving DNS in AKS clusters that use Azure CNI networking. This article summarizes what we have learned from this issue.

Reproducing


First we create an AKS cluster with Azure CNI networking, and create an nginx pod in the cluster. If we run the following script in the pod, we will see intermittent 5s latencies:

#!/bin/bash
# Usage: <script> <iterations>
# Sends one HTTPS request every 3 seconds and prints curl's timing metrics;
# time_namelookup is the DNS resolution time we are interested in.
loops=$(($1+0))
for ((i=1;i<=$loops;i++))
do
        dt=`date '+%d/%m/%Y %H:%M:%S'`
        echo "$dt"
        curl -A `date +"[mycurltime:%Y-%m-%dT%H:%M:%S.%3N]"` -X GET 'https://www.google.com' --silent --output /dev/null -w " status=%{http_code} %{redirect_url} size=%{size_download} time_namelookup=%{time_namelookup} time_connect=%{time_connect} time_appconnect=%{time_appconnect} time_pretransfer=%{time_pretransfer} time_redirect=%{time_redirect}  time_starttransfer=%{time_starttransfer} time=%{time_total} num_redirects=%{num_redirects} speed_download=%{speed_download} speed_upload=%{speed_upload} num_connects=%{num_connects}  content-type="%{content_type}" "
        sleep 3
done
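
For example, assuming the script is saved as test-dns.sh (the file name and pod name here are just for illustration), we can copy it into the nginx pod and run 100 iterations:

# copy the script into the pod and run 100 iterations
kubectl cp test-dns.sh nginx:/tmp/test-dns.sh
kubectl exec nginx -- bash /tmp/test-dns.sh 100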

The log looks like:

status=200  size=12006 time_namelookup=5.517723 time_connect=5.522935 time_appconnect=5.550684 time_pretransfer=5.550865 time_redirect=0.000000  time_starttransfer=5.593997 time=5.594517 num_redirects=0 speed_download=2146.000 speed_upload=0.000 num_connects=1  content-type="text/html; charset=ISO-8859-1"

Notice that time_namelookup is more than 5 seconds.
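
The 5 seconds are not a coincidence: the glibc resolver retries a DNS query after a default timeout of 5 seconds, so a single lost reply shows up as a ~5s time_namelookup. It also helps to keep the pod's /etc/resolv.conf in mind when reading the capture below; in a Kubernetes pod it typically looks like the following (the nameserver IP and search list are illustrative), and the search domains together with ndots:5 explain why www.google.com is first expanded to *.svc.cluster.local and *.cluster.local:

nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5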

Root Cause


The following tcpdump capture explains the 5s latency:

23:12:21.269435 IP (tos 0x0, ttl 64, id 38022, offset 0, flags [DF], proto UDP (17), length 78)
    nginx.55103 > kube-dns.kube-system.svc.cluster.local.53: 59238+ A?www.google.com.svc.cluster.local. (50)
23:12:21.269514 IP (tos 0x0, ttl 64, id 38023, offset 0, flags [DF], proto UDP (17), length 78)
    nginx.55103 > kube-dns.kube-system.svc.cluster.local.53: 49176+ AAAA?www.google.com.svc.cluster.local. (50)
23:12:21.270342 IP (tos 0x0, ttl 64, id 28830, offset 0, flags [DF], proto UDP (17), length 171)
    kube-dns.kube-system.svc.cluster.local.53 > nginx.55103: 49176 NXDomain*- 0/1/0 (143)
23:12:21.270373 IP (tos 0x0, ttl 64, id 28831, offset 0, flags [DF], proto UDP (17), length 171)
    kube-dns.kube-system.svc.cluster.local.53 > nginx.55103: 59238 NXDomain*- 0/1/0 (143)
23:12:21.270469 IP (tos 0x0, ttl 64, id 38024, offset 0, flags [DF], proto UDP (17), length 74)
    nginx.35633 > kube-dns.kube-system.svc.cluster.local.53: 35084+ A?www.google.com.cluster.local. (46)
23:12:21.270555 IP (tos 0x0, ttl 64, id 38025, offset 0, flags [DF], proto UDP (17), length 74)
    nginx.35633 > kube-dns.kube-system.svc.cluster.local.53: 28679+ AAAA?www.google.com.cluster.local. (46)
23:12:21.271963 IP (tos 0x0, ttl 64, id 28832, offset 0, flags [DF], proto UDP (17), length 167)
    kube-dns.kube-system.svc.cluster.local.53 > nginx.35633: 28679 NXDomain*- 0/1/0 (139)
23:12:26.272196 IP (tos 0x0, ttl 64, id 38291, offset 0, flags [DF], proto UDP (17), length 74)
    nginx.35633 > kube-dns.kube-system.svc.cluster.local.53: 35084+ A?www.google.com.cluster.local. (46)
23:12:26.274297 IP (tos 0x0, ttl 64, id 29412, offset 0, flags [DF], proto UDP (17), length 167)
    kube-dns.kube-system.svc.cluster.local.53 > nginx.35633: 35084 NXDomain*- 0/1/0 (139)

We can see that each lookup sends two UDP requests from the same source port: one for the A record (IPv4) and the other for the AAAA record (IPv6). In the capture above, the pod sometimes receives a response to only one of the two requests; the other request gets no reply, times out after 5 seconds, is retried, and then succeeds. That timeout is exactly the 5s latency we observe in time_namelookup.

Even though it is easier to reproduce this issue in AKS clusters with CNI, the root cause is in neither AKS nor CNI. The root cause is a race condition in the Linux kernel's connection tracking (conntrack) code, which can drop one of the two UDP packets when they are sent through the same socket at almost the same time. A detailed explanation of the race condition is available in this blog post: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts.
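
One quick way to confirm that packets are being dropped by conntrack (assuming the conntrack tool is installed on the node, e.g. from the conntrack-tools package) is to watch the insert_failed counter while reproducing the issue:

# on the node: per-CPU conntrack statistics; a growing insert_failed
# counter indicates entries lost to the race
conntrack -S

# alternatively, the raw counters are exposed under /proc
cat /proc/net/stat/nf_conntrack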

Workarounds


Since the root cause is a race condition, we can work around the issue by avoiding the race. There are at least two options.

dnsConfig options


We can update dnsConfig in the pod's spec to avoid the race condition. For example, we can enable the option single-request-reopen:

dnsConfig: 
  options: 
    - name: single-request-reopen

The description of the option from the resolv.conf(5) man page is:

single-request-reopen (since glibc 2.9)
    Sets RES_SNGLKUPREOP in _res.options.  The resolver
    uses the same socket for the A and AAAA requests.  Some
    hardware mistakenly sends back only one reply.  When
    that happens the client system will sit and wait for
    the second reply.  Turning this option on changes this
    behavior so that if two requests from the same port are
    not handled correctly it will close the socket and open
    a new one before sending the second request.

Similarly, we can enable the option use-vc to force the resolver to use TCP for DNS lookups:

use-vc (since glibc 2.14)
    Sets RES_USEVC in _res.options.  This option forces the
    use of TCP for DNS resolutions.

Notice that these two options depend on glibc, but glibc is unavailable in some images, such as Alpine (which uses musl libc). We have to use other base images if we'd like to go with the dnsConfig options.

If we put everything together, the pod spec may look like:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      dnsConfig:
        options:
        - name: single-request-reopen
        - name: ndots
          value: "5"
        - name: use-vc
      containers:
      # placeholder container just to make the spec complete
      - name: app
        image: nginx
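
After the deployment is rolled out, we can verify the options took effect by checking /etc/resolv.conf inside one of its pods (replace the pod name accordingly):

# the configured options should show up in the pod's resolv.conf
kubectl exec <pod-name> -- cat /etc/resolv.conf
# expected to contain a line similar to:
#   options ndots:5 single-request-reopen use-vc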

Disable IPv6


We can also disable IPv6 on nodes. When IPv6 is disabled, there are no requests for AAAA records, so there are no race conditions.

We can use a daemonset to disable IPv6 on every node, so we don't have to do the same manually for newly provisioned nodes when scaling or upgrading. The following is a sample yaml file for the daemonset:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    component: ds-disable-ipv6
  name: ds-disable-ipv6
spec:
  selector:
    matchLabels:
      component: ds-disable-ipv6
      tier: node
  template:
    metadata:
      labels:
        component: ds-disable-ipv6
        tier: node
    spec:
      containers:
      - command:
        - nsenter
        - --target
        - "1"
        - --mount
        - --uts
        - --ipc
        - --net
        - --pid
        - --
        - sh
        - -c
        - |
          if ip a | grep "inet6"; then
            echo ipv6 is not disabled
            if cat /etc/default/grub | grep "GRUB_CMDLINE_LINUX_DEFAULT=" | grep "ipv6.disable=1"; then
              echo GRUB_CMDLINE_LINUX_DEFAULT is already disabled
            else
              echo disabling GRUB_CMDLINE_LINUX_DEFAULT
              sed -i 's/\bGRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 /g' /etc/default/grub
            fi
            if cat /etc/default/grub | grep "GRUB_CMDLINE_LINUX=" | grep "ipv6.disable=1"; then
              echo GRUB_CMDLINE_LINUX is already disabled
            else
              echo disabling GRUB_CMDLINE_LINUX
              sed -i 's/\bGRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="ipv6.disable=1 /g' /etc/default/grub
            fi
            update-grub
            reboot 
          else
            echo ipv6 is already disabled
          fi
          while true; do sleep 100000; done
        image: alpine
        imagePullPolicy: IfNotPresent
        name: sysctl-patch
        resources:
          requests:
            cpu: 10m
        securityContext:
          privileged: true
      dnsPolicy: ClusterFirst
      hostPID: true
      tolerations:
      - effect: NoSchedule
        operator: Exists
      restartPolicy: Always
  updateStrategy:
    type: RollingUpdate         
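
Once the daemonset has rolled out and the nodes have rebooted, we can verify the result (the label selector below matches the daemonset above):

# the daemonset pod should be running on every node
kubectl get pods -l component=ds-disable-ipv6 -o wide

# on a node (for example through the same nsenter trick), IPv6 should be gone
ip -6 addr show
grep ipv6.disable /proc/cmdline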
