We intermittently see 5s DNS resolution latencies in AKS clusters that use Azure CNI networking. This article summarizes what we have learned from this issue.
Reproducing
First, we create an AKS cluster with Azure CNI networking and create an nginx pod in the cluster. If we run the following script in the pod, we will see intermittent 5s latencies:
#!/bin/bash
# Takes the number of iterations as the first argument.
loops=$(($1+0))
for ((i=1;i<=$loops;i++))
do
  dt=`date '+%d/%m/%Y %H:%M:%S'`
  echo "$dt"
  # Print detailed timings for each request; time_namelookup shows the DNS latency.
  curl -A `date +"[mycurltime:%Y-%m-%dT%H:%M:%S.%3N]"` -X GET 'https://www.google.com' --silent --output /dev/null -w " status=%{http_code} %{redirect_url} size=%{size_download} time_namelookup=%{time_namelookup} time_connect=%{time_connect} time_appconnect=%{time_appconnect} time_pretransfer=%{time_pretransfer} time_redirect=%{time_redirect} time_starttransfer=%{time_starttransfer} time=%{time_total} num_redirects=%{num_redirects} speed_download=%{speed_download} speed_upload=%{speed_upload} num_connects=%{num_connects} content-type="%{content_type}" "
  sleep 3
done
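One possible way to run the script is to copy it into the pod and execute it there. This is just a sketch; the file name dns-test.sh and the pod name nginx are examples:
kubectl cp dns-test.sh nginx:/tmp/dns-test.sh
kubectl exec -it nginx -- bash /tmp/dns-test.sh 100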
The log looks like:
status=200 size=12006 time_namelookup=5.517723 time_connect=5.522935 time_appconnect=5.550684 time_pretransfer=5.550865 time_redirect=0.000000 time_starttransfer=5.593997 time=5.594517 num_redirects=0 speed_download=2146.000 speed_upload=0.000 num_connects=1 content-type="text/html; charset=ISO-8859-1"
Notice that time_namelookup is more than 5 seconds.
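Since the latency is intermittent, it helps to count how often the 5s timeout is hit over a run. A simple check, assuming the script output was redirected to a file named dns-test.log (a hypothetical name):
# Count requests whose DNS lookup took 5 seconds or more.
grep -c "time_namelookup=5" dns-test.log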
Root Cause
The following tcpdump output explains the 5s latency:
23:12:21.269435 IP (tos 0x0, ttl 64, id 38022, offset 0, flags [DF], proto UDP (17), length 78)
nginx.55103 > kube-dns.kube-system.svc.cluster.local.53: 59238+ A?www.google.com.svc.cluster.local. (50)
23:12:21.269514 IP (tos 0x0, ttl 64, id 38023, offset 0, flags [DF], proto UDP (17), length 78)
nginx.55103 > kube-dns.kube-system.svc.cluster.local.53: 49176+ AAAA?www.google.com.svc.cluster.local. (50)
23:12:21.270342 IP (tos 0x0, ttl 64, id 28830, offset 0, flags [DF], proto UDP (17), length 171)
kube-dns.kube-system.svc.cluster.local.53 > nginx.55103: 49176 NXDomain*- 0/1/0 (143)
23:12:21.270373 IP (tos 0x0, ttl 64, id 28831, offset 0, flags [DF], proto UDP (17), length 171)
kube-dns.kube-system.svc.cluster.local.53 > nginx.55103: 59238 NXDomain*- 0/1/0 (143)
23:12:21.270469 IP (tos 0x0, ttl 64, id 38024, offset 0, flags [DF], proto UDP (17), length 74)
nginx.35633 > kube-dns.kube-system.svc.cluster.local.53: 35084+ A?www.google.com.cluster.local. (46)
23:12:21.270555 IP (tos 0x0, ttl 64, id 38025, offset 0, flags [DF], proto UDP (17), length 74)
nginx.35633 > kube-dns.kube-system.svc.cluster.local.53: 28679+ AAAA?www.google.com.cluster.local. (46)
23:12:21.271963 IP (tos 0x0, ttl 64, id 28832, offset 0, flags [DF], proto UDP (17), length 167)
kube-dns.kube-system.svc.cluster.local.53 > nginx.35633: 28679 NXDomain*- 0/1/0 (139)
23:12:26.272196 IP (tos 0x0, ttl 64, id 38291, offset 0, flags [DF], proto UDP (17), length 74)
nginx.35633 > kube-dns.kube-system.svc.cluster.local.53: 35084+ A?www.google.com.cluster.local. (46)
23:12:26.274297 IP (tos 0x0, ttl 64, id 29412, offset 0, flags [DF], proto UDP (17), length 167)
kube-dns.kube-system.svc.cluster.local.53 > nginx.35633: 35084 NXDomain*- 0/1/0 (139)
We can see that there are two UDP requests to resolve the domain name: one for the A record (IPv4) and the other for the AAAA record (IPv6). As shown in the logs above, sometimes we receive a response for only one of the requests, and the missing response causes the 5s latency: the resolver times out, retries 5 seconds later, and then succeeds.
Even though it is easier to reproduce this issue in AKS clusters with Azure CNI, the root cause is not in AKS or CNI. The root cause is a race condition in the Linux kernel's connection tracking (conntrack) that can drop one of the UDP packets when both are sent through the same socket at the same time. A detailed explanation of the race condition is available in this blog post: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts.
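One way to confirm that a node is hitting this race is to look at the conntrack statistics: the insert_failed counter increases when the kernel fails to insert a conntrack entry and drops the packet. A minimal check, assuming the conntrack-tools package is available on the node:
# Run on the node (or nsenter into the host network namespace) and look at insert_failed.
conntrack -S
# The raw per-CPU counters are also exposed here (values in hex):
cat /proc/net/stat/nf_conntrack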
Workarounds
Since the root cause is a race condition, we can work around the issue by avoiding the race. At least two options are available.
dnsConfig options
We can update dnsConfig in the pod's spec to avoid the race condition. For example, we can enable the option single-request-reopen:
dnsConfig:
  options:
    - name: single-request-reopen
The description of the option in the resolv.conf(5) man page is:
single-request-reopen (since glibc 2.9)
Sets RES_SNGLKUPREOP in _res.options. The resolver
uses the same socket for the A and AAAA requests. Some
hardware mistakenly sends back only one reply. When
that happens the client system will sit and wait for
the second reply. Turning this option on changes this
behavior so that if two requests from the same port are
not handled correctly it will close the socket and open
a new one before sending the second request.
Similarly, we can enable the option use-vc to force the use of TCP for DNS resolution:
use-vc (since glibc 2.14)
Sets RES_USEVC in _res.options. This option forces the
use of TCP for DNS resolutions.
Notice that these two options depend on glibc, but glibc is unavailable in some images, such as Alpine (which uses musl libc). We have to use other base images if we'd like to go with the dnsConfig options.
If we put everything together, the pod spec may look like:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: single-request-reopen
          - name: ndots
            value: "5"
          - name: use-vc
      containers:
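Once the containers section is filled in and the deployment is applied, we can verify that the options took effect by inspecting the pod's /etc/resolv.conf, since Kubernetes merges dnsConfig options into the resolver configuration. A quick check, assuming the resulting pod is named app-xxxx (a hypothetical name):
kubectl exec -it app-xxxx -- cat /etc/resolv.conf
# Expect an options line similar to:
# options ndots:5 single-request-reopen use-vc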
Disable IPv6
We can also disable IPv6 on the nodes. When IPv6 is disabled, there are no requests for AAAA records, so there is no race condition.
We can use a DaemonSet to disable IPv6 on every node, so we don't have to do the same manually for newly provisioned nodes when scaling or upgrading. The following is a sample yaml file for the DaemonSet:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    component: ds-disable-ipv6
  name: ds-disable-ipv6
spec:
  selector:
    matchLabels:
      component: ds-disable-ipv6
      tier: node
  template:
    metadata:
      labels:
        component: ds-disable-ipv6
        tier: node
    spec:
      containers:
        - command:
            - nsenter
            - --target
            - "1"
            - --mount
            - --uts
            - --ipc
            - --net
            - --pid
            - --
            - sh
            - -c
            - |
              if ip a | grep "inet6"; then
                echo ipv6 is not disabled
                if cat /etc/default/grub | grep "GRUB_CMDLINE_LINUX_DEFAULT=" | grep "ipv6.disable=1"; then
                  echo GRUB_CMDLINE_LINUX_DEFAULT already contains ipv6.disable=1
                else
                  echo adding ipv6.disable=1 to GRUB_CMDLINE_LINUX_DEFAULT
                  sed -i 's/\bGRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 /g' /etc/default/grub
                fi
                if cat /etc/default/grub | grep "GRUB_CMDLINE_LINUX=" | grep "ipv6.disable=1"; then
                  echo GRUB_CMDLINE_LINUX already contains ipv6.disable=1
                else
                  echo adding ipv6.disable=1 to GRUB_CMDLINE_LINUX
                  sed -i 's/\bGRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="ipv6.disable=1 /g' /etc/default/grub
                fi
                update-grub
                reboot
              else
                echo ipv6 is already disabled
              fi
              while true; do sleep 100000; done
          image: alpine
          imagePullPolicy: IfNotPresent
          name: sysctl-patch
          resources:
            requests:
              cpu: 10m
          securityContext:
            privileged: true
      dnsPolicy: ClusterFirst
      hostPID: true
      tolerations:
        - effect: NoSchedule
          operator: Exists
      restartPolicy: Always
  updateStrategy:
    type: RollingUpdate
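To roll this out, apply the manifest; on any node where IPv6 is still enabled, the pod updates GRUB and reboots the node. Afterwards we can confirm IPv6 is gone because no inet6 addresses are listed. A minimal check, assuming the manifest is saved as ds-disable-ipv6.yaml (a hypothetical file name):
kubectl apply -f ds-disable-ipv6.yaml
# On a node after the reboot, this should print nothing:
ip a | grep inet6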