We intermittently see 5s DNS resolution latencies in AKS clusters that use Azure CNI networking. This article summarizes what we have learned from this issue.
Reproducing
First, we create an AKS cluster with Azure CNI networking and create an nginx pod in the cluster. If we run the following script in the pod, we will see intermittent 5s latencies:
#!/bin/bash
# Takes the number of iterations as the first argument.
loops=$(($1+0))
for ((i=1;i<=$loops;i++))
do
  dt=`date '+%d/%m/%Y %H:%M:%S'`
  echo "$dt"
  # Print detailed timings for each request; time_namelookup shows the DNS latency.
  curl -A `date +"[mycurltime:%Y-%m-%dT%H:%M:%S.%3N]"` -X GET 'https://www.google.com' --silent --output /dev/null -w " status=%{http_code} %{redirect_url} size=%{size_download} time_namelookup=%{time_namelookup} time_connect=%{time_connect} time_appconnect=%{time_appconnect} time_pretransfer=%{time_pretransfer} time_redirect=%{time_redirect} time_starttransfer=%{time_starttransfer} time=%{time_total} num_redirects=%{num_redirects} speed_download=%{speed_download} speed_upload=%{speed_upload} num_connects=%{num_connects} content-type="%{content_type}" "
  sleep 3
done
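One possible way to run the script is to copy it into the pod and execute it there. This is just a sketch; the file name dns-test.sh and the pod name nginx are examples:
kubectl cp dns-test.sh nginx:/tmp/dns-test.sh
kubectl exec -it nginx -- bash /tmp/dns-test.sh 100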
The log looks like:
status=200 size=12006 time_namelookup=5.517723 time_connect=5.522935 time_appconnect=5.550684 time_pretransfer=5.550865 time_redirect=0.000000 time_starttransfer=5.593997 time=5.594517 num_redirects=0 speed_download=2146.000 speed_upload=0.000 num_connects=1 content-type="text/html; charset=ISO-8859-1"
Notice that time_namelookup is more than 5 seconds.
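Since the latency is intermittent, it helps to count how often the 5s timeout is hit over a run. A simple check, assuming the script output was redirected to a file named dns-test.log (a hypothetical name):
# Count requests whose DNS lookup took 5 seconds or more.
grep -c "time_namelookup=5" dns-test.log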
Root Cause
The following tcpdump output explains the 5s latency:
23:12:21.269435 IP (tos 0x0, ttl 64, id 38022, offset 0, flags [DF], proto UDP (17), length 78)
nginx.55103 > kube-dns.kube-system.svc.cluster.local.53: 59238+ A?www.google.com.svc.cluster.local. (50)
23:12:21.269514 IP (tos 0x0, ttl 64, id 38023, offset 0, flags [DF], proto UDP (17), length 78)
nginx.55103 > kube-dns.kube-system.svc.cluster.local.53: 49176+ AAAA?www.google.com.svc.cluster.local. (50)
23:12:21.270342 IP (tos 0x0, ttl 64, id 28830, offset 0, flags [DF], proto UDP (17), length 171)
kube-dns.kube-system.svc.cluster.local.53 > nginx.55103: 49176 NXDomain*- 0/1/0 (143)
23:12:21.270373 IP (tos 0x0, ttl 64, id 28831, offset 0, flags [DF], proto UDP (17), length 171)
kube-dns.kube-system.svc.cluster.local.53 > nginx.55103: 59238 NXDomain*- 0/1/0 (143)
23:12:21.270469 IP (tos 0x0, ttl 64, id 38024, offset 0, flags [DF], proto UDP (17), length 74)
nginx.35633 > kube-dns.kube-system.svc.cluster.local.53: 35084+ A?www.google.com.cluster.local. (46)
23:12:21.270555 IP (tos 0x0, ttl 64, id 38025, offset 0, flags [DF], proto UDP (17), length 74)
nginx.35633 > kube-dns.kube-system.svc.cluster.local.53: 28679+ AAAA?www.google.com.cluster.local. (46)
23:12:21.271963 IP (tos 0x0, ttl 64, id 28832, offset 0, flags [DF], proto UDP (17), length 167)
kube-dns.kube-system.svc.cluster.local.53 > nginx.35633: 28679 NXDomain*- 0/1/0 (139)
23:12:26.272196 IP (tos 0x0, ttl 64, id 38291, offset 0, flags [DF], proto UDP (17), length 74)
nginx.35633 > kube-dns.kube-system.svc.cluster.local.53: 35084+ A?www.google.com.cluster.local. (46)
23:12:26.274297 IP (tos 0x0, ttl 64, id 29412, offset 0, flags [DF], proto UDP (17), length 167)
kube-dns.kube-system.svc.cluster.local.53 > nginx.35633: 35084 NXDomain*- 0/1/0 (139)
We can see that there are two UDP requests to resolve the domain name: one for the A record (IPv4) and the other for the AAAA record (IPv6). As shown in the logs above, sometimes we receive a response for only one of the requests, and the missing response causes the 5s latency: the resolver times out, retries 5 seconds later, and then succeeds.
Even though it is easier to reproduce this issue in AKS clusters with Azure CNI, the root cause is not in AKS or CNI. The root cause is a race condition in the Linux kernel's connection tracking (conntrack) that can drop one of the UDP packets when both are sent through the same socket at the same time. A detailed explanation of the race condition is available in this blog post: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts.
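One way to confirm that a node is hitting this race is to look at the conntrack statistics: the insert_failed counter increases when the kernel fails to insert a conntrack entry and drops the packet. A minimal check, assuming the conntrack-tools package is available on the node:
# Run on the node (or nsenter into the host network namespace) and look at insert_failed.
conntrack -S
# The raw per-CPU counters are also exposed here (values in hex):
cat /proc/net/stat/nf_conntrack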
Workarounds
Since the root cause is a race condition, we can work around the issue by avoiding the race. At least two options are available.
dnsConfig options
We can update dnsConfig in the pod's spec to avoid the race condition. For example, we can enable the option single-request-reopen:
dnsConfig:
  options:
    - name: single-request-reopen
The description of the option in the resolv.conf(5) man page is:
single-request-reopen (since glibc 2.9)
Sets RES_SNGLKUPREOP in _res.options. The resolver
uses the same socket for the A and AAAA requests. Some
hardware mistakenly sends back only one reply. When
that happens the client system will sit and wait for
the second reply. Turning this option on changes this
behavior so that if two requests from the same port are
not handled correctly it will close the socket and open
a new one before sending the second request.
Similarly, we can enable the option use-vc to force the use of TCP for DNS resolution:
use-vc (since glibc 2.14)
Sets RES_USEVC in _res.options. This option forces the
use of TCP for DNS resolutions.
Notice that these two options depend on glibc, but glibc is unavailable in some images, such as Alpine (which uses musl libc). We have to use other base images if we'd like to go with the dnsConfig options.
If we put everything together, the pod spec may look like:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: single-request-reopen
          - name: ndots
            value: "5"
          - name: use-vc
      containers:
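Once the containers section is filled in and the deployment is applied, we can verify that the options took effect by inspecting the pod's /etc/resolv.conf, since Kubernetes merges dnsConfig options into the resolver configuration. A quick check, assuming the resulting pod is named app-xxxx (a hypothetical name):
kubectl exec -it app-xxxx -- cat /etc/resolv.conf
# Expect an options line similar to:
# options ndots:5 single-request-reopen use-vc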
Disable IPv6
We can also disable IPv6 on the nodes. When IPv6 is disabled, there are no requests for AAAA records, so there is no race condition.
We can use a DaemonSet to disable IPv6 on every node, so we don't have to do the same manually for newly provisioned nodes when scaling or upgrading. The following is a sample yaml file for the DaemonSet:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    component: ds-disable-ipv6
  name: ds-disable-ipv6
spec:
  selector:
    matchLabels:
      component: ds-disable-ipv6
      tier: node
  template:
    metadata:
      labels:
        component: ds-disable-ipv6
        tier: node
    spec:
      containers:
        - command:
            - nsenter
            - --target
            - "1"
            - --mount
            - --uts
            - --ipc
            - --net
            - --pid
            - --
            - sh
            - -c
            - |
              if ip a | grep "inet6"; then
                echo ipv6 is not disabled
                if cat /etc/default/grub | grep "GRUB_CMDLINE_LINUX_DEFAULT=" | grep "ipv6.disable=1"; then
                  echo GRUB_CMDLINE_LINUX_DEFAULT already contains ipv6.disable=1
                else
                  echo adding ipv6.disable=1 to GRUB_CMDLINE_LINUX_DEFAULT
                  sed -i 's/\bGRUB_CMDLINE_LINUX_DEFAULT="/GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 /g' /etc/default/grub
                fi
                if cat /etc/default/grub | grep "GRUB_CMDLINE_LINUX=" | grep "ipv6.disable=1"; then
                  echo GRUB_CMDLINE_LINUX already contains ipv6.disable=1
                else
                  echo adding ipv6.disable=1 to GRUB_CMDLINE_LINUX
                  sed -i 's/\bGRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="ipv6.disable=1 /g' /etc/default/grub
                fi
                update-grub
                reboot
              else
                echo ipv6 is already disabled
              fi
              while true; do sleep 100000; done
          image: alpine
          imagePullPolicy: IfNotPresent
          name: sysctl-patch
          resources:
            requests:
              cpu: 10m
          securityContext:
            privileged: true
      dnsPolicy: ClusterFirst
      hostPID: true
      tolerations:
        - effect: NoSchedule
          operator: Exists
      restartPolicy: Always
  updateStrategy:
    type: RollingUpdate
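To roll this out, apply the manifest; on any node where IPv6 is still enabled, the pod updates GRUB and reboots the node. Afterwards we can confirm IPv6 is gone because no inet6 addresses are listed. A minimal check, assuming the manifest is saved as ds-disable-ipv6.yaml (a hypothetical file name):
kubectl apply -f ds-disable-ipv6.yaml
# On a node after the reboot, this should print nothing:
ip a | grep inet6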