Debugging Kubernetes: Node taints, Pod tolerations and Disk pressure
Background
A couple of days ago, over the weekend, I received a bunch of Alertmanager notifications from the cluster I was managing. A lot of the pods had been evicted, and new instances (pods) were not being created at all; they all remained in the Pending state. I decided to take a look at the cluster to see what was going on.
[devops@dev.compute.internal ~]# kubectl get po -n apps
NAME READY STATUS RESTARTS
service-A-ddfd7c77c-rlvqz 0/1 Evicted 0
service-A-7b55cff489-79k26 0/1 Pending 0
service-B-7b55cff489-8cwhs 0/1 Evicted 0
service-B-7b55cff489-kxv9m 0/1 Pending 0
This was not new. It had happened before, when more memory and CPU were allocated to the pods than the resources available on the server. We solved that by setting just enough resource requests and limits on the deployment and letting the Horizontal Pod Autoscaler (HPA) scale pods based on the memory and CPU usage reported by metrics-server.
resources:
  requests:
    memory: 750Mi
    cpu: 500m
  limits:
    memory: 1Gi
    cpu: 1
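For context, the autoscaling side looked roughly like the following HorizontalPodAutoscaler; the name, namespace and thresholds here are illustrative, not our exact manifest:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: service-a          # placeholder name
  namespace: apps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-a        # placeholder deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80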
So I decided to look at the details of one of the pending pods.
[devops@dev.compute.internal ~]# kubectl describe pods/service-A-7b55cff489-79k26 -n apps
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 19s (x15 over 3m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate
Taints and tolerations
A taint is a property of a node that allows it to repel a set of pods, depending on a certain condition being met by the node or a certain label being added to it. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.
Taints can be used to steer pods away from nodes or evict pods that shouldn’t be running, so this was the reason we were seeing pods getting evicted.
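As a quick illustration (the node name and the dedicated=gpu key below are made up), a taint goes on the node and a matching toleration goes into the pod spec:
# Taint a node so that only pods tolerating "dedicated=gpu" can be scheduled on it
kubectl taint nodes node-1 dedicated=gpu:NoSchedule

# Matching toleration in the pod spec
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"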
The node controller automatically taints a Node when certain conditions are true. These are some of the taints that are built in:
- node.kubernetes.io/not-ready: Node is not ready. This corresponds to the NodeCondition Ready being False.
- node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being Unknown.
- node.kubernetes.io/out-of-disk: Node becomes out of disk.
- node.kubernetes.io/memory-pressure: Node has memory pressure.
- node.kubernetes.io/disk-pressure: Node has disk pressure.
- node.kubernetes.io/network-unavailable: Node's network is unavailable.
- node.kubernetes.io/unschedulable: Node is unschedulable.
The node controller or kubelet adds the relevant taints, with a NoExecute taint effect, to the nodes depending on the conditions. This taint effect is what evicts pods that are already running (hence the Evicted state). The node lifecycle controller also automatically creates taints corresponding to node conditions with the NoSchedule effect; this is what made our pods remain in the Pending state. You can read more here about how the kubelet and Kubernetes manage resources.
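As an illustration (not taken from our manifests), a pod can tolerate one of these condition taints for a bounded time before the NoExecute effect evicts it:
tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300   # pod is evicted 5 minutes after the taint appears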
Now we needed to hunt down the condition that caused that taint. Apart from the built-in conditions, we can also explicitly set our own, but we had not set any at that point.
[devops@dev.compute.internal ~]# kubectl describe node dev.compute.internal
...
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Tue, 18 May 2020 13:45:44 +0200 Tue, 18 May 2020 11:31:53 +0200 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Tue, 18 May 2020 13:45:44 +0200 Tue, 18 May 2020 11:31:53 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Tue, 18 May 2020 13:45:44 +0200 Tue, 18 May 2020 11:31:53 +0200 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Tue, 18 May 2020 13:45:44 +0200 Tue, 18 May 2020 11:31:53 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 18 May 2020 13:45:44 +0200 Tue, 18 May 2020 11:32:19 +0200 KubeletReady kubelet is posting ready status
...
So our machine has disk pressure. Time to get to work, huh?!
Disk pressure
This means that the node does not have enough disk space left for what the kubelet and its workloads need. First, let's check the status of the kubelet.
[devops@dev.compute.internal ~]# systemctl status kubelet -l
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2020-02-10 10:39:03 CAT; 3 months 8 days ago
Docs: https://kubernetes.io/docs/
Main PID: 154232 (kubelet)
Tasks: 118
Memory: 170.4M
CGroup: /system.slice/kubelet.service
└─154232 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.2 --cgroup-driver=systemd
May 19 11:28:43 dev.compute.internal kubelet[18808]: W0803 17:02:43.866083 18808 eviction_manager.go:142] Failed to admit pod kube-proxy-z46zm_kube-system(ea8815fc-96fb-11e8-80c4-fa163eb0739b) - node has conditions: [DiskPressure]
May 19 11:28:44 dev.compute.internal kubelet[18808]: W0803 17:02:44.466877 18808 eviction_manager.go:142] Failed to admit pod nvidia-device-plugin-daemonset-ldwsr_kube-system(eae3b613-96fb-11e8-80c4-fa163eb0739b) - node has conditions: [DiskPressure]
May 19 11:28:45 dev.compute.internal kubelet[18808]: W0803 17:02:45.069755 18808 eviction_manager.go:142] Failed to admit pod calico-node-fjd28_kube-system(eb3fac9b-96fb-11e8-80c4-fa163eb0739b) - node has conditions: [DiskPressure]
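The kubelet reports DiskPressure when available disk space or inodes on the node drop below its eviction thresholds, which live in the kubelet configuration (/var/lib/kubelet/config.yaml on this node). The values below are the documented defaults; we had not tuned anything, so something like this is what applied here:
# Excerpt (assumed, based on the defaults) from a KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"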
My first instinct was to remove all unused local volumes and data used by Docker: since this is a dev environment and we deploy every time someone merges to master, it must have accumulated a lot of unused images. However,
docker volume prune
and docker system prune -a
did not help much.
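For reference, a quick way to see how much space Docker is actually holding (and whether pruning is likely to help) is:
docker system df      # summary of images, containers, local volumes and build cache
docker system df -v   # verbose, per-image and per-volume breakdown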
Then I checked the disk usage on the node.
[devops@dev.compute.internal ~]# df -H
Filesystem Size Used Avail Use% Mounted on
devtmpfs 34G 0 34G 0% /dev
tmpfs 34G 0 34G 0% /dev/shm
tmpfs 34G 43M 34G 1% /run
tmpfs 34G 0 34G 0% /sys/fs/cgroup
/dev/mapper/centos-root 50G 49G 100M 99% /
/dev/sda2 1.1G 283M 781M 27% /boot
/dev/sda1 210M 12M 198M 6% /boot/efi
/dev/mapper/centos-home 1.1T 41M 1.1T 0% /home
Voilà, it looks like our root filesystem is running out of space. The next step is to reallocate some of the space assigned to /home back to /.
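Before resizing anything, it can help to confirm what is actually consuming the root filesystem (the paths below are the usual suspects on a Kubernetes node; adjust to your setup):
# Largest top-level directories on the root filesystem (-x stays on one filesystem)
du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -15
du -sh /var/lib/docker /var/lib/kubelet 2>/dev/null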
Resize LVM Partition
Sometimes when creating a new server, the drive is partitioned into root, boot and swap, and then all the rest of the space is given to the home directory.
Here, we are going to reduce the size of the /home partition and allocate the remaining space back to the root partition. These are the steps I took:
- Back up home. Back up all the content of the home directory, as we are going to recreate the whole directory.
tar -czvf /root/home-backup.tgz -C /home .
- Reduce the size of the /home partition:
- Unmount home.
umount /dev/mapper/centos-home
- Remove the home logical volume.
lvremove /dev/mapper/centos-home
- Recreate a new 100GB logical volume for /home.
lvcreate -L 100G -n home centos
- Format the home logical volume.
mkfs.xfs /dev/centos/home
- Mount it back.
mount /dev/mapper/centos-home /home
- Extend the root logical volume with all of the remaining free space and resize the filesystem.
lvextend -r -l +100%FREE /dev/mapper/centos-root
- Restore the contents of the /home directory.
tar -xzvf /root/home-backup.tgz -C /home
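To double-check that LVM picked up the new layout, you can list the volume group and logical volumes:
vgs   # free space left in the centos volume group (should now be close to zero)
lvs   # new sizes of the root and home logical volumes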
Now, if you check the filesystems on the disk again:
[devops@dev.compute.internal ~]# df -H
Filesystem Size Used Avail Use% Mounted on
devtmpfs 34G 0 34G 0% /dev
tmpfs 34G 0 34G 0% /dev/shm
tmpfs 34G 43M 34G 1% /run
tmpfs 34G 0 34G 0% /sys/fs/cgroup
/dev/mapper/centos-root 628G 48G 581G 8% /
/dev/sda2 1.1G 283M 781M 27% /boot
/dev/sda1 210M 12M 198M 6% /boot/efi
/dev/mapper/centos-home 537G 41M 537G 1% /home
tmpfs 6.8G 13k 6.8G 1% /run/user/42
After that I restarted the kubelet and checked its status again. I did not have to, but one can't be too careful :smile:
[devops@dev.compute.internal ~]# systemctl status kubelet -l
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2020-02-10 10:39:03 CAT; 3 months 8 days ago
Docs: https://kubernetes.io/docs/
Main PID: 154232 (kubelet)
Tasks: 118
Memory: 170.4M
CGroup: /system.slice/kubelet.service
└─154232 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.2 --cgroup-driver=systemd
May 19 12:57:55 dev.compute.internal kubelet[14864]: W0519 11:57:55.317331 14864 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/b36e8f1c-2229-4bb9-9c73-8c54442a2b62/volumes/kubernetes.io~secret/config-volume and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
May 19 12:57:55 dev.compute.internal kubelet[14864]: W0519 11:57:55.317571 14864 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/b36e8f1c-2229-4bb9-9c73-8c54442a2b62/volumes/kubernetes.io~secret/prometheus-operator-alertmanager-token-k4xtz and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
May 19 12:58:15 dev.compute.internal kubelet[14864]: W0519 11:58:15.384272 14864 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/749d9e1e-f588-4a66-92b8-7acddc7cd28f/volumes/kubernetes.io~secret/tls-assets and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
May 19 12:58:15 dev.compute.internal kubelet[14864]: W0519 11:58:15.384328 14864 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/749d9e1e-f588-4a66-92b8-7acddc7cd28f/volumes/kubernetes.io~secret/config and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
May 19 12:58:15 dev.compute.internal kubelet[14864]: W0519 11:58:15.384500 14864 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/a1396204-cbe1-46b3-9007-eb30c806f959/volumes/kubernetes.io~secret/prometheus-operator-grafana-token-ljflf and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
All the pending pods started running again. I just had to delete all the Evicted pods:
kubectl get po -A |grep Evicted|awk '{print "kubectl delete po -n ",$1,$2}'|bash -x
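Evicted pods sit in the Failed phase, so if your kubectl version supports field selectors, a simpler (if slightly broader, since it deletes all Failed pods) alternative is:
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed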
I will probably write about how node taints can work in your favour: how setting them yourself, apart from the built-in ones, can help you manage your resources.