Debugging Kubernetes: Node taints, Pod tolerations and Disk pressure
Background
A couple of days ago, over the weekend, I received a bunch of Alertmanager notifications from the cluster I was managing. A lot of the pods had been evicted, and new instances (pods) were not being created at all; they all remained in the Pending state. I decided to take a look at the cluster to see what was going on.
[devops@dev.compute.internal ~]# kubectl get po -n apps
NAME READY STATUS RESTARTS
service-A-ddfd7c77c-rlvqz 0/1 Evicted 0
service-A-7b55cff489-79k26 0/1 Pending 0
service-B-7b55cff489-8cwhs 0/1 Evicted 0
service-B-7b55cff489-kxv9m 0/1 Pending 0
This was not new. It had happened before, when more memory and CPU were allocated to the pods than the resources available on the server. We solved that by setting just enough resource requests and limits on the deployment and letting the Horizontal Pod Autoscaler (HPA) scale pods based on the memory and CPU usage reported by metrics-server.
resources:
  requests:
    memory: 750Mi
    cpu: 500m
  limits:
    memory: 1Gi
    cpu: 1
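For context, the autoscaling side looked roughly like the following HorizontalPodAutoscaler; the name, namespace and thresholds here are illustrative, not our exact manifest:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: service-a          # placeholder name
  namespace: apps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-a        # placeholder deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80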
So I decided to look at the details of one of the pending pods.
[devops@dev.compute.internal ~]# kubectl describe pods/service-A-7b55cff489-79k26 -n apps
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 19s (x15 over 3m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate
Taints and tolerations
A taint is a property of a node that allows it to repel a set of pods, depending on a certain condition being met by the node or a certain label being added to it. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.
Taints can be used to steer pods away from nodes or evict pods that shouldn’t be running, so this was the reason we were seeing pods getting evicted.
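As a quick illustration (the node name and the dedicated=gpu key below are made up), a taint goes on the node and a matching toleration goes into the pod spec:
# Taint a node so that only pods tolerating "dedicated=gpu" can be scheduled on it
kubectl taint nodes node-1 dedicated=gpu:NoSchedule

# Matching toleration in the pod spec
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"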
The node controller automatically taints a Node when certain conditions are true. These are some of the taints that are built in:
- node.kubernetes.io/not-ready: Node is not ready. This corresponds to the NodeCondition Ready being False.
- node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being Unknown.
- node.kubernetes.io/out-of-disk: Node becomes out of disk.
- node.kubernetes.io/memory-pressure: Node has memory pressure.
- node.kubernetes.io/disk-pressure: Node has disk pressure.
- node.kubernetes.io/network-unavailable: Node's network is unavailable.
- node.kubernetes.io/unschedulable: Node is unschedulable.
The node controller or kubelet adds the relevant taints, with a NoExecute taint effect, to the nodes depending on the conditions. This taint effect is what evicts pods that are already running (hence the Evicted state). The node lifecycle controller also automatically creates taints corresponding to node conditions with the NoSchedule effect; this is what made our pods remain in the Pending state. You can read more here about how the kubelet and Kubernetes manage resources.
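As an illustration (not taken from our manifests), a pod can tolerate one of these condition taints for a bounded time before the NoExecute effect evicts it:
tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300   # pod is evicted 5 minutes after the taint appears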
Now we needed to hunt down the condition that caused that taint. Apart from the built-in conditions, we can also explicitly set our own, but we had not set any at that point.
[devops@dev.compute.internal ~]# kubectl describe node dev.compute.internal
...
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Tue, 18 May 2020 13:45:44 +0200 Tue, 18 May 2020 11:31:53 +0200 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Tue, 18 May 2020 13:45:44 +0200 Tue, 18 May 2020 11:31:53 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Tue, 18 May 2020 13:45:44 +0200 Tue, 18 May 2020 11:31:53 +0200 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Tue, 18 May 2020 13:45:44 +0200 Tue, 18 May 2020 11:31:53 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 18 May 2020 13:45:44 +0200 Tue, 18 May 2020 11:32:19 +0200 KubeletReady kubelet is posting ready status
...
So our machine has disk pressure. Time to get to work, huh?!
Disk pressure
This means that the node does not have enough disk space left for what the kubelet and its workloads need. First, let's check the status of the kubelet.
[devops@dev.compute.internal ~]# systemctl status kubelet -l
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2020-02-10 10:39:03 CAT; 3 months 8 days ago
Docs: https://kubernetes.io/docs/
Main PID: 154232 (kubelet)
Tasks: 118
Memory: 170.4M
CGroup: /system.slice/kubelet.service
└─154232 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.2 --cgroup-driver=systemd
May 19 11:28:43 dev.compute.internal kubelet[18808]: W0803 17:02:43.866083 18808 eviction_manager.go:142] Failed to admit pod kube-proxy-z46zm_kube-system(ea8815fc-96fb-11e8-80c4-fa163eb0739b) - node has conditions: [DiskPressure]
May 19 11:28:44 dev.compute.internal kubelet[18808]: W0803 17:02:44.466877 18808 eviction_manager.go:142] Failed to admit pod nvidia-device-plugin-daemonset-ldwsr_kube-system(eae3b613-96fb-11e8-80c4-fa163eb0739b) - node has conditions: [DiskPressure]
May 19 11:28:45 dev.compute.internal kubelet[18808]: W0803 17:02:45.069755 18808 eviction_manager.go:142] Failed to admit pod calico-node-fjd28_kube-system(eb3fac9b-96fb-11e8-80c4-fa163eb0739b) - node has conditions: [DiskPressure]
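The kubelet reports DiskPressure when available disk space or inodes on the node drop below its eviction thresholds, which live in the kubelet configuration (/var/lib/kubelet/config.yaml on this node). The values below are the documented defaults; we had not tuned anything, so something like this is what applied here:
# Excerpt (assumed, based on the defaults) from a KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"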
My first instinct was to remove all unused local volumes and data used by Docker: since this is a dev environment and we deploy every time someone merges to master, it must have accumulated a lot of unused images. However,
docker volume prune
and docker system prune -a
did not help much.
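For reference, a quick way to see how much space Docker is actually holding (and whether pruning is likely to help) is:
docker system df      # summary of images, containers, local volumes and build cache
docker system df -v   # verbose, per-image and per-volume breakdown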
Then I checked the disk usage on the node.
[devops@dev.compute.internal ~]# df -H
Filesystem Size Used Avail Use% Mounted on
devtmpfs 34G 0 34G 0% /dev
tmpfs 34G 0 34G 0% /dev/shm
tmpfs 34G 43M 34G 1% /run
tmpfs 34G 0 34G 0% /sys/fs/cgroup
/dev/mapper/centos-root 50G 49G 100M 99% /
/dev/sda2 1.1G 283M 781M 27% /boot
/dev/sda1 210M 12M 198M 6% /boot/efi
/dev/mapper/centos-home 1.1T 41M 1.1T 0% /home
Voilà, it looks like our root filesystem is running out of space. The next step is to reallocate some of the space assigned to /home back to /.
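Before resizing anything, it can help to confirm what is actually consuming the root filesystem (the paths below are the usual suspects on a Kubernetes node; adjust to your setup):
# Largest top-level directories on the root filesystem (-x stays on one filesystem)
du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -15
du -sh /var/lib/docker /var/lib/kubelet 2>/dev/null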
Resize LVM Partition
Sometimes when creating a new server, the drive is partitioned into root, boot and swap, and then all the rest of the space is given to the home directory.
Here, we are going to reduce the size of the /home partition and allocate the remaining space back to the root partition. These are the steps I took:
- Back up home. Back up all the content of the home directory, as we are going to recreate the whole directory.
tar -czvf /root/home-backup.tgz -C /home .
- Reduce the size of the /home partition:
- Unmount home.
umount /dev/mapper/centos-home
- Remove the home logical volume.
lvremove /dev/mapper/centos-home
- Recreate a new 100GB logical volume for /home.
lvcreate -L 100G -n home centos
- Format the home logical volume.
mkfs.xfs /dev/centos/home
- Mount it back.
mount /dev/mapper/centos-home /home
- Extend the root logical volume with all of the remaining free space and resize the filesystem.
lvextend -r -l +100%FREE /dev/mapper/centos-root
- Restore the contents of the /home directory.
tar -xzvf /root/home-backup.tgz -C /home
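To double-check that LVM picked up the new layout, you can list the volume group and logical volumes:
vgs   # free space left in the centos volume group (should now be close to zero)
lvs   # new sizes of the root and home logical volumes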
Now, if you check the filesystems on the disk again:
[devops@dev.compute.internal ~]# df -H
Filesystem Size Used Avail Use% Mounted on
devtmpfs 34G 0 34G 0% /dev
tmpfs 34G 0 34G 0% /dev/shm
tmpfs 34G 43M 34G 1% /run
tmpfs 34G 0 34G 0% /sys/fs/cgroup
/dev/mapper/centos-root 628G 48G 581G 8% /
/dev/sda2 1.1G 283M 781M 27% /boot
/dev/sda1 210M 12M 198M 6% /boot/efi
/dev/mapper/centos-home 537G 41M 537G 1% /home
tmpfs 6.8G 13k 6.8G 1% /run/user/42
After that I restarted the kubelet and checked its status again. I did not have to, but one can't be too careful :smile:
[devops@dev.compute.internal ~]# systemctl status kubelet -l
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2020-02-10 10:39:03 CAT; 3 months 8 days ago
Docs: https://kubernetes.io/docs/
Main PID: 154232 (kubelet)
Tasks: 118
Memory: 170.4M
CGroup: /system.slice/kubelet.service
└─154232 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.2 --cgroup-driver=systemd
May 19 12:57:55 dev.compute.internal kubelet[14864]: W0519 11:57:55.317331 14864 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/b36e8f1c-2229-4bb9-9c73-8c54442a2b62/volumes/kubernetes.io~secret/config-volume and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
May 19 12:57:55 dev.compute.internal kubelet[14864]: W0519 11:57:55.317571 14864 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/b36e8f1c-2229-4bb9-9c73-8c54442a2b62/volumes/kubernetes.io~secret/prometheus-operator-alertmanager-token-k4xtz and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
May 19 12:58:15 dev.compute.internal kubelet[14864]: W0519 11:58:15.384272 14864 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/749d9e1e-f588-4a66-92b8-7acddc7cd28f/volumes/kubernetes.io~secret/tls-assets and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
May 19 12:58:15 dev.compute.internal kubelet[14864]: W0519 11:58:15.384328 14864 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/749d9e1e-f588-4a66-92b8-7acddc7cd28f/volumes/kubernetes.io~secret/config and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
May 19 12:58:15 dev.compute.internal kubelet[14864]: W0519 11:58:15.384500 14864 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/a1396204-cbe1-46b3-9007-eb30c806f959/volumes/kubernetes.io~secret/prometheus-operator-grafana-token-ljflf and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
All the pending pods started running again. I just had to delete all the Evicted pods:
kubectl get po -A |grep Evicted|awk '{print "kubectl delete po -n ",$1,$2}'|bash -x
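Evicted pods sit in the Failed phase, so if your kubectl version supports field selectors, a simpler (if slightly broader, since it deletes all Failed pods) alternative is:
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed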
I will probably write about how node taints can work in your favour: how setting them yourself, apart from the built-in ones, can help you manage your resources.