Worker Node Failure
- Check the status of the nodes. Are they reported as `Ready` or `NotReady`?

  `kubectl get nodes`
- If a node is reported as `NotReady`, run the `kubectl describe node <node_name>` command. Its output can help point towards why the node is failing.
- In the `Conditions` section of the `kubectl describe` output, the status of each condition is set to `True`, `False` or `Unknown`.
- When a node is out of disk space, the `OutOfDisk` flag is set to `True`.
- When a node is running low on memory, the `MemoryPressure` flag is set to `True`.
- When disk capacity is low, the `DiskPressure` flag is set to `True`.
- When there are too many processes, the `PIDPressure` flag is set to `True`.
- If the node as a whole is healthy, the `Ready` flag is set to `True`.
- When a worker node stops communicating with the master node, whether because of a crash or for some other reason, the above statuses are set to `Unknown`.
  * This can indicate the possible loss of a node. Check the `LastHeartbeatTime` field for the time the node may have crashed.
  * Check whether the worker node is online at all or has crashed.
  * If it has crashed, bring it back up.
  * Check for CPU, memory and disk issues with tools like `top` and `df -h`.
  * Check the status of the `kubelet`, and check the `kubelet` logs for possible issues, using commands such as `service kubelet status` and `sudo journalctl -u kubelet`.
  * Check the `kubelet` certificates and ensure they have not expired.
  * Check that the certificates are part of the right group and were issued by the right CA: `openssl x509 -in /var/lib/kubelet/worker-1.crt -text`
  * Check the `Validity` section and the `Issuer`.
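On a cluster with many nodes, the first check can be narrowed down to the nodes that need attention. The following is a minimal sketch, not a definitive tool: `not_ready_nodes` is a hypothetical helper name, and it assumes the usual column layout of `kubectl get nodes --no-headers` (node name first, status second).

```shell
# Print the names of nodes whose status is not Ready.
# Expects the output of `kubectl get nodes --no-headers` on stdin,
# where column 1 is the node name and column 2 is the status.
# A cordoned node shows "Ready,SchedulingDisabled"; only the part
# before the first comma is compared, so such nodes are not reported.
not_ready_nodes() {
  awk '{
    split($2, parts, ",")
    if (parts[1] != "Ready") print $1
  }'
}
```

It could be used as `kubectl get nodes --no-headers | not_ready_nodes`, and each reported name can then be passed to `kubectl describe node`.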
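The condition flags described above can also be read in a compact form instead of scanning the full `kubectl describe` output. The sketch below assumes the conditions are first printed as `Type=Status` lines, one per condition; `flag_conditions` is a hypothetical helper name. It reports any pressure-style condition that is not `False`, and reports `Ready` when it is anything other than `True`, so a status of `Unknown` is always surfaced.

```shell
# Read "Type=Status" lines on stdin (for example "DiskPressure=True")
# and print only the conditions that suggest a problem:
#   - Ready is healthy only when True
#   - every other condition is healthy only when False
flag_conditions() {
  awk -F= '
    NF < 2 { next }                                   # skip blank or malformed lines
    $1 == "Ready" && $2 != "True"  { print $1 " is " $2; next }
    $1 != "Ready" && $2 != "False" { print $1 " is " $2 }
  '
}
```

One way to produce the expected input is `kubectl get node <node_name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | flag_conditions`. If nothing is printed, none of the reported conditions indicates a problem.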