Worker Node Failure

Check the status of the Nodes. Are they reported as Ready or Not Ready:

kubectl get nodes

If reported as Not Ready, run the kubectl describe node <node_name> command. This can help point towards why a node fails.

Depending on the status of the kubectl describe command, it is either set to True, False or Unknown.
When a node is out of disk space, the OutOfDisk flag is set to true.
When a node is out of memory, the MemoryPressure flag is set to true.
When disk capacity is low, the DiskPressure flag is set to true.
When there are too many processes, the PIDPressure flag is set to true.
If the node as a whole is healthy, the Ready flag is set to true.

When a worker node stops communicating with the Master Node, either due to a crash or other reason, the above statuses are set to Unknown

  * Can indicate possible loss of a node. Check the `LastHeartbeatTime` field, for the time the node may have crashed.
        
      * Check if the worker node is online at all or if it has crashed.
            
          * If it is crashed, bring it back up.
                
              * Check for CPU, memory and disk issues with tools like `top` and `df -h`.
                    
              * Check the status of the `kubelet`. Check the `kubelet` logs for possible issues, using commands such as `service kubelet status` and `sudo journalctl -u kubelet`.
                    
              * Check the `kubelet` certificates and ensure they are not expired. 
                    
                  * Check that the certificates are part of the right group and that the certificates are part of the right CA.
                        
                      `openssl x509 -in /var/lib/kubelet/worker-1.crt -text`
                            
                          * Check the `Validity` section and `Issuer`.