CKA (Certified Kubernetes Administrator)/Kode Kloud

10. Troubleshooting - Worker Node Failure

seulseul 2022. 2. 4. 11:46

01. Application Failure
02. Control Plane Failure
03. Worker Node Failure
04. Troubleshoot Network

01. Fix the broken cluster

  • Fix node01
# hint

Step 1. Check the status of the services on the node.
Step 2. Check the service logs using:

journalctl -u kubelet

Step 3. If a service is stopped, start it.

Alternatively, run the commands:

# ssh node01

# service kubelet start


Step 1: Check the status of the nodes:

root@controlplane:~# kubectl get nodes
NAME           STATUS     ROLES                  AGE     VERSION
controlplane   Ready      control-plane,master   6m38s   v1.20.0
node01         NotReady   <none>                 4m59s   v1.20.0
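On a larger cluster, the unhealthy nodes can be filtered out of this listing rather than spotted by eye. The following is only a sketch: the function name not_ready_nodes is made up for illustration, and it assumes the default column layout shown above, with STATUS in the second column.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column is
# anything other than exactly "Ready". Reads `kubectl get nodes` output
# on stdin and skips the header row.
# Note: a cordoned node ("Ready,SchedulingDisabled") is also printed.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage on a live cluster:
#   kubectl get nodes | not_ready_nodes
```

With the listing above as input, this would print node01.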

Step 2: SSH to node01 and check the status of the container runtime (docker, in this case) and the kubelet service.

root@node01:~# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
   Active: inactive (dead) since Sun 2021-07-25 07:46:58 UTC; 5min ago
 Main PID: 1917 (code=exited, status=0/SUCCESS)

Since the kubelet is not running, attempt to start it by running:

root@node01:~# systemctl start kubelet
root@node01:~# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
   Active: active (running) since Sun 2021-07-25 07:53:35 UTC; 2s ago

node01 should now return to the Ready state.
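The check-then-start sequence above can be wrapped in a small function. This is a sketch rather than part of the lab solution: ensure_service_running is a made-up name, and it relies only on `systemctl is-active --quiet` and `systemctl start`.

```shell
# Hypothetical helper: start a systemd service only if it is not already
# active. $1 is the service name, e.g. kubelet.
ensure_service_running() {
  svc="$1"
  if systemctl is-active --quiet "$svc"; then
    echo "$svc already running"
  else
    # Only report success if the start command itself succeeded.
    systemctl start "$svc" && echo "$svc started"
  fi
}

# Usage on node01 (as root):
#   ensure_service_running kubelet
#   ensure_service_running docker
```

If the service stops again shortly after being started, the logs need to be checked, since starting it does not address the underlying cause.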


02. The cluster is broken again. Investigate and fix the issue.

  • Fix cluster
# hint

journalctl -u kubelet -f


The kubelet has stopped running on node01 again. Since this is a systemd-managed system, we can check the kubelet log by running journalctl.

Here is a snippet showing the error with kubelet:

root@node01:~# journalctl -u kubelet 
Jul 25 07:54:50 node01 kubelet[5681]: F0725 07:54:50.831238    5681 server.go:257] unable to load client CA file /etc/kubernetes/pki/WRONG-CA-FILE.crt: open /etc/kubernetes/pki/WRONG-CA-FILE.crt: no such file or directory

Jul 25 07:55:01 node01 kubelet[5710]: F0725 07:55:01.339531    5710 server.go:257]
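In a long journal, the fatal entries are easy to miss. Kubelet log lines start with a severity letter (I, W, E or F) followed by the month and day, so lines such as the F0725 entries can be filtered out. A minimal sketch, with kubelet_fatal_lines as a made-up name:

```shell
# Hypothetical helper: keep only fatal kubelet entries from journal lines
# on stdin. Matches "kubelet[PID]: F" followed by a four-digit MMDD stamp.
kubelet_fatal_lines() {
  grep -E 'kubelet\[[0-9]+\]: F[0-9]{4} '
}

# Usage on node01:
#   journalctl -u kubelet --no-pager | kubelet_fatal_lines | tail -n 5
```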


The path to the CA certificate in the kubelet configuration appears
to be wrong: the file named in the log does not exist. This can be
corrected by updating the file /var/lib/kubelet/config.yaml.

Once this is fixed, restart the kubelet service
(as we did in the previous question) and node01 should
return to a working state.
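The edit can also be scripted. The sketch below makes two assumptions that the log does not confirm: that the wrong path sits in the clientCAFile field (under authentication.x509) of the kubelet config, and that the correct certificate is the kubeadm default, /etc/kubernetes/pki/ca.crt. Verify both on the node before changing anything. The function name set_client_ca_file is made up for illustration.

```shell
# Hypothetical helper: rewrite the clientCAFile entry of a kubelet config.
# $1: config file, $2: CA certificate path to set (assumed to be correct).
set_client_ca_file() {
  cfg="$1"
  ca="$2"
  tmp="$(mktemp)" || return 1
  # Keep the indentation and key, replace only the value.
  if sed "s|^\([[:space:]]*clientCAFile:\).*|\1 $ca|" "$cfg" > "$tmp"; then
    # Write back in place so ownership and permissions are preserved.
    cat "$tmp" > "$cfg"
    rc=$?
  else
    rc=1
  fi
  rm -f "$tmp"
  return "$rc"
}

# Usage on node01 (as root), after confirming the certificate exists:
#   ls -l /etc/kubernetes/pki/ca.crt
#   set_client_ca_file /var/lib/kubelet/config.yaml /etc/kubernetes/pki/ca.crt
#   systemctl restart kubelet
```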


03. The cluster is broken again. Investigate and fix the issue.

  • Fix Cluster

# hint

Check the kubelet.conf file at /etc/kubernetes/kubelet.conf.


Once again the kubelet service has stopped working. Checking the logs, we can see that this time, it is not able to reach the kube-apiserver.

root@node01:~# journalctl -u kubelet 
Jul 25 08:05:26 node01 kubelet[7966]: E0725 08:05:26.426155    7966 reflector.go:138] Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://controlplane:6553/api/v1/pods?fieldSelector=spec.nodeName%3Dnode01&limit=500&resourceVersion=0": dial tcp connect: connection refused

The log shows the kubelet trying to connect to the API server on the controlplane node on port 6553. This is incorrect: the kube-apiserver listens on port 6443 by default, so the connection is refused.
To fix this, correct the port in the kubeconfig file used by the kubelet (/etc/kubernetes/kubelet.conf).

apiVersion: v1
clusters:
- cluster:
    server: https://controlplane:6443

Restart the kubelet after this change.
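To confirm which port the kubelet is configured to use, before and after the edit, the server line of the kubeconfig can be inspected. A minimal sketch, with kubeconfig_server_port as a made-up name; it assumes the server URL carries an explicit port, as in the file above.

```shell
# Hypothetical helper: print the port from the first "server:" line of a
# kubeconfig file. $1: path to the kubeconfig.
kubeconfig_server_port() {
  sed -n 's|^[[:space:]]*server:[[:space:]]*https\{0,1\}://[^:/]*:\([0-9]*\).*|\1|p' "$1" | head -n 1
}

# Usage on node01 (as root):
#   kubeconfig_server_port /etc/kubernetes/kubelet.conf
#   sed -i 's|\(server: https://controlplane\):6553|\1:6443|' /etc/kubernetes/kubelet.conf
#   systemctl restart kubelet
```

On the broken node the first command would print 6553, and 6443 once the file has been corrected.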