
05. Cluster Maintenance - Backup and Restore Methods

Cluster Maintenance

1) OS Upgrades
2) Cluster Upgrade Process
3) Backup and Restore Methods

 

01. We have a working kubernetes cluster with a set of applications running. Let us first explore the setup.

How many deployments exist in the cluster?

 

Answer: 2

 

root@controlplane:~# kubectl get deployments
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
blue   3/3     3            3           105s
red    2/2     2            2           105s
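
If you also want to record the Services that exist before the maintenance window (handy for comparing against the restored state later), a quick check could be:

kubectl get deployments,services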

 

02. What is the version of ETCD running on the cluster?

Check the ETCD Pod or Process

Answer: v3.4.13

# Hint

Look at the ETCD Logs OR check the image used by ETCD pod.

# Solution
Look at the ETCD Logs using the command 

kubectl logs etcd-controlplane -n kube-system

or check the image used by the ETCD pod: 

kubectl describe pod etcd-controlplane -n kube-system
 
root@controlplane:~# kubectl describe pod etcd-controlplane -n kube-system
Name:                 etcd-controlplane
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 controlplane/10.67.217.9
Start Time:           Mon, 24 Jan 2022 04:53:00 +0000
Labels:               component=etcd
                      tier=control-plane
Annotations:          kubeadm.kubernetes.io/etcd.advertise-client-urls: https://10.67.217.9:2379
                      kubernetes.io/config.hash: 985502623bdbef6ebebf0be608405ef3
                      kubernetes.io/config.mirror: 985502623bdbef6ebebf0be608405ef3
                      kubernetes.io/config.seen: 2022-01-24T04:52:57.846251613Z
                      kubernetes.io/config.source: file
Status:               Running
IP:                   10.67.217.9
IPs:
  IP:           10.67.217.9
Controlled By:  Node/controlplane
Containers:
  etcd:
    Container ID:  docker://b144d4942e8650ac111a2a5d9c6e3f4ea70c6fb853b748e89bc8e965a7d0ed4d
    Image:         k8s.gcr.io/etcd:3.4.13-0
    Image ID:      docker-pullable://k8s.gcr.io/etcd@sha256:4ad90a11b55313b182afc186b9876c8e891531b8db4c9bf1541953021618d0e2
    Port:          <none>
    Host Port:     <none>
    Command:
      etcd
      --advertise-client-urls=https://10.67.217.9:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://10.67.217.9:2380
      --initial-cluster=controlplane=https://10.67.217.9:2380
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://10.67.217.9:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://10.67.217.9:2380
      --name=controlplane
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    State:          Running
      Started:      Mon, 24 Jan 2022 04:52:40 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:                100m
      ephemeral-storage:  100Mi
      memory:             100Mi
    Liveness:             http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=8
    Startup:              http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=24
    Environment:          <none>
    Mounts:
      /etc/kubernetes/pki/etcd from etcd-certs (rw)
      /var/lib/etcd from etcd-data (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  etcd-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/pki/etcd
    HostPathType:  DirectoryOrCreate
  etcd-data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/etcd
    HostPathType:  DirectoryOrCreate
QoS Class:         Burstable
Node-Selectors:    <none>
Tolerations:       :NoExecute op=Exists
Events:            <none>
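
Instead of scanning the full describe output, the etcd image tag (which carries the version) can also be pulled out directly with jsonpath, for example:

kubectl -n kube-system get pod etcd-controlplane -o jsonpath='{.spec.containers[0].image}'
# prints k8s.gcr.io/etcd:3.4.13-0 on this cluster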

 

 

03. At what address can you reach the ETCD cluster from the controlplane node?

Answer: 127.0.0.1:2379
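
The client address can be read straight from the etcd command-line flags shown in the describe output above, for example:

kubectl -n kube-system describe pod etcd-controlplane | grep listen-client-urls
# --listen-client-urls=https://127.0.0.1:2379,https://10.67.217.9:2379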

 

04. Where is the ETCD server certificate file located?

Note this path down as you will need to use it later

# Answer
/etc/kubernetes/pki/etcd/server.crt

# hint
--cert-file=/etc/kubernetes/pki/etcd/server.crt
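
The same path can also be confirmed from the static pod manifest on the controlplane node, for example:

grep cert-file /etc/kubernetes/manifests/etcd.yaml
# - --cert-file=/etc/kubernetes/pki/etcd/server.crt
# - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt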
 

 

05. Where is the ETCD CA Certificate file located?

Note this path down as you will need to use it later.

 

Answer:

/etc/kubernetes/pki/etcd/ca.crt

hint

--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
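
Likewise for the CA certificate, for example:

grep trusted-ca-file /etc/kubernetes/manifests/etcd.yaml
# - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
# - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt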

 

06. The master nodes in our cluster are planned for a regular maintenance reboot tonight.

While we do not anticipate anything to go wrong, we are required to take the necessary backups.

Take a snapshot of the ETCD database using the built-in snapshot functionality.

 

 

Store the backup file at location /opt/snapshot-pre-boot.db

 

  • Backup ETCD to /opt/snapshot-pre-boot.db
# Backup

ETCDCTL_API=3 etcdctl snapshot save /opt/snapshot-pre-boot.db \
--endpoints=https://127.0.0.1:2379 \
--cacert="/etc/kubernetes/pki/etcd/ca.crt" \
--cert="/etc/kubernetes/pki/etcd/server.crt" \
--key="/etc/kubernetes/pki/etcd/server.key"


# hint
Use the etcdctl snapshot save command. You will have to make use of additional flags to connect to the ETCD server.

--endpoints: Optional flag, points to the address where ETCD is running (127.0.0.1:2379)
--cacert: Mandatory flag (absolute path to the CA certificate file)
--cert: Mandatory flag (absolute path to the server certificate file)
--key: Mandatory flag (absolute path to the key file)

# solution
root@controlplane:~# ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /opt/snapshot-pre-boot.db


Snapshot saved at /opt/snapshot-pre-boot.db


# my answer
root@controlplane:~# ETCDCTL_API=3 etcdctl snapshot save /opt/snapshot-pre-boot.db \
> --endpoints=https://127.0.0.1:2379 \
> --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
> --cert="/etc/kubernetes/pki/etcd/server.crt" \
> --key="/etc/kubernetes/pki/etcd/server.key"
Snapshot saved at /opt/snapshot-pre-boot.db
root@controlplane:~#
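
Before relying on the backup, the snapshot file can be inspected with the built-in status command (the hash, revision, key count and size will of course vary per cluster):

ETCDCTL_API=3 etcdctl snapshot status /opt/snapshot-pre-boot.db -w table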

 

07. Great!

Let us now wait for the maintenance window to finish. Go get some sleep. (Don't go for real)

Ok

 

It's about 2 AM! You get a call!

 

08. Wake up! We have a conference call!

After the reboot the master nodes came back online, but none of our applications are accessible.

Check the status of the applications on the cluster. What's wrong?

1) Deployments are not present
2) Services are not present
3) All of the above
4) Pods are not present

 

# hint

Are you able to see any deployments, pods or services in the default namespace?

Answer: 3) All of the above
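
A quick way to verify, for example:

kubectl get deployments,pods,services
# only the default kubernetes Service is expected to remain; the blue/red deployments and their pods are gone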

 

09. Luckily we took a backup. Restore the original state of the cluster using the backup file.


  • Deployments: 2
  • Services: 3
# hint

Restore etcd to a new directory from the snapshot by using the etcdctl snapshot restore command. Once the directory is restored, update the ETCD configuration to use the restored directory.

General form (note that --data-dir must point to a new directory, not the live /var/lib/etcd):

sudo ETCDCTL_API=3 etcdctl snapshot restore \
/home/cloud_user/etcd_backup.db \
--data-dir="/var/lib/etcd-from-backup" \
--initial-advertise-peer-urls="http://localhost:2380" \
--initial-cluster="default=http://localhost:2380"

 

root@controlplane:~# ETCDCTL_API=3 etcdctl  --data-dir /var/lib/etcd-from-backup \
snapshot restore /opt/snapshot-pre-boot.db


2021-03-25 23:52:59.608547 I | mvcc: restore compact to 6466
2021-03-25 23:52:59.621400 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
root@controlplane:~# 
Note: In this case, we are restoring the snapshot to a different directory but on the same server where we took the backup (the controlplane node). As a result, the only required option for the restore command is --data-dir.
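
A simple sanity check that the restore produced the new data directory:

ls /var/lib/etcd-from-backup/member
# should show the restored "snap" and "wal" subdirectories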


Next, update the /etc/kubernetes/manifests/etcd.yaml:

We have now restored the etcd snapshot to a new path on the controlplane (/var/lib/etcd-from-backup), so the only change to be made in the YAML file is to change the hostPath for the volume called etcd-data from the old directory (/var/lib/etcd) to the new directory (/var/lib/etcd-from-backup).

  volumes:
  - hostPath:
      path: /var/lib/etcd-from-backup
      type: DirectoryOrCreate
    name: etcd-data
    
    
With this change, /var/lib/etcd on the container points to /var/lib/etcd-from-backup on the controlplane (which is what we want).

When this file is updated, the ETCD pod is automatically re-created, as this is a static pod placed under the /etc/kubernetes/manifests directory.

Note: as the ETCD pod has changed, it will automatically restart, and so will kube-controller-manager and kube-scheduler. Wait 1-2 minutes for these pods to restart. You can run a watch "docker ps | grep etcd" command to see when the ETCD pod is restarted.

Note2: If the etcd pod does not reach Ready 1/1, restart it with kubectl delete pod -n kube-system etcd-controlplane and wait a minute.

Note3: This is the simplest way to make sure that ETCD uses the restored data after the ETCD pod is recreated. You don't have to change anything else. If you do change --data-dir to /var/lib/etcd-from-backup in the YAML file, make sure that the volumeMounts for etcd-data is updated as well, with the mountPath pointing to /var/lib/etcd-from-backup. (THIS COMPLETE STEP IS OPTIONAL AND NEED NOT BE DONE FOR COMPLETING THE RESTORE)
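
On clusters where the container runtime is containerd rather than Docker, a similar check could be run with crictl (assuming it is installed on the node):

watch "crictl ps | grep etcd"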

Bookmark

https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/

Operating etcd clusters for Kubernetes: etcd is a consistent and highly-available key value store used as Kubernetes' backing store for all cluster data. If your Kubernetes cluster uses etcd as its backing store, make sure you have a back up plan for those data.