Scheduled etcd backups on OpenShift
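If two of the three master nodes fail, etcd has to be restored. To be prepared for that, OpenShift's etcd needs to be backed up on a regular schedule. [1]
By default, OpenShift provides a wrapper shell script for etcd backups on the master nodes.
The Red Hat blog posts below also describe how to implement scheduled backups. Those posts use a CronJob. [2][3]
In this case, batch execution is driven by JP1, so the backup had to be a Job resource rather than a CronJob, and I customized the implementation accordingly.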
oc debug --as-root node/<master node>
chroot /host
/usr/local/bin/cluster-backup.sh /home/core/assets/backup
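As background for the two-out-of-three failure scenario in the introduction: etcd needs a majority of its members (quorum) to keep accepting writes. The following is a minimal sketch of that arithmetic. The function names are illustrative and are not part of any OpenShift tooling.

```shell
#!/usr/bin/env bash
# Quorum arithmetic for an etcd cluster (illustrative helper names).

# Members needed for a majority: floor(n/2) + 1
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while the cluster still has quorum
tolerated_failures() {
  echo $(( $1 - $(quorum "$1") ))
}

# For the 3-master layout in this article:
echo "quorum=$(quorum 3) tolerated_failures=$(tolerated_failures 3)"
# prints: quorum=2 tolerated_failures=1
```

With three members, one failure is tolerated. Losing two drops the cluster below quorum, which is the case where a restore from backup is needed.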
[1]: Backing up etcd data - Control plane backup and restore | Backup and restore | OpenShift Container Platform 4.16
[2]: OCP Disaster Recovery Part 1 - How to Create Automated ETCD Backup in Openshift 4.x
[3]: OCP Disaster Recovery Part 4 - How to GitOps-ify Automated etcd Backups to a PersistentVolume in OpenShift 4.x
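Procedure
Six resources are deployed. As a prerequisite, the storage that will hold the backups must already exist. Here, my home NAS is attached over NFS.
Because the Job runs the wrapper shell script on the node, the service account is also granted additional privileges.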
- ServiceAccount
- ClusterRole
- ClusterRoleBinding
- PersistentVolume
- PersistentVolumeClaim
- Job
The overall flow is as follows.
- Create the Namespace
- Create the ServiceAccount, ClusterRole, and ClusterRoleBinding
- Grant privileges to the ServiceAccount
- Create the PV and PVC
- Deploy the Job
OpenShift has a resource type called Template. Three templates are deployed here.
- ocp-etcd-backup-serviceaccounts.yaml: deploys the service account and roles
- ocp-etcd-backup-job-pv.yaml: defines the PV for storing backups
- ocp-etcd-backup-job-template.yaml: the Job template
Creating the Namespace
Create a separate Namespace to hold the resources.
oc new-project ocp-etcd-backup --description "Openshift Backup Automation Tool" --display-name "Backup ETCD Automation"
Creating the Service Account
Create the service account used by the backup container from the Template resource.
oc apply -f ocp-etcd-backup-serviceaccounts.yaml -n ocp-etcd-backup
oc process ocp-etcd-backup-serviceaccounts -p NAMESPACE=ocp-etcd-backup | oc apply -f -
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: ocp-etcd-backup-serviceaccounts
  annotations:
    description: "Backup Jobs Templates"
    iconClass: "icon-openshift"
objects:
  - kind: ServiceAccount
    apiVersion: v1
    metadata:
      name: openshift-backup
      namespace: ${NAMESPACE}
      labels:
        app: openshift-backup
  - apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: cluster-etcd-backup
    rules:
      - apiGroups: [""]
        resources:
          - "nodes"
        verbs: ["get", "list"]
      - apiGroups: [""]
        resources:
          - "pods"
          - "pods/log"
        verbs: ["get", "list", "create", "delete", "watch"]
  - kind: ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: openshift-backup
      labels:
        app: openshift-backup
    subjects:
      - kind: ServiceAccount
        name: openshift-backup
        namespace: ${NAMESPACE}
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-etcd-backup
parameters:
  - name: NAMESPACE
    description: NAMESPACE used for deployment jobs
    value: ocp-etcd-backup
    required: true
Granting privileges to the ServiceAccount
The Job needs to reach the script on the master node, so grant the privileged SCC to the service account.
oc adm policy add-scc-to-user privileged -z openshift-backup
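Creating the PV and PVC
Create a PV and PVC so that the snapshots are stored on the external storage. They are deployed as ReadWriteMany so that other containers in the same Namespace can also use the volume.
As before, create them from a template.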
oc apply -f ocp-etcd-backup-job-pv.yaml -n ocp-etcd-backup
oc process ocp-etcd-backup-job-pv -p NAMESPACE=ocp-etcd-backup -p VOLUME_NAME=openshift-backup -p STORAGE_SERVER=<storage hostname or IP> -p PVC_STORAGE=100Gi | oc apply -f -
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: ocp-etcd-backup-job-pv
  annotations:
    description: "Backup Jobs pv Templates"
    iconClass: "icon-openshift"
objects:
  - apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: ${VOLUME_NAME}
      namespace: ${NAMESPACE}
    spec:
      capacity:
        storage: ${PVC_STORAGE}
      accessModes:
        - ReadWriteMany
      nfs:
        path: /mnt/share/etcd-backup
        server: ${STORAGE_SERVER}
      persistentVolumeReclaimPolicy: Retain
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: ${VOLUME_NAME}
      namespace: ${NAMESPACE}
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          # Use the same parameter as the PV capacity so the claim can bind
          storage: ${PVC_STORAGE}
      volumeName: ${VOLUME_NAME}
      storageClassName: ""
parameters:
  - name: NAMESPACE
    description: NAMESPACE used for deployment jobs
    value: ocp-etcd-backup
    required: true
  - name: VOLUME_NAME
    description: VOLUME NAME used for deployment jobs
    value: openshift-backup
    required: true
  - name: PVC_STORAGE
    description: Volume used for deployment jobs
    value: 100Gi
    required: true
  - name: STORAGE_SERVER
    description: Volume hostname or IP address used for deployment jobs
    required: true
Once they are created, check the PV and PVC.
oc get pv
oc get pvc -n ocp-etcd-backup
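When the steps are scripted, the manual check above can be replaced by polling the claim's phase. This is a rough sketch, not part of the original procedure: the function name, the retry count, and the `OC` override variable are all assumptions made for illustration. It relies only on `oc get ... -o jsonpath` and the PVC's `.status.phase` field.

```shell
#!/usr/bin/env bash
# OC can be overridden with a stub for testing; defaults to the real CLI.
OC="${OC:-oc}"

# Poll until the PVC reports phase "Bound", or give up after N tries.
wait_for_pvc_bound() {
  local pvc="$1" ns="$2" tries="${3:-30}" phase i
  for (( i = 0; i < tries; i++ )); do
    phase="$("$OC" get pvc "$pvc" -n "$ns" -o jsonpath='{.status.phase}')"
    [ "$phase" = "Bound" ] && return 0
    sleep 2
  done
  echo "PVC $pvc in $ns is not Bound (last phase: ${phase:-unknown})" >&2
  return 1
}
```

For the names used in this article, the call would be `wait_for_pvc_bound openshift-backup ocp-etcd-backup`.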
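Creating the Job from the template
Create the Job resource from the template and check the result.
When this is wrapped in a shell script, set the Job name to something like the date so that each run gets a unique name.
ttlSecondsAfterFinished is also set so that the Job's logs remain viewable for a while after it finishes. The value in the template, 1209600 seconds, is 14 days.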
oc apply -f ocp-etcd-backup-job-template.yaml -n ocp-etcd-backup
oc process ocp-etcd-backup-job-template -p NAMESPACE=ocp-etcd-backup -p SERVICE_ACCOUNT_NAME=openshift-backup -p JOB_NAME=ocp-etcd-backup-20250119 -p PVC_NAME=openshift-backup | oc apply -f -
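The date-based Job name mentioned above could be generated in a wrapper script along these lines. This is a sketch under assumptions: the function names and the `OC` override are hypothetical, and the default stamp includes the time so that two runs on the same day do not collide. The template and parameter names are the ones used in this article.

```shell
#!/usr/bin/env bash
# OC can be overridden with a stub for testing; defaults to the real CLI.
OC="${OC:-oc}"

# Build "<prefix>-<stamp>"; the stamp defaults to the current date and time.
job_name() {
  local prefix="$1" stamp="${2:-$(date +%Y%m%d-%H%M%S)}"
  printf '%s-%s\n' "$prefix" "$stamp"
}

# Render the Job template with a unique JOB_NAME, apply it,
# and print the generated name on the last line of output.
submit_backup_job() {
  local ns="ocp-etcd-backup" name
  name="$(job_name ocp-etcd-backup "${1:-}")"
  "$OC" process ocp-etcd-backup-job-template -n "$ns" \
    -p NAMESPACE="$ns" \
    -p SERVICE_ACCOUNT_NAME=openshift-backup \
    -p JOB_NAME="$name" \
    -p PVC_NAME=openshift-backup \
    | "$OC" apply -n "$ns" -f -
  echo "$name"
}
```

Calling `submit_backup_job` with no argument uses the current timestamp. Passing a stamp such as `20250119` reproduces the name used in the command above.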
$ oc get pvc -n ocp-etcd-backup
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
openshift-backup Bound openshift-backup 100Gi RWX <unset> 3m14s
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: ocp-etcd-backup-job-template
  annotations:
    description: "Backup Jobs Templates"
    iconClass: "icon-openshift"
objects:
  - apiVersion: batch/v1
    kind: Job
    metadata:
      name: ${JOB_NAME}
      namespace: ${NAMESPACE}
    spec:
      parallelism: 1
      completions: 1
      activeDeadlineSeconds: 1800
      backoffLimit: 0
      # Keep the finished Job (and its logs) for 14 days
      ttlSecondsAfterFinished: 1209600
      template:
        metadata:
          labels:
            app: openshift-backup
        spec:
          nodeSelector:
            node-role.kubernetes.io/master: ''
          restartPolicy: Never
          activeDeadlineSeconds: 500
          serviceAccountName: ${SERVICE_ACCOUNT_NAME}
          hostPID: true
          hostNetwork: true
          enableServiceLinks: true
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          securityContext: {}
          containers:
            - resources: {}
              terminationMessagePath: /dev/termination-log
              name: openshift-backup
              command:
                - /bin/bash
                - '-c'
                - >-
                  echo -e '\n\n---\nCreate etcd backup local to master\n' &&
                  chroot /host /usr/local/bin/cluster-backup.sh /home/core/backup/ &&
                  echo -e '\n\n---\nCleanup old local etcd backups\n' &&
                  chroot /host find /home/core/backup/ -type f -mmin +"2" -delete &&
                  echo -e '\n\n---\nCopy etcd backup to persistent volume\n' &&
                  BACKUP_DIR=/mnt/backup/$(date "+%F_%H%M%S") &&
                  mkdir -pv "$BACKUP_DIR" &&
                  cp -v /host/home/core/backup/* "$BACKUP_DIR" &&
                  echo -e "\n\n---\nDelete persistent ETCD backups older then ${DAYS_TO_KEEP_PERSISTENT_ETCD_BACKUPS} days\n" &&
                  find /mnt/backup/* -type d -mtime +${DAYS_TO_KEEP_PERSISTENT_ETCD_BACKUPS} -exec rm -rv {} \; &&
                  echo -e '\n\n---\nList all etc backups\n' &&
                  ls -al /mnt/backup/*
              env:
                - name: DAYS_TO_KEEP_PERSISTENT_ETCD_BACKUPS
                  value: "45"
              securityContext:
                privileged: true
                runAsUser: 0
                capabilities:
                  add:
                    - SYS_CHROOT
              imagePullPolicy: Always
              volumeMounts:
                - name: backup
                  mountPath: /mnt/backup
                - name: host
                  mountPath: /host
              terminationMessagePolicy: File
              image: ${IMAGE_NAME}
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: ${PVC_NAME}
            - name: host
              hostPath:
                path: /
                type: Directory
          dnsPolicy: ClusterFirst
          tolerations:
            - key: node-role.kubernetes.io/master
parameters:
  - name: NAMESPACE
    description: NAMESPACE used for deployment jobs
    value: ocp-etcd-backup
    required: true
  - name: SERVICE_ACCOUNT_NAME
    description: SERVICEACCOUNT used for deployment jobs
    value: openshift-backup
    required: true
  - name: JOB_NAME
    description: JOB NAME used for deployment jobs
    value: openshift-backup
    required: true
  - name: PVC_NAME
    description: PVC used for deployment jobs
    value: openshift-backup
    required: true
  - name: IMAGE_NAME
    description: IMAGE used for deployment jobs
    value: registry.redhat.io/openshift4/ose-cli
    required: true
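The retention step in the Job command relies on `find -mtime +N`. Before pointing a delete at real backups, the rule can be tried as a dry run against throwaway directories. The sketch below is only an illustration: `list_expired` is a hypothetical helper, it uses `-mindepth`/`-maxdepth` instead of the `/mnt/backup/*` glob in the template, and `touch -d '... days ago'` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Dry run of the retention rule against throwaway directories.

# Print the top-level directories that the rule would delete; removes nothing.
list_expired() {
  local root="$1" days="$2"
  find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" | sort
}

demo_root="$(mktemp -d)"
mkdir "$demo_root/old" "$demo_root/recent"
touch -d '50 days ago' "$demo_root/old"      # GNU touch syntax
touch -d '10 days ago' "$demo_root/recent"

list_expired "$demo_root" 45   # prints only the path ending in /old

rm -rf "$demo_root"
```

Note that `-mtime +45` compares the age truncated to whole days, so a directory is matched only once it is at least 46 days old.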
Checking the result
After running it, confirm that the Job completed successfully.
$ oc get jobs -n ocp-etcd-backup
NAME COMPLETIONS DURATION AGE
ocp-etcd-backup-20250119 1/1 35s 45s
$ oc get pod -n ocp-etcd-backup
NAME READY STATUS RESTARTS AGE
ocp-etcd-backup-20250119-xc7ws 0/1 Completed 0 94s
$ oc logs pod/ocp-etcd-backup-20250119-xc7ws -n ocp-etcd-backup
---
Create etcd backup local to master
Certificate /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt is missing. Checking in different directory
Certificate /etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt found!
found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-47
found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-5
found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-7
found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-8
9ffab15f27c810e8afb3362002c29f90b6a823c694d10d5e840d16e2df814dac
etcdctl version: 3.5.13
API version: 3.5
{"level":"info","ts":"2025-01-19T13:27:29.546211Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/home/core/backup//snapshot_2025-01-19_132727.db.part"}
{"level":"info","ts":"2025-01-19T13:27:29.553943Z","logger":"client","caller":"v3@v3.5.13/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2025-01-19T13:27:29.554001Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://xxx.xx.xx.xx:2379"}
{"level":"info","ts":"2025-01-19T13:27:32.859437Z","logger":"client","caller":"v3@v3.5.13/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2025-01-19T13:27:33.473352Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://xxx.xx.xx.xx:2379","size":"598 MB","took":"3 seconds ago"}
{"level":"info","ts":"2025-01-19T13:27:33.473462Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/home/core/backup//snapshot_2025-01-19_132727.db"}
Snapshot saved at /home/core/backup//snapshot_2025-01-19_132727.db
{"hash":1173464372,"revision":93743565,"totalKey":63990,"totalSize":597880832}
snapshot db and kube resources are successfully saved to /home/core/backup/
---
Cleanup old local etcd backups
---
Copy etcd backup to persistent volume
mkdir: created directory '/mnt/backup/2025-01-19_132734'
'/host/home/core/backup/snapshot_2025-01-19_132727.db' -> '/mnt/backup/2025-01-19_132734/snapshot_2025-01-19_132727.db'
'/host/home/core/backup/static_kuberesources_2025-01-19_132727.tar.gz' -> '/mnt/backup/2025-01-19_132734/static_kuberesources_2025-01-19_132727.tar.gz'
---
Delete persistent ETCD backups older then 45 days
---
List all etc backups
/mnt/backup/2025-01-17_211841:
total 1232944
drwxr-xr-x. 2 nobody nobody 186 Jan 18 02:26 .
drwxrwxrwx. 4 1000 1000 56 Jan 19 13:27 ..
-rw-------. 1 nobody nobody 631181344 Jan 18 02:20 snapshot_2025-01-18_021954.db
-rw-------. 1 nobody nobody 631181344 Jan 18 02:26 snapshot_2025-01-18_022638.db
-rw-------. 1 nobody nobody 80937 Jan 18 02:20 static_kuberesources_2025-01-18_021954.tar.gz
-rw-------. 1 nobody nobody 80937 Jan 18 02:26 static_kuberesources_2025-01-18_022638.tar.gz
/mnt/backup/2025-01-19_132734:
total 886804
drwxr-xr-x. 2 nobody nobody 96 Jan 19 13:27 .
drwxrwxrwx. 4 1000 1000 56 Jan 19 13:27 ..
-rw-------. 1 nobody nobody 597880864 Jan 19 13:27 snapshot_2025-01-19_132727.db
-rw-------. 1 nobody nobody 84654 Jan 19 13:27 static_kuberesources_2025-01-19_132727.tar.gz
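When the Job is launched from a batch scheduler, the checks above can be turned into an exit code. The following is a sketch that assumes the scheduler judges success by the script's return code. The function names, the polling interval, and the `OC` override are illustrative. It reads the Job's `.status.succeeded` and `.status.failed` counters, and because the template sets backoffLimit to 0, a single failed pod is treated as a failed Job.

```shell
#!/usr/bin/env bash
# OC can be overridden with a stub for testing; defaults to the real CLI.
OC="${OC:-oc}"

# Map the Job's status counters to a state name.
job_state() {
  local succeeded="${1:-0}" failed="${2:-0}"
  if [ "$succeeded" -ge 1 ]; then echo complete
  elif [ "$failed" -ge 1 ]; then echo failed
  else echo running
  fi
}

# Poll a Job until it completes or fails.
# Returns 0 on completion, 1 on failure, 2 on timeout.
wait_for_job() {
  local job="$1" ns="$2" timeout="${3:-1800}" waited=0 s f state
  while [ "$waited" -lt "$timeout" ]; do
    s="$("$OC" get job "$job" -n "$ns" -o jsonpath='{.status.succeeded}')"
    f="$("$OC" get job "$job" -n "$ns" -o jsonpath='{.status.failed}')"
    state="$(job_state "${s:-0}" "${f:-0}")"
    case "$state" in
      complete) return 0 ;;
      failed)   "$OC" logs "job/$job" -n "$ns" >&2; return 1 ;;
    esac
    sleep 10
    waited=$(( waited + 10 ))
  done
  echo "timed out waiting for job/$job" >&2
  return 2
}
```

For the run shown above, the call would be `wait_for_job ocp-etcd-backup-20250119 ocp-etcd-backup`.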