Scheduled etcd Backups on OpenShift

Published 2025/01/19

If two of the three master nodes fail, etcd loses quorum and has to be restored from a backup. To be ready for that, etcd on OpenShift needs to be backed up regularly. [1]
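
The arithmetic behind that statement can be sketched as follows, assuming the usual majority rule for etcd quorum (more than half of the members must be available). The function names are illustrative and are not part of any OpenShift tooling.

```shell
#!/usr/bin/env bash
# Quorum is a strict majority of the member count.
etcd_quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Number of members that can fail while quorum is still held.
etcd_fault_tolerance() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

echo "members=3 quorum=$(etcd_quorum 3) tolerated_failures=$(etcd_fault_tolerance 3)"
```

With three masters the quorum is two, so only one failure is tolerated; losing a second master is what makes a restore necessary.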

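By default, OpenShift provides a wrapper shell script for etcd backups on each master node.
The Red Hat blog posts below also describe how to implement scheduled backups; they use a CronJob. [2][3]
In this case, batch execution is driven by JP1, so the backup had to be built as a Job resource rather than a CronJob, and I customized the approach accordingly.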

Backing up etcd
oc debug --as-root node/<master node>
chroot /host
/usr/local/bin/cluster-backup.sh /home/core/assets/backup

[1]: Backing up etcd data - Control plane backup and restore | Backup and restore | OpenShift Container Platform 4.16
[2]: OCP Disaster Recovery Part 1 - How to Create Automated ETCD Backup in Openshift 4.x
[3]: OCP Disaster Recovery Part 4 - How to GitOps-ify Automated etcd Backups to a PersistentVolume in OpenShift 4.x

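Procedure

Six resources are deployed this time. As a prerequisite, storage for keeping the backups must already be available; here I use my home NAS, mounted over NFS.
Because the Job runs the wrapper script on the node, the service account is also granted additional privileges.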

  • ServiceAccount
  • ClusterRole
  • ClusterRoleBinding
  • PersistentVolume
  • PersistentVolumeClaim
  • Job

The overall flow is as follows.

  • Create the Namespace
  • Create the ServiceAccount, ClusterRole, and ClusterRoleBinding
  • Grant privileges to the ServiceAccount
  • Create the PV and PVC
  • Deploy the Job

OpenShift has a resource type called Template. Three templates are deployed this time.

  • ocp-etcd-backup-serviceaccounts.yaml: deploys the service account and roles
  • ocp-etcd-backup-job-pv.yaml: PV settings for backup storage
  • ocp-etcd-backup-job-template.yaml: Job template

Creating the Namespace

Create a dedicated Namespace to hold the resources.

oc new-project ocp-etcd-backup --description "Openshift Backup Automation Tool" --display-name "Backup ETCD Automation"

Creating the Service Account

Create the service account used by the backup container from a Template resource.

Creating from the Template resource
oc apply -f ocp-etcd-backup-serviceaccounts.yaml -n ocp-etcd-backup
oc process ocp-etcd-backup-serviceaccounts -p NAMESPACE=ocp-etcd-backup | oc apply -f -
ocp-etcd-backup-serviceaccounts.yaml
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: ocp-etcd-backup-serviceaccounts
  annotations:
    description: "Backup Jobs Templates"
    iconClass: "icon-openshift"
objects:
- kind: ServiceAccount
  apiVersion: v1
  metadata:
    name: openshift-backup
    namespace: ${NAMESPACE}
    labels:
      app: openshift-backup
- apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: cluster-etcd-backup
  rules:
  - apiGroups: [""]
    resources:
      - "nodes"
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources:
      - "pods"
      - "pods/log"
    verbs: ["get", "list", "create", "delete", "watch"]
- kind: ClusterRoleBinding
  apiVersion: rbac.authorization.k8s.io/v1
  metadata:
    name: openshift-backup
    labels:
      app: openshift-backup
  subjects:
    - kind: ServiceAccount
      name: openshift-backup
      namespace: ${NAMESPACE}
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: cluster-etcd-backup
parameters:
- name: NAMESPACE
  description: NAMESPACE used for deployment jobs
  value: ocp-etcd-backup
  required: true
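
Once the template has been applied, the ClusterRole can be spot-checked by impersonating the service account. This is a minimal sketch, assuming the namespace and service account names used above; `sa_subject` and `sa_can_i` are hypothetical helper names, and the check relies on `oc auth can-i` with the `--as` flag.

```shell
#!/usr/bin/env bash
# Build the user name that the API server uses for a service account.
sa_subject() {
  local namespace=$1 name=$2
  echo "system:serviceaccount:${namespace}:${name}"
}

# Ask the API server whether the backup service account may perform
# the given verb on the given resource.
sa_can_i() {
  local verb=$1 resource=$2
  oc auth can-i "$verb" "$resource" --as "$(sa_subject ocp-etcd-backup openshift-backup)"
}

# Example (requires a logged-in oc session with impersonation rights):
#   sa_can_i list nodes
#   sa_can_i create pods
```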

Granting privileges to the ServiceAccount

The Job needs to access the script on the master node, so grant the privileged SCC to the service account.

oc adm policy add-scc-to-user privileged -z openshift-backup -n ocp-etcd-backup

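Creating the PV and PVC

Create a PV and PVC to store the snapshots on the storage. They are deployed with ReadWriteMany so that other containers in the same Namespace can also use the PV.
As before, they are created from a template.

Creating the PV and PVC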
oc apply -f ocp-etcd-backup-job-pv.yaml -n ocp-etcd-backup
oc process ocp-etcd-backup-job-pv -p NAMESPACE=ocp-etcd-backup -p VOLUME_NAME=openshift-backup -p STORAGE_SERVER=<Storageのホスト名/IP> -p PVC_STORAGE=100Gi | oc apply -f -
ocp-etcd-backup-job-pv.yaml
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: ocp-etcd-backup-job-pv
  annotations:
    description: "Backup Jobs pv Templates"
    iconClass: "icon-openshift"
objects:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: ${VOLUME_NAME}
  spec:
    capacity:
      storage: ${PVC_STORAGE}
    accessModes:
    - ReadWriteMany
    nfs:
      path: /mnt/share/etcd-backup
      server: ${STORAGE_SERVER}
    persistentVolumeReclaimPolicy: Retain
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: ${VOLUME_NAME}
    namespace: ${NAMESPACE}
  spec:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: ${PVC_STORAGE}
    volumeName: ${VOLUME_NAME}
    storageClassName: ""
parameters:
- name: NAMESPACE
  description: NAMESPACE used for deployment jobs
  value: ocp-etcd-backup
  required: true
- name: VOLUME_NAME
  description: VOLUME NAME used for deployment jobs
  value: openshift-backup
  required: true
- name: PVC_STORAGE
  description: Storage capacity used for deployment jobs
  value: 100Gi
  required: true
- name: STORAGE_SERVER
  description: Volume hostname or IP address used for deployment jobs
  required: true

Once they have been created, check the PV and PVC.

oc get pv
oc get pvc -n ocp-etcd-backup
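
In a batch script the same check can be automated rather than read by eye. The sketch below assumes the PVC name and namespace used in this article; `pvc_phase` and `pvc_is_bound` are hypothetical helper names built on `oc get` with a JSONPath output.

```shell
#!/usr/bin/env bash
# Print the phase of a PVC (for example "Pending" or "Bound").
pvc_phase() {
  local name=$1 namespace=$2
  oc get pvc "$name" -n "$namespace" -o jsonpath='{.status.phase}'
}

# Succeed only when the PVC is Bound.
pvc_is_bound() {
  [ "$(pvc_phase "$1" "$2")" = "Bound" ]
}

# Example (requires a logged-in oc session):
#   pvc_is_bound openshift-backup ocp-etcd-backup || echo "PVC is not bound yet" >&2
```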

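Creating the Job template

Create the Job resource from the template and check the result.
When this is wrapped in a shell script, change the Job name to something unique such as the date so that repeated runs do not collide.
ttlSecondsAfterFinished is also set so that the Job's logs remain viewable after it finishes; the value in the template, 1209600 seconds, keeps them for 14 days.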
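
A wrapper along these lines could be what the batch scheduler calls. It is only a sketch: `backup_job_name` and `run_backup_job` are hypothetical names, and the parameters are the ones passed to `oc process` in this article.

```shell
#!/usr/bin/env bash
# Build a Job name such as ocp-etcd-backup-20250119 from a date (default: today).
backup_job_name() {
  local day=${1:-$(date +%Y%m%d)}
  echo "ocp-etcd-backup-${day}"
}

# Render the Job from the template and apply it under the generated name.
run_backup_job() {
  local job_name
  job_name=$(backup_job_name "$@")
  oc process ocp-etcd-backup-job-template -n ocp-etcd-backup \
    -p NAMESPACE=ocp-etcd-backup \
    -p SERVICE_ACCOUNT_NAME=openshift-backup \
    -p JOB_NAME="$job_name" \
    -p PVC_NAME=openshift-backup | oc apply -f -
}
```

A date-only suffix is unique per day. If the Job may run more than once a day, passing a finer-grained suffix such as `"$(date +%Y%m%d-%H%M%S)"` keeps the names distinct.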

Creating the Job template
oc apply -f ocp-etcd-backup-job-template.yaml -n ocp-etcd-backup
oc process ocp-etcd-backup-job-template -p NAMESPACE=ocp-etcd-backup -p SERVICE_ACCOUNT_NAME=openshift-backup -p JOB_NAME=ocp-etcd-backup-20250119 -p PVC_NAME=openshift-backup | oc apply -f -
PVC creation result
$ oc get pvc -n ocp-etcd-backup
NAME               STATUS   VOLUME             CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
openshift-backup   Bound    openshift-backup   100Gi      RWX                           <unset>                 3m14s
ocp-etcd-backup-job-template.yaml
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: ocp-etcd-backup-job-template
  annotations:
    description: "Backup Jobs Templates"
    iconClass: "icon-openshift"
objects:
- apiVersion: batch/v1
  kind: Job
  metadata:
    name: ${JOB_NAME}
    namespace: ${NAMESPACE}
  spec:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 1800
    backoffLimit: 0
    ttlSecondsAfterFinished: 1209600
    template:
      metadata:
        labels:
          app: openshift-backup
      spec:
        nodeSelector:
          node-role.kubernetes.io/master: ''
        restartPolicy: Never
        activeDeadlineSeconds: 500
        serviceAccountName: ${SERVICE_ACCOUNT_NAME}
        hostPID: true
        hostNetwork: true
        enableServiceLinks: true
        schedulerName: default-scheduler
        terminationGracePeriodSeconds: 30
        securityContext: {}
        containers:
          - resources: {}
            terminationMessagePath: /dev/termination-log
            name: openshift-backup
            command:
            - /bin/bash
            - '-c'
            - >-
              echo -e '\n\n---\nCreate etcd backup local to master\n' &&
              chroot /host /usr/local/bin/cluster-backup.sh /home/core/backup/ &&
              echo -e '\n\n---\nCleanup old local etcd backups\n' &&
              chroot /host find /home/core/backup/ -type f -mmin +"2" -delete &&
              echo -e '\n\n---\nCopy etcd backup to persistent volume\n' &&
              BACKUP_DIR=/mnt/backup/$(date "+%F_%H%M%S") &&
              mkdir -pv "$BACKUP_DIR" &&
              cp -v /host/home/core/backup/* "$BACKUP_DIR" &&
              echo -e "\n\n---\nDelete persistent ETCD backups older then ${DAYS_TO_KEEP_PERSISTENT_ETCD_BACKUPS} days\n" &&
              find /mnt/backup/ -mindepth 1 -maxdepth 1 -type d -mtime +${DAYS_TO_KEEP_PERSISTENT_ETCD_BACKUPS} -exec rm -rv {} \; &&
              echo -e '\n\n---\nList all etc backups\n' &&
              ls -al /mnt/backup/*
            env:
            - name: DAYS_TO_KEEP_PERSISTENT_ETCD_BACKUPS
              value: "45"
            securityContext:
              privileged: true
              runAsUser: 0
              capabilities:
                add:
                  - SYS_CHROOT
            imagePullPolicy: Always
            volumeMounts:
              - name: backup
                mountPath: /mnt/backup
              - name: host
                mountPath: /host
            terminationMessagePolicy: File
            image: ${IMAGE_NAME}
        volumes:
        - name: backup
          persistentVolumeClaim:
            claimName: ${PVC_NAME}
        - name: host
          hostPath:
            path: /
            type: Directory
        dnsPolicy: ClusterFirst
        tolerations:
        - key: node-role.kubernetes.io/master
parameters:
- name: NAMESPACE
  description: NAMESPACE used for deployment jobs
  value: ocp-etcd-backup
  required: true
- name: SERVICE_ACCOUNT_NAME
  description: SERVICEACCOUNT used for deployment jobs
  value: openshift-backup
  required: true
- name: JOB_NAME
  description: JOB NAME used for deployment jobs
  value: openshift-backup
  required: true
- name: PVC_NAME
  description: PVC used for deployment jobs
  value: openshift-backup
  required: true
- name: IMAGE_NAME
  description: IMAGE used for deployment jobs
  value: registry.redhat.io/openshift4/ose-cli
  required: true
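
The retention step in the Job command only has something to delete once backups are older than DAYS_TO_KEEP_PERSISTENT_ETCD_BACKUPS, so it is worth trying out on a throwaway directory first. The sketch below is an illustration, not part of the template: `prune_backups` is a hypothetical name, GNU find and touch are assumed, and find is limited to first-level directories with -mindepth 1 -maxdepth 1 so that it does not try to descend into a directory that rm has just removed.

```shell
#!/usr/bin/env bash
# Delete first-level backup directories older than the given number of days.
# -maxdepth 1 keeps find from descending into directories it has just removed.
prune_backups() {
  local root=$1 days=$2
  find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" -exec rm -rv {} \;
}

# Dry run against a temporary directory (GNU touch is assumed for -d).
demo_root=$(mktemp -d)
mkdir "$demo_root/2024-11-01_000000" "$demo_root/2025-01-19_132734"
touch -d '60 days ago' "$demo_root/2024-11-01_000000"
prune_backups "$demo_root" 45
ls "$demo_root"
```

After the run, only the directory with the recent modification time should remain under the temporary root.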

Checking the result

After running it, confirm that the Job completed successfully.
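
When the Job is launched from a batch scheduler, the script usually has to block until the Job finishes and return an exit code. A minimal sketch using `oc wait` follows; `wait_for_backup_job` is a hypothetical name, and the default timeout of 1800s simply mirrors the Job's activeDeadlineSeconds.

```shell
#!/usr/bin/env bash
# Block until the Job reports the Complete condition or the timeout expires.
# Returns non-zero on timeout, which a batch scheduler can treat as a failure.
wait_for_backup_job() {
  local job_name=$1 timeout=${2:-1800s}
  oc wait --for=condition=complete "job/${job_name}" \
    -n ocp-etcd-backup --timeout="$timeout"
}

# Example (requires a logged-in oc session):
#   wait_for_backup_job ocp-etcd-backup-20250119 \
#     && oc logs job/ocp-etcd-backup-20250119 -n ocp-etcd-backup
```

Note that a Job that fails never reaches the Complete condition, so this sketch reports the failure only once the timeout has elapsed.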

Job result check
$ oc get jobs -n ocp-etcd-backup
NAME                       COMPLETIONS   DURATION   AGE
ocp-etcd-backup-20250119   1/1           35s        45s

$ oc get pod -n ocp-etcd-backup
NAME                             READY   STATUS      RESTARTS   AGE
ocp-etcd-backup-20250119-xc7ws   0/1     Completed   0          94s

$ oc logs pod/ocp-etcd-backup-20250119-xc7ws -n ocp-etcd-backup


---
Create etcd backup local to master

Certificate /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt is missing. Checking in different directory
Certificate /etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt found!
found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-47
found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-5
found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-7
found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-8
9ffab15f27c810e8afb3362002c29f90b6a823c694d10d5e840d16e2df814dac
etcdctl version: 3.5.13
API version: 3.5
{"level":"info","ts":"2025-01-19T13:27:29.546211Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/home/core/backup//snapshot_2025-01-19_132727.db.part"}
{"level":"info","ts":"2025-01-19T13:27:29.553943Z","logger":"client","caller":"v3@v3.5.13/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2025-01-19T13:27:29.554001Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://xxx.xx.xx.xx:2379"}
{"level":"info","ts":"2025-01-19T13:27:32.859437Z","logger":"client","caller":"v3@v3.5.13/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2025-01-19T13:27:33.473352Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://xxx.xx.xx.xx:2379","size":"598 MB","took":"3 seconds ago"}
{"level":"info","ts":"2025-01-19T13:27:33.473462Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/home/core/backup//snapshot_2025-01-19_132727.db"}
Snapshot saved at /home/core/backup//snapshot_2025-01-19_132727.db
{"hash":1173464372,"revision":93743565,"totalKey":63990,"totalSize":597880832}
snapshot db and kube resources are successfully saved to /home/core/backup/


---
Cleanup old local etcd backups



---
Copy etcd backup to persistent volume

mkdir: created directory '/mnt/backup/2025-01-19_132734'
'/host/home/core/backup/snapshot_2025-01-19_132727.db' -> '/mnt/backup/2025-01-19_132734/snapshot_2025-01-19_132727.db'
'/host/home/core/backup/static_kuberesources_2025-01-19_132727.tar.gz' -> '/mnt/backup/2025-01-19_132734/static_kuberesources_2025-01-19_132727.tar.gz'


---
Delete persistent ETCD backups older then 45 days



---
List all etc backups

/mnt/backup/2025-01-17_211841:
total 1232944
drwxr-xr-x. 2 nobody nobody       186 Jan 18 02:26 .
drwxrwxrwx. 4   1000   1000        56 Jan 19 13:27 ..
-rw-------. 1 nobody nobody 631181344 Jan 18 02:20 snapshot_2025-01-18_021954.db
-rw-------. 1 nobody nobody 631181344 Jan 18 02:26 snapshot_2025-01-18_022638.db
-rw-------. 1 nobody nobody     80937 Jan 18 02:20 static_kuberesources_2025-01-18_021954.tar.gz
-rw-------. 1 nobody nobody     80937 Jan 18 02:26 static_kuberesources_2025-01-18_022638.tar.gz

/mnt/backup/2025-01-19_132734:
total 886804
drwxr-xr-x. 2 nobody nobody        96 Jan 19 13:27 .
drwxrwxrwx. 4   1000   1000        56 Jan 19 13:27 ..
-rw-------. 1 nobody nobody 597880864 Jan 19 13:27 snapshot_2025-01-19_132727.db
-rw-------. 1 nobody nobody     84654 Jan 19 13:27 static_kuberesources_2025-01-19_132727.tar.gz
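
As the listing shows, each run leaves a snapshot_*.db file and a static_kuberesources_*.tar.gz file in its directory. A small check like the one below can confirm that both are present and non-empty; it is a sketch, `backup_dir_complete` is a hypothetical name, and it says nothing about whether the snapshot itself is restorable.

```shell
#!/usr/bin/env bash
# Succeed only if the directory holds at least one non-empty snapshot_*.db
# and at least one non-empty static_kuberesources_*.tar.gz.
backup_dir_complete() {
  local dir=$1 pattern file found
  for pattern in 'snapshot_*.db' 'static_kuberesources_*.tar.gz'; do
    found=0
    # $pattern is left unquoted on purpose so that the shell expands the glob.
    for file in "$dir"/$pattern; do
      [ -s "$file" ] && found=1
    done
    [ "$found" -eq 1 ] || return 1
  done
}

# Example, run where the backup volume is mounted:
#   backup_dir_complete /mnt/backup/2025-01-19_132734 || echo "backup incomplete" >&2
```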
