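Chaos Engineering with Chaos Mesh
Goals
- Be able to install Chaos Mesh
- Understand which faults it can inject
- Be able to verify the injected faults
- Be able to run it across multiple clusters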
URL
Installation
Installing in a local environment
"Run Your First Chaos Experiment in 10 Minutes" introduces kind or Minikube.
~This time I use kind on a Mac~ kind did not support arm.
Installing into a cluster with Helm
Installing on GKE
% k get nodes -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
gke-chaos-mesh-default-pool-efa3374e-czdh Ready <none> 94m v1.21.5-gke.1802 10.128.0.9 34.71.110.66 Container-Optimized OS from Google 5.4.144+ containerd://1.4.8
gke-chaos-mesh-default-pool-efa3374e-jcfg Ready <none> 94m v1.21.5-gke.1802 10.128.0.8 34.132.77.142 Container-Optimized OS from Google 5.4.144+ containerd://1.4.8
gke-chaos-mesh-default-pool-efa3374e-m6s9 Ready <none> 94m v1.21.5-gke.1802 10.128.0.10 34.66.215.75 Container-Optimized OS from Google 5.4.144+ containerd://1.4.8
% k create ns chaos-testing
% helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock --set prometheus.create=true --set dashboard.create=true --version 2.0.5
NAME: chaos-mesh
LAST DEPLOYED: Mon Nov 29 22:30:15 2021
NAMESPACE: chaos-testing
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Make sure chaos-mesh components are running
kubectl get pods --namespace chaos-testing -l app.kubernetes.io/instance=chaos-mesh
The components run in the chaos-testing namespace.
% kubectl get pods --namespace chaos-testing -l app.kubernetes.io/instance=chaos-mesh -w
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-86756484c9-n7ntq 1/1 Running 0 34s
chaos-daemon-gmcng 1/1 Running 0 35s
chaos-daemon-m4hcl 1/1 Running 0 35s
chaos-daemon-p88sh 1/1 Running 0 35s
chaos-dashboard-765547fbcb-wdlcr 1/1 Running 0 34s
chaos-prometheus-6b9f855bb4-vgtv5 1/1 Running 0 34s
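The helm install command above refers to the chart as chaos-mesh/chaos-mesh, which only resolves once a chart repo named chaos-mesh is registered. That step is not shown in these notes. A minimal sketch, assuming the public chart repository at https://charts.chaos-mesh.org; the helper name setup_chaos_mesh_repo is hypothetical:

```shell
# Hypothetical helper (not in the original notes): register the Chaos Mesh
# chart repository under the name "chaos-mesh" and refresh the local index.
setup_chaos_mesh_repo() {
  helm repo add chaos-mesh https://charts.chaos-mesh.org
  helm repo update
}

# Usage: run once before `helm install chaos-mesh chaos-mesh/chaos-mesh ...`
# setup_chaos_mesh_repo
```

The function is only defined here; nothing touches the local Helm configuration until it is called.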
Architecture
- Chaos Dashboard
  - Provides the UI
- Chaos Controller Manager
  - Handles scheduling and management
- Chaos Daemon
  - Executes the chaos experiments
Features
- Basic resource faults:
  - PodChaos: simulates Pod failures, such as Pod node restart, Pod's persistent unavailability, and certain container failures in a specific Pod.
  - NetworkChaos: simulates network failures, such as network latency, packet loss, packet disorder, and network partitions.
  - DNSChaos: simulates DNS failures, such as the parsing failure of DNS domain name and the wrong IP address returned.
  - HTTPChaos: simulates HTTP communication failures, such as HTTP communication latency.
  - StressChaos: simulates CPU race or memory race.
  - IOChaos: simulates the I/O failure of an application file, such as I/O delays, read and write failures.
  - TimeChaos: simulates the time jump exception.
  - KernelChaos: simulates kernel failures, such as an exception of the application memory allocation.
- Platform faults:
  - AWSChaos: simulates AWS platform failures, such as the AWS node restart.
  - GCPChaos: simulates GCP platform failures, such as the GCP node restart.
- Application faults:
  - JVMChaos: simulates JVM application failures, such as the function call delay.
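Each fault type in the list above is a custom resource in the chaos-mesh.org/v1alpha1 API group, so a fault can be declared as a manifest. A minimal sketch for PodChaos, assuming a target workload labelled app: web in the default namespace (both are placeholders, not from these notes):

```shell
# Write a PodChaos manifest that kills one Pod matching the selector.
# The target namespace "default" and the label "app: web" are placeholders.
cat > pod-kill.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web
EOF

# Apply it against a cluster where Chaos Mesh is installed:
# kubectl apply -f pod-kill.yaml
```

With mode: one, a single Pod is picked at random from those matching the selector.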
Configuration via the GUI
User Permission
Configuration is done from chaos-dashboard, which provides the GUI.
To connect remotely, look up the Service port:
❯❯❯ kubectl get svc -n chaos-testing
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
chaos-daemon ClusterIP None <none> 31767/TCP,31766/TCP 2d23h
chaos-dashboard NodePort 10.72.13.237 <none> 2333:32669/TCP 2d23h
chaos-mesh-controller-manager ClusterIP 10.72.0.138 <none> 443/TCP,10081/TCP,10080/TCP 2d23h
chaos-prometheus ClusterIP 10.72.0.232 <none> 9090/TCP 2d23h
Use port-forward to connect the dashboard Service to local port 8080.
❯❯❯ kubectl port-forward svc/chaos-dashboard 8080:2333 -nchaos-testing
Forwarding from 127.0.0.1:8080 -> 2333
Forwarding from [::1]:8080 -> 2333
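Opening http://localhost:8080/dashboard in a browser shows the RBAC setup screen.
From "Click here to generate" you can set up permissions from a template.
- Scope: cluster-wide or per namespace
- Role: manager or viewer
Example configuration
An example that grants manager permission for the whole cluster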
rbac.yaml
kind: ServiceAccount
apiVersion: v1
metadata:
  namespace: default
  name: account-default-manager-gocbk
---
kind: ServiceAccount
apiVersion: v1
metadata:
  namespace: default
  name: account-cluster-manager-bxwkr
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: role-cluster-manager-bxwkr
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "watch", "list"]
- apiGroups:
  - chaos-mesh.org
  resources: [ "*" ]
  verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: bind-cluster-manager-bxwkr
subjects:
- kind: ServiceAccount
  name: account-cluster-manager-bxwkr
  namespace: default
roleRef:
  kind: ClusterRole
  name: role-cluster-manager-bxwkr
  apiGroup: rbac.authorization.k8s.io
Registering the RBAC resources in the cluster and linking them to Chaos Mesh
% kubectl apply -f rbac.yaml
serviceaccount/account-cluster-manager-bxwkr created
clusterrole.rbac.authorization.k8s.io/role-cluster-manager-bxwkr created
clusterrolebinding.rbac.authorization.k8s.io/bind-cluster-manager-bxwkr created
% kubectl describe secrets account-cluster-manager-bxwkr
Name: account-cluster-manager-bxwkr-token-m2dz6
Namespace: default
Labels: <none>
Annotations: kubernetes.io/service-account.name: account-cluster-manager-bxwkr
kubernetes.io/service-account.uid: cc9e63ea-ec00-4c0b-bb65-d1b11a08691f
Type: kubernetes.io/service-account-token
Data
====
ca.crt: 1509 bytes
namespace: 7 bytes
token: eyJhbG...
Enter the token obtained with describe into the RBAC setup screen.
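Copying the token out of the describe output by hand is error-prone, so it can also be extracted directly. A minimal sketch, assuming the cluster still auto-creates a token Secret for each ServiceAccount (as this v1.21 cluster does; newer Kubernetes versions may not); the helper name sa_token is hypothetical:

```shell
# Hypothetical helper: print the decoded token of a ServiceAccount.
# $1 = ServiceAccount name, $2 = namespace (defaults to "default").
sa_token() {
  local sa="$1" ns="${2:-default}" secret
  # Name of the token Secret attached to the ServiceAccount.
  secret=$(kubectl get serviceaccount "$sa" -n "$ns" -o jsonpath='{.secrets[0].name}')
  # Secret data is base64-encoded, so decode it before use.
  kubectl get secret "$secret" -n "$ns" -o jsonpath='{.data.token}' | base64 --decode
}

# Usage:
# sa_token account-cluster-manager-bxwkr default
```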
Running a test
Chaos Mesh calls a test an "experiment".
You define the fault you want to inject as an experiment, then either run it on its own or combine several into a scenario with a workflow.
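To illustrate the difference between a single experiment and a workflow, the following is a rough sketch of a Workflow that runs two experiments one after the other. The template names, deadlines, and the app: web selector are assumptions for illustration, not taken from these notes:

```shell
# Write a Workflow manifest: a Serial entry template that first runs a CPU
# stress experiment and then kills one Pod. All names are placeholders.
cat > serial-workflow.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: serial-workflow-example
  namespace: chaos-testing
spec:
  entry: the-entry
  templates:
    - name: the-entry
      templateType: Serial
      deadline: 240s
      children:
        - step-cpu-stress
        - step-pod-kill
    - name: step-cpu-stress
      templateType: StressChaos
      deadline: 60s
      stressChaos:
        mode: one
        selector:
          namespaces:
            - default
          labelSelectors:
            app: web
        stressors:
          cpu:
            workers: 1
            load: 50
    - name: step-pod-kill
      templateType: PodChaos
      deadline: 30s
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces:
            - default
          labelSelectors:
            app: web
EOF

# kubectl apply -f serial-workflow.yaml
```

Changing templateType of the entry from Serial to Parallel would start both children at the same time.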
Run a test that applies CPU load to a specific Pod.
The GUI shows the progress of the run,
and the Pod is indeed under load.
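The same kind of CPU load experiment can also be declared as a manifest instead of through the GUI. A minimal sketch, assuming a target Pod labelled app: web in the default namespace; the load, worker count, and duration are illustrative values, not the ones used in this run:

```shell
# Write a StressChaos manifest that loads the CPU of one matching Pod.
# Selector values, load, and duration are placeholders.
cat > cpu-stress.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-example
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web
  stressors:
    cpu:
      workers: 1
      load: 80
  duration: "60s"
EOF

# Apply the experiment, then watch the Pod's CPU usage while it runs:
# kubectl apply -f cpu-stress.yaml
# kubectl top pod -n default -l app=web
```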
This scrap was closed on 2023/01/02.