Open, last comment added on 2021/11/27, 8 comments

Chaos Engineering with Gremlin

Chaos Engineering

Goals

Be able to set up Gremlin
Understand which failures can be injected
Be able to verify behavior under failure
Run it across multiple clusters

URL

https://www.gremlin.com/

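Setup

Account registration

Register as an individual
https://www.gremlin.com/
GET STARTED -> My Self
Once registration completes, the dashboard becomes available
https://app.gremlin.com/dashboard

The 2-minute intro video gives a good overview
Failures at various layers can be injected from the GUI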

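Installation

Supported environments

Agents can be installed on IaaS, Docker, k8s, and Windows.

The Compatibility page has a table comparing which failures can be injected on each platform.
Install the agent at the layer where you want to inject failures.

To target OS-dependent processing, or to cause OS-level failures, the agent must be installed on the OS itself.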

helm install

Since the target is k8s, use Helm
- brew install helm

Kubernetes gremlin install

Helm chart to run

Set the team ID and cluster ID
Also prepare GREMLIN_TEAM_SECRET in advance

Member icon at the top right -> Team Settings -> Configuration -> Secret Key

export GREMLIN_TEAM_ID="<id>"
export GREMLIN_CLUSTER_ID="gremlins"
export GREMLIN_TEAM_SECRET="<your secret>"
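
An empty variable would be passed silently to helm as an empty `--set` value, so it can be worth checking the three exports first. The following is a minimal sketch; `require_env` is a hypothetical helper name, not part of Gremlin or Helm.

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect variable lookup that also works in plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (stops before helm install if anything is missing):
# require_env GREMLIN_TEAM_ID GREMLIN_CLUSTER_ID GREMLIN_TEAM_SECRET || exit 1
```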

Run helm

$ helm install gremlin gremlin/gremlin \
--namespace gremlin \
--set gremlin.secret.managed=true \
--set gremlin.secret.type=secret \
--set gremlin.secret.teamID=$GREMLIN_TEAM_ID \
--set gremlin.secret.clusterID=$GREMLIN_CLUSTER_ID \
--set gremlin.secret.teamSecret=$GREMLIN_TEAM_SECRET

NAME: gremlin
LAST DEPLOYED: Mon Nov 22 23:31:24 2021
NAMESPACE: gremlin
STATUS: deployed
REVISION: 1
TEST SUITE: None

$ helm ls --namespace gremlin
NAME   	NAMESPACE	REVISION	UPDATED                            	STATUS  	CHART        	APP VERSION
gremlin	gremlin  	1       	2021-11-22 23:31:24.30843 +0900 JST	deployed	gremlin-0.4.7	2.16.2

gremlin GUI

The worker nodes appear in the GUI

https://app.gremlin.com/clients/hosts

Pods are deployed on the worker nodes (k is assumed here to be a shell alias for kubectl)

$ k get po -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS        AGE
gremlin       chao-645585ddd-t84jk                       1/1     Running   0               6m27s
gremlin       gremlin-fktfb                              1/1     Running   0               6m27s
gremlin       gremlin-mk87g                              1/1     Running   0               6m27s
kube-system   calico-kube-controllers-6d8ccdbf46-pv79z   1/1     Running   22 (2d8h ago)   185d
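
To check the rollout without reading the table by eye, the listing can be piped through a small filter. This is a sketch only; `count_running` is a hypothetical helper that assumes the default column layout of `kubectl get po -A` (namespace in column 1, status in column 4).

```shell
# Count pods in the given namespace whose STATUS column is "Running".
# Reads `kubectl get po -A` style output on stdin and skips the header row.
count_running() {
  awk -v ns="$1" 'NR > 1 && $1 == ns && $4 == "Running" { n++ } END { print n + 0 }'
}

# Usage against a live cluster (not run here):
# kubectl get po -A | count_running gremlin
```

With the listing above, the count for the gremlin namespace would be the chao Deployment pod plus one gremlin DaemonSet pod per worker node.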

Ubuntu gremlin install

See Ubuntu, Debian, etc.

Start the service after installation

$ sudo systemctl start gremlin-integrations
$ systemctl status  gremlin-integrations.service
● gremlin-integrations.service - Gremlin Integration Agent
     Loaded: loaded (/etc/systemd/system/gremlin-integrations.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2021-11-27 14:18:38 JST; 5s ago
   Main PID: 3204038 (gremlin-integra)
      Tasks: 4 (limit: 18951)
     Memory: 984.0K
     CGroup: /system.slice/gremlin-integrations.service
             └─3204038 /usr/sbin/gremlin-integrations

Nov 27 14:18:38 k8s3 systemd[1]: Started Gremlin Integration Agent.
Nov 27 14:18:38 k8s3 gremlin-integrations[3204038]: 2021-11-27 14:18:38 [INFO] gremlin_integrations - Logging initialized

Pricing

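The following vary by plan:
- Number of teams
- Number of agents that can be installed
- Number of targets that can be attacked
- Support level
- Availability of enterprise features (SAML, RBAC, auditing)
The Free plan allows up to 2 agents
Exceeding the agent limit returns HTTP 402 (the first time I have seen that status code)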

$ gremlin check auth
auth
====================================================
Auth Input Type                      : Certificate
API Response                         : HTTP status client error (402 Payment Required) for url (https://api.gremlin.com/v1/clients/status)
========= Identification ============:
Gremlin Identifier                   : xxxx
Gremlin Identifier Source            : Not supplied explicitly, using the default
Team ID                              : xxxx
Team ID Source                       : config/team_id
========= Certificate Info ==========:
Team Certificate                     : Valid X509
Team Certificate Source              : config/team_certificate
Team Certificate Source Type         : loaded from file: "/var/lib/gremlin/team-gremlins-client.pub_cert.pem"
========= Private Key Info ==========:
Team Private Key                     : Valid ECDSA Private Key
Team Private Key Source              : config/team_private_key
Team Certificate Source Type         : loaded from file: "/var/lib/gremlin/team-gremlins-client.priv_key.pem"
Certificate Authorization header     : SIGNATURE ...azhBPQ==
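
The 402 is easy to miss in the middle of this output, so a scripted check may help. This sketch assumes that `gremlin check auth` prints the report to standard output in the format shown above; `over_agent_limit` is a hypothetical helper name.

```shell
# Return 0 when the `gremlin check auth` report on stdin contains an HTTP 402 response.
over_agent_limit() {
  grep -q '402 Payment Required'
}

# Usage on a host with the agent installed (not run here):
# gremlin check auth | over_agent_limit && echo "agent limit reached for this plan"
```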

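Injecting failures on Kubernetes

Choose a failure from the Stack
Scenarios that combine several failures are provided out of the box

High CPU load

A scenario that applies one-minute bursts of load in succession; utilization is shown as a graph

Node CPU usage rises

The gremlin DaemonSet pod exceeds 1 CPU

YAML of the DaemonSet pod; elevated privileges are required to inject failures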

$ k get po gremlin-fktfb -ngremlin -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.244.109.80/32
    cni.projectcalico.org/podIPs: 10.244.109.80/32
  creationTimestamp: "2021-11-22T14:31:24Z"
  generateName: gremlin-
  labels:
    app.kubernetes.io/instance: gremlin
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: gremlin
    controller-revision-hash: 6f889cdd44
    helm.sh/chart: gremlin-0.4.7
    pod-template-generation: "1"
    version: v1
  name: gremlin-fktfb
  namespace: gremlin
...
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - k8s2
...
    securityContext:
      capabilities:
        add:
        - KILL
        - NET_ADMIN
        - SYS_BOOT
        - SYS_TIME
        - SYS_ADMIN
        - SYS_PTRACE
        - SETFCAP
        - AUDIT_WRITE
        - MKNOD
        - SYS_CHROOT
        - NET_RAW
...
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
...
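
When reviewing what the agent is allowed to do, it can be handy to pull only the added Linux capabilities out of a pod dump like the one above. A minimal sketch, assuming the YAML layout shown (an `add:` key followed by `- NAME` items); `list_capabilities` is a hypothetical helper.

```shell
# Print each entry of every `add:` list found in pod YAML read from stdin.
list_capabilities() {
  awk '
    /^[ \t]*add:[ \t]*$/ { in_add = 1; next }          # start of a capabilities list
    in_add && /^[ \t]*-[ \t]/ {                         # list item: strip the "- " prefix
      sub(/^[ \t]*-[ \t]*/, ""); print; next
    }
    { in_add = 0 }                                      # any other line ends the list
  '
}

# Usage against a live cluster (not run here):
# kubectl get po gremlin-fktfb -n gremlin -o yaml | list_capabilities
```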

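Packet loss

A scenario that injects packet loss repeatedly, for up to one minute each time. The duration and the loss rate are configurable.

Failure types

See Attacks

Resource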

Gremlin	Impact
CPU	Generates high load for one or more CPU cores.
Memory	Allocates a specific amount of RAM.
IO	Puts read/write pressure on I/O devices such as hard disks.
Disk	Writes files to disk to fill it to a specific percentage.

State

Gremlin	Impact
Shutdown	Performs a shutdown (and an optional reboot) on the host operating system to test how your system behaves when losing one or more cluster machines.
Time Travel	Changes the host's system time, which can be used to simulate adjusting to daylight saving time and other time-related events.
Process Killer	Kills the specified process, which can be used to simulate application or dependency crashes. (Note: does not work for PID 1, consider a Shutdown attack instead)

Network

Gremlin	Impact
Blackhole	Drops all matching network traffic.
Latency	Injects latency into all matching egress network traffic.
Packet Loss	Induces packet loss into all matching egress network traffic.
DNS	Blocks access to DNS servers.

Permissions

kubectl apply -f https://k8s.gremlin.com/resources/gremlin-client.yaml
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-chao.yaml

A ClusterRole and a ClusterRoleBinding are required

gremlin-chao.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: gremlin
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: chao
    app.kubernetes.io/name: chao
    app.kubernetes.io/version: "1"
  name: chao
  namespace: gremlin
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: chao
      app.kubernetes.io/name: chao
      app.kubernetes.io/version: "1"
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: chao
        app.kubernetes.io/name: chao
        app.kubernetes.io/version: "1"
    spec:
      serviceAccountName: chao
      containers:
      - image: gremlin/chao:latest
        args:
        - "-team_id"
        - "<YOUR TEAM ID GOES HERE>"
        - "-cluster_id"
        - "<YOUR UNIQUE CLUSTER NAME GOES HERE>"
        - "-cert_path"
        - "/gremlin/certs/gremlin.cert"
        - "-key_path"
        - "/gremlin/certs/gremlin.key"
        imagePullPolicy: Always
        name: chao
        volumeMounts:
        - mountPath: /gremlin/certs
          name: gremlin-cert
          readOnly: true
      volumes:
      - name: gremlin-cert
        secret:
          defaultMode: 420
          secretName: gremlin-team-cert
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chao
  namespace: gremlin
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gremlin-watcher
rules:
  - apiGroups: ["apps"]
    resources: ["replicasets", "deployments", "statefulsets", "daemonsets"]
    verbs: ["get", "watch", "list"]
  - apiGroups: [""]
    resources: ["pods", "nodes", "services"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chao
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gremlin-watcher
subjects:
  - kind: ServiceAccount
    name: chao
    namespace: gremlin
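
One way to confirm that the binding took effect is `kubectl auth can-i` with impersonation of the chao ServiceAccount. The sketch below only prints the commands, one per verb and resource from the gremlin-watcher rules above, so they can be reviewed before being run; `print_rbac_checks` is a hypothetical helper, and running the printed commands requires permission to impersonate.

```shell
# Print one `kubectl auth can-i` command per verb/resource pair
# granted to the chao ServiceAccount by the gremlin-watcher ClusterRole.
print_rbac_checks() {
  sa="system:serviceaccount:gremlin:chao"
  for resource in pods nodes services \
                  replicasets.apps deployments.apps statefulsets.apps daemonsets.apps \
                  ingresses.networking.k8s.io; do
    for verb in get watch list; do
      echo "kubectl auth can-i $verb $resource --as=$sa"
    done
  done
}

# Usage: review the commands, then pipe them to sh against a live cluster.
# print_rbac_checks | sh
```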

gremlin-client.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: gremlin
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gremlin
  namespace: gremlin
  labels:
    k8s-app: gremlin
    version: v1
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gremlin
  template:
    metadata:
      labels:
        app.kubernetes.io/name: gremlin
    spec:
      # If you want to enable host-level process-killing, add this flag:
      #hostPID: true
      # If you want to enable host-level network attacks, add this flag:
      #hostNetwork: true
      containers:
      - name: gremlin
        image: gremlin/gremlin
        args: [ "daemon" ]
        imagePullPolicy: Always
        securityContext:
          capabilities:
            add:
              - NET_ADMIN
              - SYS_BOOT
              - SYS_TIME
              - KILL
        env:
          - name: GREMLIN_TEAM_ID
            value: <YOUR TEAM ID GOES HERE>
          - name: GREMLIN_TEAM_PRIVATE_KEY_OR_FILE
            value: file:///var/lib/gremlin/cert/gremlin.key
          - name: GREMLIN_TEAM_CERTIFICATE_OR_FILE
            value: file:///var/lib/gremlin/cert/gremlin.cert
          - name: GREMLIN_IDENTIFIER
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        volumeMounts:
          - name: docker-sock
            mountPath: /var/run/docker.sock
          - name: gremlin-state
            mountPath: /var/lib/gremlin
          - name: gremlin-logs
            mountPath: /var/log/gremlin
          - name: shutdown-trigger
            mountPath: /sysrq
          - name: gremlin-cert
            mountPath: /var/lib/gremlin/cert
            readOnly: true
      volumes:
        # Gremlin uses the Docker socket to discover eligible containers to attack,
        # and to launch Gremlin sidecar containers
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
        # The Gremlin daemon communicates with Gremlin sidecars via its state directory.
        # This should be shared with the Kubernetes host
        - name: gremlin-state
          hostPath:
            path: /var/lib/gremlin
        # The Gremlin daemon forwards logs from the Gremlin sidecars to the Gremlin control plane
        # These logs should be shared with the host
        - name: gremlin-logs
          hostPath:
            path: /var/log/gremlin
        # If you want to run shutdown attacks on the host, the Gremlin Daemon requires a /proc/sysrq-trigger:/sysrq mount
        - name: shutdown-trigger
          hostPath:
            path: /proc/sysrq-trigger
        - name: gremlin-cert
          secret:
            secretName: gremlin-team-cert

Injecting failures on the host OS

High CPU load

The Gremlin agent is generating the load (the gremlin process shows 392.7% CPU)

top - 15:43:17 up 3 min,  1 user,  load average: 1.45, 0.50, 0.19
Tasks: 177 total,   2 running, 175 sleeping,   0 stopped,   0 zombie
%Cpu0  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 99.3 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 97.0 us,  3.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 98.7 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15881.3 total,  14453.9 free,    371.3 used,   1056.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  15223.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   3100 gremlin   20   0   32812   9600   8528 S 392.7   0.1   1:21.72 gremlin
   1529 gremlin   20   0   24268   9384   8036 S   2.0   0.1   0:00.47 gremlind

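Unsurprisingly, the load is not attributed to any workload on the k8s side, because the agent runs directly on the host; only the node-level total reflects it (k8s3 at 80%)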

Every 2.0s: kubectl top node                                                                              k8s1: Sat Nov 27 15:45:49 2021

NAME   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s1   322m         8%     2989Mi          38%
k8s2   90m          2%     2048Mi          12%
k8s3   3201m        80%    560Mi           3%
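
To pick out the affected node from this view automatically, the table can be filtered by its CPU% column. A small sketch, assuming the column layout of `kubectl top node` shown above; `busy_nodes` is a hypothetical helper.

```shell
# Print the names of nodes whose CPU% is at or above the given threshold.
# Reads `kubectl top node` style output on stdin; the header row is skipped.
busy_nodes() {
  awk -v limit="$1" '$1 != "NAME" && NF >= 5 {
    pct = $3
    sub(/%/, "", pct)              # "80%" -> "80"
    if (pct + 0 >= limit) print $1
  }'
}

# Usage against a live cluster (not run here):
# kubectl top node | busy_nodes 50
```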

パケロス

Inject 80% packet loss.
The second run, made while the attack was active, could not be measured.

$ iperf -c k8s2
------------------------------------------------------------
Client connecting to k8s2, TCP port 5001
TCP window size:  332 KByte (default)
------------------------------------------------------------
[  3] local 192.168.5.112 port 40836 connected with 192.168.5.111 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   935 Mbits/sec
$ iperf -c k8s2
connect failed: Operation now in progress
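
For comparing runs before and during an attack, the bandwidth figure can be extracted from the iperf report. A sketch assuming the iperf 2 summary format shown above; `iperf_bandwidth` is a hypothetical helper, and it prints nothing when the connection fails, as in the second run.

```shell
# Print the bandwidth value and unit from an iperf summary line on stdin.
iperf_bandwidth() {
  awk '/bits\/sec/ { print $(NF - 1), $NF }'
}

# Usage (not run here):
# iperf -c k8s2 2>&1 | iperf_bandwidth
```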