
Chaos Engineering with Gremlin

daiskobadaiskoba

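Installation

Supported environments

  • Can be installed on IaaS, Docker, k8s, and Windows

The Compatibility page has a comparison table of the failures that can be injected in each environment.
Install the agent at the layer where you want to inject failures.

For OS-dependent processing, or to cause OS-level failures, the agent must be installed on the OS itself.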

helm install

  • Helm is used because the target is k8s
    • brew install helm

Kubernetes gremlin install

https://www.gremlin.com/docs/infrastructure-layer/installation/#installation

Helm chart to run

  • Set the team ID and cluster ID
    Also create GREMLIN_TEAM_SECRET beforehand

Member icon at the top right -> Team Settings -> Configuration -> Secret Key

export GREMLIN_TEAM_ID="<id>"
export GREMLIN_CLUSTER_ID="gremlins"
export GREMLIN_TEAM_SECRET="<your secret>"
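
Because the helm command reads these three variables, an empty one would be passed to the chart without any warning. The following is a minimal sketch of a pre-flight check; `require_env` is a hypothetical helper name introduced here, not part of helm or Gremlin.

```shell
# Fail early if any variable needed by the helm command is empty or unset.
# require_env is a hypothetical helper, not something helm or Gremlin ships.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect variable lookup that also works in plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: run the helm command only when all three values are present.
# require_env GREMLIN_TEAM_ID GREMLIN_CLUSTER_ID GREMLIN_TEAM_SECRET || exit 1
```

The function returns non-zero and names each missing variable on stderr, so it can guard the install step in a script.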

Run helm

$ helm install gremlin gremlin/gremlin \
--namespace gremlin \
--set gremlin.secret.managed=true \
--set gremlin.secret.type=secret \
--set gremlin.secret.teamID=$GREMLIN_TEAM_ID \
--set gremlin.secret.clusterID=$GREMLIN_CLUSTER_ID \
--set gremlin.secret.teamSecret=$GREMLIN_TEAM_SECRET

NAME: gremlin
LAST DEPLOYED: Mon Nov 22 23:31:24 2021
NAMESPACE: gremlin
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ helm ls --namespace gremlin
NAME   	NAMESPACE	REVISION	UPDATED                            	STATUS  	CHART        	APP VERSION
gremlin	gremlin  	1       	2021-11-22 23:31:24.30843 +0900 JST	deployed	gremlin-0.4.7	2.16.2
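
To check the release state in a script rather than by eye, the tab-separated `helm ls` output shown above can be filtered. This is a sketch only; `release_status` is a hypothetical helper name, and it assumes the column layout printed above (STATUS in the fifth tab-separated column).

```shell
# release_status is a hypothetical helper: it reads `helm ls` output on stdin
# and prints the STATUS column for the release named in $1.
release_status() {
  awk -F '\t' -v rel="$1" '
    NR > 1 {
      name = $1; status = $5
      gsub(/^ +| +$/, "", name)     # strip the column padding
      gsub(/^ +| +$/, "", status)
      if (name == rel) print status
    }'
}

# Usage:
# helm ls --namespace gremlin | release_status gremlin
```

For the listing above this prints `deployed`; nothing is printed when the release is absent.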

gremlin GUI

The worker nodes appear in the GUI

https://app.gremlin.com/clients/hosts

Pods are deployed on the worker nodes

k get po -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS        AGE
gremlin       chao-645585ddd-t84jk                       1/1     Running   0               6m27s
gremlin       gremlin-fktfb                              1/1     Running   0               6m27s
gremlin       gremlin-mk87g                              1/1     Running   0               6m27s
kube-system   calico-kube-controllers-6d8ccdbf46-pv79z   1/1     Running   22 (2d8h ago)   185d

Ubuntu gremlin install

See "Ubuntu, Debian, etc." in the installation docs

Start the service after installation

$ sudo systemctl start gremlin-integrations
$ systemctl status  gremlin-integrations.service
● gremlin-integrations.service - Gremlin Integration Agent
     Loaded: loaded (/etc/systemd/system/gremlin-integrations.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2021-11-27 14:18:38 JST; 5s ago
   Main PID: 3204038 (gremlin-integra)
      Tasks: 4 (limit: 18951)
     Memory: 984.0K
     CGroup: /system.slice/gremlin-integrations.service
             └─3204038 /usr/sbin/gremlin-integrations

Nov 27 14:18:38 k8s3 systemd[1]: Started Gremlin Integration Agent.
Nov 27 14:18:38 k8s3 gremlin-integrations[3204038]: 2021-11-27 14:18:38 [INFO] gremlin_integrations - Logging initialized

Pricing

https://www.gremlin.com/pricing/?ref=webapp-upgrade-panel

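  • The following vary by plan tier

    • Number of teams
    • Number of agents that can be installed
    • Number of targets that can be attacked
    • Support level
    • Whether enterprise-level features are included (SAML, RBAC, auditing)
  • The Free plan allows up to 2 agents

  • Exceeding the agent limit returns HTTP 402 (the first time I have seen this status)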

$ gremlin check auth
auth
====================================================
Auth Input Type                      : Certificate
API Response                         : HTTP status client error (402 Payment Required) for url (https://api.gremlin.com/v1/clients/status)
========= Identification ============:
Gremlin Identifier                   : xxxx
Gremlin Identifier Source            : Not supplied explicitly, using the default
Team ID                              : xxxx
Team ID Source                       : config/team_id
========= Certificate Info ==========:
Team Certificate                     : Valid X509
Team Certificate Source              : config/team_certificate
Team Certificate Source Type         : loaded from file: "/var/lib/gremlin/team-gremlins-client.pub_cert.pem"
========= Private Key Info ==========:
Team Private Key                     : Valid ECDSA Private Key
Team Private Key Source              : config/team_private_key
Team Certificate Source Type         : loaded from file: "/var/lib/gremlin/team-gremlins-client.priv_key.pem"
Certificate Authorization header     : SIGNATURE ...azhBPQ==
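
The 402 can also be detected from a script. The sketch below pulls the three-digit status out of the `API Response` line exactly as it appears in the output above; `auth_http_error` and `agent_limit_hit` are hypothetical helper names, and the format of a successful response is not assumed.

```shell
# auth_http_error is a hypothetical helper: it reads `gremlin check auth`
# output on stdin and prints the HTTP status found in the "API Response" line,
# assuming the error format shown above ("... error (402 Payment Required) ...").
auth_http_error() {
  sed -n 's/^API Response[^(]*(\([0-9][0-9][0-9]\) .*/\1/p'
}

# Succeeds only when the reported status is 402 (agent limit exceeded).
agent_limit_hit() {
  [ "$(auth_http_error)" = "402" ]
}

# Usage:
# gremlin check auth | agent_limit_hit && echo "agent limit reached"
```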

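Causing failures on Kubernetes

  • Select a failure from Stack

  • Scenarios that combine multiple failures are provided out of the box

High CPU load

  • A scenario that applies one-minute loads back to back; utilization is shown in a graph

CPU usage on the node rises

The gremlin DaemonSet pod uses more than 1 CPU

YAML of a pod from the DaemonSet; extra capabilities are required to inject failures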

$ k get po gremlin-fktfb -ngremlin -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.244.109.80/32
    cni.projectcalico.org/podIPs: 10.244.109.80/32
  creationTimestamp: "2021-11-22T14:31:24Z"
  generateName: gremlin-
  labels:
    app.kubernetes.io/instance: gremlin
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: gremlin
    controller-revision-hash: 6f889cdd44
    helm.sh/chart: gremlin-0.4.7
    pod-template-generation: "1"
    version: v1
  name: gremlin-fktfb
  namespace: gremlin
...
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - k8s2
...
    securityContext:
      capabilities:
        add:
        - KILL
        - NET_ADMIN
        - SYS_BOOT
        - SYS_TIME
        - SYS_ADMIN
        - SYS_PTRACE
        - SETFCAP
        - AUDIT_WRITE
        - MKNOD
        - SYS_CHROOT
        - NET_RAW
...
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
...

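Packet loss

A scenario that causes packet loss repeatedly, each run lasting up to one minute. The failure duration and the packet loss rate are configurable.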


Types of failures

See Attacks

  • Resource
Gremlin	Impact
CPU	Generates high load for one or more CPU cores.
Memory	Allocates a specific amount of RAM.
IO	Puts read/write pressure on I/O devices such as hard disks.
Disk	Writes files to disk to fill it to a specific percentage.
  • State
Gremlin	Impact
Shutdown	Performs a shutdown (and an optional reboot) on the host operating system to test how your system behaves when losing one or more cluster machines.
Time Travel	Changes the host's system time, which can be used to simulate adjusting to daylight saving time and other time-related events.
Process Killer	Kills the specified process, which can be used to simulate application or dependency crashes. (Note: does not work for PID 1, consider a Shutdown attack instead)
  • Network
Gremlin	Impact
Blackhole	Drops all matching network traffic.
Latency	Injects latency into all matching egress network traffic.
Packet Loss	Induces packet loss into all matching egress network traffic.
DNS	Blocks access to DNS servers.

Permissions

kubectl apply -f https://k8s.gremlin.com/resources/gremlin-client.yaml
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-chao.yaml
  • A ClusterRole and a ClusterRoleBinding are required
gremlin-chao.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gremlin
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: chao
    app.kubernetes.io/name: chao
    app.kubernetes.io/version: "1"
  name: chao
  namespace: gremlin
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: chao
      app.kubernetes.io/name: chao
      app.kubernetes.io/version: "1"
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: chao
        app.kubernetes.io/name: chao
        app.kubernetes.io/version: "1"
    spec:
      serviceAccountName: chao
      containers:
      - image: gremlin/chao:latest
        args:
        - "-team_id"
        - "<YOUR TEAM ID GOES HERE>"
        - "-cluster_id"
        - "<YOUR UNIQUE CLUSTER NAME GOES HERE>"
        - "-cert_path"
        - "/gremlin/certs/gremlin.cert"
        - "-key_path"
        - "/gremlin/certs/gremlin.key"
        imagePullPolicy: Always
        name: chao
        volumeMounts:
        - mountPath: /gremlin/certs
          name: gremlin-cert
          readOnly: true
      volumes:
      - name: gremlin-cert
        secret:
          defaultMode: 420
          secretName: gremlin-team-cert
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chao
  namespace: gremlin
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gremlin-watcher
rules:
  - apiGroups: ["apps"]
    resources: ["replicasets", "deployments", "statefulsets", "daemonsets"]
    verbs: ["get", "watch", "list"]
  - apiGroups: [""]
    resources: ["pods", "nodes", "services"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chao
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gremlin-watcher
subjects:
  - kind: ServiceAccount
    name: chao
    namespace: gremlin
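
One way to confirm that the binding works is `kubectl auth can-i` with impersonation of the `chao` service account. The sketch below only prints the commands for every resource and verb listed in the `gremlin-watcher` ClusterRole above, so they can be reviewed before running; `rbac_checks` is a hypothetical helper name.

```shell
# rbac_checks is a hypothetical helper: it prints one `kubectl auth can-i`
# command per verb and resource granted by the gremlin-watcher ClusterRole.
rbac_checks() {
  sa="system:serviceaccount:gremlin:chao"
  for res in pods nodes services \
             replicasets.apps deployments.apps statefulsets.apps daemonsets.apps \
             ingresses.networking.k8s.io; do
    for verb in get watch list; do
      echo "kubectl auth can-i $verb $res --as=$sa"
    done
  done
}

# Usage: review the commands, then execute them against the cluster.
# rbac_checks          # print only
# rbac_checks | sh     # each command answers yes or no
```

Impersonation with `--as` requires that the caller itself is allowed to impersonate service accounts, which is typically true for a cluster admin.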
gremlin-client.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gremlin
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gremlin
  namespace: gremlin
  labels:
    k8s-app: gremlin
    version: v1
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gremlin
  template:
    metadata:
      labels:
        app.kubernetes.io/name: gremlin
    spec:
      # If you want to enable host-level process-killing, add this flag:
      #hostPID: true
      # If you want to enable host-level network attacks, add this flag:
      #hostNetwork: true
      containers:
      - name: gremlin
        image: gremlin/gremlin
        args: [ "daemon" ]
        imagePullPolicy: Always
        securityContext:
          capabilities:
            add:
              - NET_ADMIN
              - SYS_BOOT
              - SYS_TIME
              - KILL
        env:
          - name: GREMLIN_TEAM_ID
            value: <YOUR TEAM ID GOES HERE>
          - name: GREMLIN_TEAM_PRIVATE_KEY_OR_FILE
            value: file:///var/lib/gremlin/cert/gremlin.key
          - name: GREMLIN_TEAM_CERTIFICATE_OR_FILE
            value: file:///var/lib/gremlin/cert/gremlin.cert
          - name: GREMLIN_IDENTIFIER
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        volumeMounts:
          - name: docker-sock
            mountPath: /var/run/docker.sock
          - name: gremlin-state
            mountPath: /var/lib/gremlin
          - name: gremlin-logs
            mountPath: /var/log/gremlin
          - name: shutdown-trigger
            mountPath: /sysrq
          - name: gremlin-cert
            mountPath: /var/lib/gremlin/cert
            readOnly: true
      volumes:
        # Gremlin uses the Docker socket to discover eligible containers to attack,
        # and to launch Gremlin sidecar containers
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
        # The Gremlin daemon communicates with Gremlin sidecars via its state directory.
        # This should be shared with the Kubernetes host
        - name: gremlin-state
          hostPath:
            path: /var/lib/gremlin
        # The Gremlin daemon forwards logs from the Gremlin sidecars to the Gremlin control plane
        # These logs should be shared with the host
        - name: gremlin-logs
          hostPath:
            path: /var/log/gremlin
        # If you want to run shutdown attacks on the host, the Gremlin Daemon requires a /proc/sysrq-trigger:/sysrq mount
        - name: shutdown-trigger
          hostPath:
            path: /proc/sysrq-trigger
        - name: gremlin-cert
          secret:
            secretName: gremlin-team-cert
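
The pod dumped earlier from the Helm-installed DaemonSet adds more capabilities than this plain manifest does. As a rough sketch, the two lists (copied from the YAML in this scrap) can be compared in the shell; `extra_caps` is a hypothetical helper name.

```shell
# extra_caps is a hypothetical helper: it prints every capability in the first
# space-separated list that does not appear in the second one.
extra_caps() {
  for cap in $1; do
    case " $2 " in
      *" $cap "*) ;;                 # present in both lists
      *) printf '%s\n' "$cap" ;;     # only in the first list
    esac
  done
}

# Capabilities copied from the two YAML listings in this scrap.
helm_caps="KILL NET_ADMIN SYS_BOOT SYS_TIME SYS_ADMIN SYS_PTRACE SETFCAP AUDIT_WRITE MKNOD SYS_CHROOT NET_RAW"
manifest_caps="NET_ADMIN SYS_BOOT SYS_TIME KILL"

extra_caps "$helm_caps" "$manifest_caps"
```

This prints the seven capabilities that appear only in the Helm-deployed pod: SYS_ADMIN, SYS_PTRACE, SETFCAP, AUDIT_WRITE, MKNOD, SYS_CHROOT, and NET_RAW.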

Causing failures on the host OS

High CPU load

The gremlin daemon is generating the load

top - 15:43:17 up 3 min,  1 user,  load average: 1.45, 0.50, 0.19
Tasks: 177 total,   2 running, 175 sleeping,   0 stopped,   0 zombie
%Cpu0  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 99.3 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 97.0 us,  3.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 98.7 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15881.3 total,  14453.9 free,    371.3 used,   1056.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  15223.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   3100 gremlin   20   0   32812   9600   8528 S 392.7   0.1   1:21.72 gremlin
   1529 gremlin   20   0   24268   9384   8036 S   2.0   0.1   0:00.47 gremlind
  • Unsurprisingly, the load is not visible as a workload on k8s; only the node-level figure reflects it
Every 2.0s: kubectl top node                                                                              k8s1: Sat Nov 27 15:45:49 2021

NAME   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s1   322m         8%     2989Mi          38%
k8s2   90m          2%     2048Mi          12%
k8s3   3201m        80%    560Mi           3%
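
To flag the loaded node automatically during an experiment, the `kubectl top node` output can be filtered by its CPU% column. This is a sketch under the assumption that the columns match the listing above; `hot_nodes` is a hypothetical helper name.

```shell
# hot_nodes is a hypothetical helper: it reads `kubectl top node` output on
# stdin and prints the nodes whose CPU% is at or above the threshold in $1.
hot_nodes() {
  awk -v limit="$1" '
    NF >= 5 && $1 != "NAME" {
      pct = $3
      sub(/%$/, "", pct)             # "80%" -> "80"
      if (pct + 0 >= limit + 0) print $1
    }'
}

# Usage:
# kubectl top node | hot_nodes 50
```

With the values shown above and a threshold of 50, only `k8s3` is printed.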

Packet loss

Drop 80% of packets.
The second run, taken with the failure injected, could not be measured.

$ iperf -c k8s2
------------------------------------------------------------
Client connecting to k8s2, TCP port 5001
TCP window size:  332 KByte (default)
------------------------------------------------------------
[  3] local 192.168.5.112 port 40836 connected with 192.168.5.111 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   935 Mbits/sec
$ iperf -c k8s2
connect failed: Operation now in progress
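
When the measurement is repeated before and during an attack, it helps to reduce each iperf run to a single value. The sketch below extracts the bandwidth from the summary line in the format shown above and reports `unmeasurable` when no such line exists; `iperf_bandwidth` is a hypothetical helper name.

```shell
# iperf_bandwidth is a hypothetical helper: it reads iperf client output on
# stdin and prints the bandwidth of the last summary line, or "unmeasurable"
# when the run produced none (for example, when the connection failed).
iperf_bandwidth() {
  awk '
    / sec / && /bits\/sec/ { bw = $(NF - 1) " " $NF }
    END { if (bw == "") print "unmeasurable"; else print bw }'
}

# Usage: merge stderr so that a failed run still passes through the filter.
# iperf -c k8s2 2>&1 | iperf_bandwidth
```

For the two runs above this gives `935 Mbits/sec` and `unmeasurable` respectively.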