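Chaos Engineering with Gremlin
Goals
- Be able to install Gremlin
- Understand which failures it can inject
- Be able to verify behavior under failure
- Be able to run it across multiple clusters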
URL
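Setup
Account registration
- Register as an individual
  https://www.gremlin.com/ -> GET STARTED -> My Self
- Once registration completes, the dashboard becomes available
  https://app.gremlin.com/dashboard
- Watching the 2-minute intro video gives a good overview
- Failures at various layers can be injected from the GUI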
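Installation
Supported environments
- Can be installed on IaaS, Docker, k8s, and Windows
The Compatibility page has a comparison table of the failures that can be injected in each environment.
Install the agent at the layer where you want to inject failures.
For processing that depends on the OS, or to cause OS-level failures, the agent must be installed on the OS itself.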
helm install
- Since the target is k8s, use Helm
brew install helm
Kubernetes gremlin install
- Set the team name (team ID) and the cluster ID
- Also create GREMLIN_TEAM_SECRET beforehand
  Member icon (top right) -> Team Settings -> Configuration -> Secret Key
export GREMLIN_TEAM_ID="<id>"
export GREMLIN_CLUSTER_ID="gremlins"
export GREMLIN_TEAM_SECRET="<your secret>"
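Before running helm, it can help to confirm that the three variables are really set and no longer hold `<id>`-style placeholders. The following is a minimal sketch; `check_gremlin_env` is a hypothetical helper name, not part of Gremlin or Helm.

```shell
# Hypothetical pre-flight check (not a Gremlin or Helm command).
# Exits non-zero if any required variable is empty or still looks like
# a "<...>" placeholder. The body runs in a subshell, so the temporary
# variables do not leak into the calling shell.
check_gremlin_env() (
  missing=0
  for name in GREMLIN_TEAM_ID GREMLIN_CLUSTER_ID GREMLIN_TEAM_SECRET; do
    # Indirect lookup of the variable named in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    case "$value" in
      '' | '<'*'>')
        echo "missing or placeholder: $name" >&2
        missing=1
        ;;
    esac
  done
  exit "$missing"
)
```

It could be chained in front of the install, for example `check_gremlin_env && helm install ...`. Note that the `gremlin/gremlin` chart also has to be known to Helm (via `helm repo add`) and the `gremlin` namespace has to exist; neither step is recorded in these notes.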
Run helm
$ helm install gremlin gremlin/gremlin \
    --namespace gremlin \
    --set gremlin.secret.managed=true \
    --set gremlin.secret.type=secret \
    --set gremlin.secret.teamID="$GREMLIN_TEAM_ID" \
    --set gremlin.secret.clusterID="$GREMLIN_CLUSTER_ID" \
    --set gremlin.secret.teamSecret="$GREMLIN_TEAM_SECRET"
NAME: gremlin
LAST DEPLOYED: Mon Nov 22 23:31:24 2021
NAMESPACE: gremlin
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ helm ls --namespace gremlin
NAME     NAMESPACE  REVISION  UPDATED                              STATUS    CHART          APP VERSION
gremlin  gremlin    1         2021-11-22 23:31:24.30843 +0900 JST  deployed  gremlin-0.4.7  2.16.2
gremlin GUI
The worker nodes appear in the GUI
Pods are deployed on the worker nodes
$ kubectl get po -A
NAMESPACE    NAME                                      READY  STATUS   RESTARTS       AGE
gremlin      chao-645585ddd-t84jk                      1/1    Running  0              6m27s
gremlin      gremlin-fktfb                             1/1    Running  0              6m27s
gremlin      gremlin-mk87g                             1/1    Running  0              6m27s
kube-system  calico-kube-controllers-6d8ccdbf46-pv79z  1/1    Running  22 (2d8h ago)  185d
Ubuntu gremlin install
After installing, start the service
$ sudo systemctl start gremlin-integrations
$ systemctl status gremlin-integrations.service
● gremlin-integrations.service - Gremlin Integration Agent
Loaded: loaded (/etc/systemd/system/gremlin-integrations.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2021-11-27 14:18:38 JST; 5s ago
Main PID: 3204038 (gremlin-integra)
Tasks: 4 (limit: 18951)
Memory: 984.0K
CGroup: /system.slice/gremlin-integrations.service
└─3204038 /usr/sbin/gremlin-integrations
Nov 27 14:18:38 k8s3 systemd[1]: Started Gremlin Integration Agent.
Nov 27 14:18:38 k8s3 gremlin-integrations[3204038]: 2021-11-27 14:18:38 [INFO] gremlin_integrations - Logging initialized
Pricing
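- The following vary by plan tier
  - Number of teams
  - Number of agents that can be installed
  - Number of targets that can be attacked
  - Support level
  - Whether enterprise-level support is included (SAML, RBAC, audit)
- The Free plan allows up to 2 agents
- Exceeding the agent limit returns HTTP 402 (the first time I have seen this status)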
$ gremlin check auth
auth
====================================================
Auth Input Type : Certificate
API Response : HTTP status client error (402 Payment Required) for url (https://api.gremlin.com/v1/clients/status)
========= Identification ============:
Gremlin Identifier : xxxx
Gremlin Identifier Source : Not supplied explicitly, using the default
Team ID : xxxx
Team ID Source : config/team_id
========= Certificate Info ==========:
Team Certificate : Valid X509
Team Certificate Source : config/team_certificate
Team Certificate Source Type : loaded from file: "/var/lib/gremlin/team-gremlins-client.pub_cert.pem"
========= Private Key Info ==========:
Team Private Key : Valid ECDSA Private Key
Team Private Key Source : config/team_private_key
Team Certificate Source Type : loaded from file: "/var/lib/gremlin/team-gremlins-client.priv_key.pem"
Certificate Authorization header : SIGNATURE ...azhBPQ==
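The 402 is easy to miss in the long output, so a small filter can pull it out. This is a sketch that assumes the "API Response" line keeps the format shown above, with the status code in parentheses; `api_response_status` and `agent_limit_exceeded` are hypothetical helper names.

```shell
# Hypothetical helpers that read `gremlin check auth` output on stdin.

# Prints the 3-digit HTTP status from the "API Response" line, or nothing
# if that line carries no "(NNN ...)" status.
api_response_status() {
  sed -n 's/^ *API Response *: .*(\([0-9][0-9][0-9]\) .*/\1/p' | head -n 1
}

# Succeeds only when the reported status is 402 (Payment Required).
agent_limit_exceeded() {
  [ "$(api_response_status)" = "402" ]
}
```

Usage would look like `gremlin check auth | agent_limit_exceeded && echo "agent limit reached"`.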
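Injecting failures on Kubernetes
- Select a failure from the Stack
- Scenarios that combine several failures are provided out of the box
High CPU load
- A scenario that applies one-minute loads back to back; utilization is shown as a graph
The node's CPU usage rises
The gremlin DaemonSet pod exceeds 1 CPU
YAML of the DaemonSet pod; extra capabilities are required in order to inject failures
$ kubectl get po gremlin-fktfb -n gremlin -o yaml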
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.244.109.80/32
    cni.projectcalico.org/podIPs: 10.244.109.80/32
  creationTimestamp: "2021-11-22T14:31:24Z"
  generateName: gremlin-
  labels:
    app.kubernetes.io/instance: gremlin
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: gremlin
    controller-revision-hash: 6f889cdd44
    helm.sh/chart: gremlin-0.4.7
    pod-template-generation: "1"
    version: v1
  name: gremlin-fktfb
  namespace: gremlin
  ...
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - k8s2
  ...
    securityContext:
      capabilities:
        add:
        - KILL
        - NET_ADMIN
        - SYS_BOOT
        - SYS_TIME
        - SYS_ADMIN
        - SYS_PTRACE
        - SETFCAP
        - AUDIT_WRITE
        - MKNOD
        - SYS_CHROOT
        - NET_RAW
  ...
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  ...
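To see at a glance which capabilities a pod adds, the `add:` list can be filtered out of the YAML. A rough sketch using awk rather than a real YAML parser, so it assumes the list items directly follow an `add:` line as in the output above; `list_added_capabilities` is a hypothetical helper name.

```shell
# Hypothetical helper: reads pod or manifest YAML on stdin and prints the
# capability names listed under any "add:" key, sorted alphabetically.
# Line-based matching only; this is not a general YAML parser.
list_added_capabilities() {
  awk '
    /^[ \t]*add:[ \t]*$/ { in_add = 1; next }
    in_add && /^[ \t]*-[ \t]+[A-Z_]+[ \t]*$/ { print $2; next }
    { in_add = 0 }
  ' | sort
}
```

Usage would be `kubectl get po gremlin-fktfb -n gremlin -o yaml | list_added_capabilities`. In these notes, the Helm-deployed pod adds eleven capabilities, while gremlin-client.yaml adds only NET_ADMIN, SYS_BOOT, SYS_TIME, and KILL.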
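Packet loss
A scenario that triggers packet loss repeatedly, for up to one minute each time. The duration and the packet loss percentage are configurable.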
Types of failures
See Attacks
- Resource
Gremlin  Impact
CPU      Generates high load for one or more CPU cores.
Memory   Allocates a specific amount of RAM.
IO       Puts read/write pressure on I/O devices such as hard disks.
Disk     Writes files to disk to fill it to a specific percentage.
- State
Gremlin         Impact
Shutdown        Performs a shutdown (and an optional reboot) on the host operating system to test how your system behaves when losing one or more cluster machines.
Time Travel     Changes the host's system time, which can be used to simulate adjusting to daylight saving time and other time-related events.
Process Killer  Kills the specified process, which can be used to simulate application or dependency crashes. (Note: does not work for PID 1, consider a Shutdown attack instead)
- Network
Gremlin      Impact
Blackhole    Drops all matching network traffic.
Latency      Injects latency into all matching egress network traffic.
Packet Loss  Induces packet loss into all matching egress network traffic.
DNS          Blocks access to DNS servers.
Permissions
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-client.yaml
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-chao.yaml
- A ClusterRole and a ClusterRoleBinding are required
gremlin-chao.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gremlin
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: chao
    app.kubernetes.io/name: chao
    app.kubernetes.io/version: "1"
  name: chao
  namespace: gremlin
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: chao
      app.kubernetes.io/name: chao
      app.kubernetes.io/version: "1"
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: chao
        app.kubernetes.io/name: chao
        app.kubernetes.io/version: "1"
    spec:
      serviceAccountName: chao
      containers:
      - image: gremlin/chao:latest
        args:
        - "-team_id"
        - "<YOUR TEAM ID GOES HERE>"
        - "-cluster_id"
        - "<YOUR UNIQUE CLUSTER NAME GOES HERE>"
        - "-cert_path"
        - "/gremlin/certs/gremlin.cert"
        - "-key_path"
        - "/gremlin/certs/gremlin.key"
        imagePullPolicy: Always
        name: chao
        volumeMounts:
        - mountPath: /gremlin/certs
          name: gremlin-cert
          readOnly: true
      volumes:
      - name: gremlin-cert
        secret:
          defaultMode: 420
          secretName: gremlin-team-cert
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chao
  namespace: gremlin
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gremlin-watcher
rules:
- apiGroups: ["apps"]
  resources: ["replicasets", "deployments", "statefulsets", "daemonsets"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods", "nodes", "services"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chao
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gremlin-watcher
subjects:
- kind: ServiceAccount
  name: chao
  namespace: gremlin
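Whether the binding above took effect can be checked with `kubectl auth can-i` while impersonating the chao ServiceAccount. The sketch below only prints the commands, one per verb/resource pair in the gremlin-watcher ClusterRole, so they can be reviewed before running; `print_rbac_checks` is a hypothetical helper name, and running the printed commands assumes your own user is allowed to impersonate service accounts.

```shell
# Hypothetical helper: prints (does not run) one `kubectl auth can-i`
# command per verb/resource pair granted by the gremlin-watcher ClusterRole.
print_rbac_checks() (
  sa="system:serviceaccount:gremlin:chao"
  for resource in \
    replicasets.apps deployments.apps statefulsets.apps daemonsets.apps \
    pods nodes services ingresses.networking.k8s.io
  do
    for verb in get watch list; do
      echo "kubectl auth can-i $verb $resource --as=$sa"
    done
  done
)
```

Piping the output to a shell (`print_rbac_checks | sh`) runs the checks; each one answers yes or no.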
gremlin-client.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gremlin
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gremlin
  namespace: gremlin
  labels:
    k8s-app: gremlin
    version: v1
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gremlin
  template:
    metadata:
      labels:
        app.kubernetes.io/name: gremlin
    spec:
      # If you want to enable host-level process-killing, add this flag:
      #hostPID: true
      # If you want to enable host-level network attacks, add this flag:
      #hostNetwork: true
      containers:
      - name: gremlin
        image: gremlin/gremlin
        args: [ "daemon" ]
        imagePullPolicy: Always
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - SYS_BOOT
            - SYS_TIME
            - KILL
        env:
        - name: GREMLIN_TEAM_ID
          value: <YOUR TEAM ID GOES HERE>
        - name: GREMLIN_TEAM_PRIVATE_KEY_OR_FILE
          value: file:///var/lib/gremlin/cert/gremlin.key
        - name: GREMLIN_TEAM_CERTIFICATE_OR_FILE
          value: file:///var/lib/gremlin/cert/gremlin.cert
        - name: GREMLIN_IDENTIFIER
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: docker-sock
          mountPath: /var/run/docker.sock
        - name: gremlin-state
          mountPath: /var/lib/gremlin
        - name: gremlin-logs
          mountPath: /var/log/gremlin
        - name: shutdown-trigger
          mountPath: /sysrq
        - name: gremlin-cert
          mountPath: /var/lib/gremlin/cert
          readOnly: true
      volumes:
      # Gremlin uses the Docker socket to discover eligible containers to attack,
      # and to launch Gremlin sidecar containers
      - name: docker-sock
        hostPath:
          path: /var/run/docker.sock
      # The Gremlin daemon communicates with Gremlin sidecars via its state directory.
      # This should be shared with the Kubernetes host
      - name: gremlin-state
        hostPath:
          path: /var/lib/gremlin
      # The Gremlin daemon forwards logs from the Gremlin sidecars to the Gremlin control plane
      # These logs should be shared with the host
      - name: gremlin-logs
        hostPath:
          path: /var/log/gremlin
      # If you want to run shutdown attacks on the host, the Gremlin Daemon requires a /proc/sysrq-trigger:/sysrq mount
      - name: shutdown-trigger
        hostPath:
          path: /proc/sysrq-trigger
      - name: gremlin-cert
        secret:
          secretName: gremlin-team-cert
Injecting failures on the host OS
High CPU load
Gremlin is generating the load (in top, the gremlin process sits at 392.7% CPU)
top - 15:43:17 up 3 min, 1 user, load average: 1.45, 0.50, 0.19
Tasks: 177 total, 2 running, 175 sleeping, 0 stopped, 0 zombie
%Cpu0 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 99.3 us, 0.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 97.0 us, 3.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 98.7 us, 1.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15881.3 total, 14453.9 free, 371.3 used, 1056.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 15223.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3100 gremlin 20 0 32812 9600 8528 S 392.7 0.1 1:21.72 gremlin
1529 gremlin 20 0 24268 9384 8036 S 2.0 0.1 0:00.47 gremlind
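- Unsurprisingly, the load does not show up as a k8s workload; only the node-level CPU in kubectl top node rises (k8s3 at 80%)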
Every 2.0s: kubectl top node k8s1: Sat Nov 27 15:45:49 2021
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
k8s1 322m 8% 2989Mi 38%
k8s2 90m 2% 2048Mi 12%
k8s3 3201m 80% 560Mi 3%
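To spot the loaded node without reading the table by eye, the `kubectl top node` output can be filtered by its CPU% column. A minimal sketch, assuming the five-column layout shown above; `busy_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl top node` output on stdin and prints
# the names of nodes whose CPU% is at or above the given threshold.
# Lines without a numeric percentage in column 3 (the header, or the
# banner line that `watch` adds) are skipped.
busy_nodes() {
  awk -v limit="$1" '
    $1 == "NAME" { next }
    { pct = $3; sub(/%$/, "", pct) }
    pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0 { print $1 }
  '
}
```

For example, `kubectl top node | busy_nodes 50` would print only k8s3 for the output above.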
Packet loss
Drop 80% of packets
The second run, taken while the failure was active, could not be measured.
$ iperf -c k8s2
------------------------------------------------------------
Client connecting to k8s2, TCP port 5001
TCP window size: 332 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.5.112 port 40836 connected with 192.168.5.111 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.09 GBytes 935 Mbits/sec
$ iperf -c k8s2
connect failed: Operation now in progress
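To compare runs before and during the attack, the bandwidth figure can be extracted from the iperf summary. A sketch that assumes the result is reported in Mbits/sec as in the first run above (other units are ignored); `iperf_mbits` is a hypothetical helper name.

```shell
# Hypothetical helper: reads iperf client output on stdin and prints the
# bandwidth of the last line that ends in "Mbits/sec".
# Prints nothing if no such line exists, e.g. when the connection failed.
iperf_mbits() {
  awk '
    $NF == "Mbits/sec" { value = $(NF - 1) }
    END { if (value != "") print value }
  '
}
```

Usage would be `iperf -c k8s2 | iperf_mbits`. An empty result then corresponds to the "connect failed" case seen in the second run.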