📦

Notes on kind (Kubernetes IN Docker) and CNI

Published 2020/11/24

Kubernetes

container

tech

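Background

When I started studying for the CKS, I came across a repository by Walid Shaari of Red Hat via Twitter.

For some reason, that repository includes a video that builds a local Kubernetes development environment with kind, deliberately leaving out the CNI, and then explains the resulting state.

I got curious and ended up watching the whole thing. On its own it would have been just a "huh, interesting" moment, so I am recording it here together with my own notes.

Expect plenty of tangents as my curiosity wanders.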

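What is kind?

kind (Kubernetes IN Docker) is a tool for running local Kubernetes clusters using Docker containers as nodes.

It is handy in several ways: you can use it as a development environment, enable feature gates to try out new features that have not yet reached the managed Kubernetes offerings of cloud vendors, and, because everything is built on Docker, delete it cleanly when you are done. Aoyama-san, the author of "Kubernetes完全ガイド" (The Complete Guide to Kubernetes), is fond of it.

In this post I use kind to build a Kubernetes cluster with the CNI left out.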

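What is CNI?

One of the three major interfaces of Kubernetes*. It abstracts container networking and carves it out as an interface specification. Many software vendors provide CNI-compliant network plugins, which underpin Kubernetes's rich ecosystem.

My personal favorite is Cilium, which builds its network stack on eBPF. It was recently adopted for GKE Dataplane V2, the eBPF-based dataplane for GKE, and I think it is a hot piece of software along with eBPF itself.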

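The three major interfaces of Kubernetes

This is just my own name for them. It refers to:

CRI: Container Runtime Interface
- The interface specification for the container runtime that runs containers
CNI: Container Network Interface
- The interface specification for container networking
CSI: Container Storage Interface
- The interface specification for container storage

These specifications are key to Kubernetes's extensibility: by installing plugins that conform to them, you can build a container orchestration environment of your own, tailored to the fine-grained requirements of your workloads.

A sibling called Container Object Storage Interface looks likely to arrive soon.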

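What I like about eBPF

When someone asks what is so good about eBPF, I show them the diagram on the Wikipedia page for Netfilter, the Linux kernel's networking framework.

The upper part outlines iptables processing, which is still the norm in Kubernetes. Over the years it has grown into something like Howl's Moving Castle. The path at the very bottom, taking a shortcut like in Super Mario, is eBPF. In short, eBPF is fast (a gross oversimplification).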

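Building Kubernetes without a CNI

Prerequisites

Install kind by following its installation guide.
Install ctx, the kubectl context-switching plugin that is indispensable when working with multiple Kubernetes clusters, via krew.

Steps

Save the following configuration file as kind-1m3w-nocni.yaml.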

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker

Create a cluster named test with the kind command.

$ kind create cluster --config kind-1m3w-nocni.yaml --name test
Creating cluster "test" ...
 ✓ Ensuring node image (kindest/node:v1.19.1) 🖼
 ✓ Preparing nodes 📦 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing StorageClass 💾
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-test"
You can now use your cluster with:

kubectl cluster-info --context kind-test

Not sure what to do next? 😅  Check out https://kind.sigs.k8s.io/docs/user/quick-start/

Switch the kubectl context to the test cluster.

$ kubectl ctx kind-test
Switched to context "kind-test".

Run a kubectl command to confirm that the cluster is reachable.

$ kubectl cluster-info
Kubernetes master is running at https://127.0.0.1:44495
KubeDNS is running at https://127.0.0.1:44495/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

This cluster is not usable as Kubernetes yet

kubectl can reach the cluster, but it cannot be used as Kubernetes in this state.
Let's try deploying an nginx Pod.

$ kubectl run nginx --image=nginx --generator=run-pod/v1
pod/nginx created

Did it deploy? Let's take a casual look at the Pod.

$ kubectl get po
NAME    READY   STATUS    RESTARTS   AGE
nginx   0/1     Pending   0          7s

The STATUS stays Pending. Why?
Let's look for the reason with kubectl describe po nginx.

$ kubectl describe po nginx
Name:         nginx
(略)
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  67s (x2 over 67s)  default-scheduler  0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
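The FailedScheduling event above can be read as a taint check that every node fails. Here is a minimal sketch in Go of that idea. It is a simplified model, not the scheduler's actual code: the Taint and Node types and the feasibleNodes helper are names I made up for illustration, and real tolerations also match on operators, values and effects.

```go
package main

import "fmt"

// Taint is a simplified stand-in for a Kubernetes node taint (key and effect only).
type Taint struct {
	Key    string
	Effect string
}

// Node is a hypothetical, minimal node model: a name plus its taints.
type Node struct {
	Name   string
	Taints []Taint
}

// tolerates reports whether every NoSchedule taint on the node is tolerated by key.
func tolerates(n Node, tolerated map[string]bool) bool {
	for _, t := range n.Taints {
		if t.Effect == "NoSchedule" && !tolerated[t.Key] {
			return false
		}
	}
	return true
}

// feasibleNodes returns the names of the nodes a Pod could be scheduled on.
func feasibleNodes(nodes []Node, tolerated map[string]bool) []string {
	var out []string
	for _, n := range nodes {
		if tolerates(n, tolerated) {
			out = append(out, n.Name)
		}
	}
	return out
}

func main() {
	master := Taint{"node-role.kubernetes.io/master", "NoSchedule"}
	notReady := Taint{"node.kubernetes.io/not-ready", "NoSchedule"}
	nodes := []Node{
		{"test-control-plane", []Taint{master, notReady}},
		{"test-worker", []Taint{notReady}},
		{"test-worker2", []Taint{notReady}},
		{"test-worker3", []Taint{notReady}},
	}
	// The nginx Pod declares no tolerations, so the tolerated set is empty (nil).
	fmt.Printf("%d/%d nodes are available\n", len(feasibleNodes(nodes, nil)), len(nodes))
}
```

With no tolerations the sketch prints "0/4 nodes are available", which mirrors the start of the scheduler's message.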

There seems to be no node that can accept the Pod.
One is the master node, and the other three carry the {node.kubernetes.io/not-ready: } taint... not ready?
Let's check the nodes.

$ kubectl get nodes -o wide -A
NAME                 STATUS     ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                     KERNEL-VERSION                CONTAINER-RUNTIME
test-control-plane   NotReady   master   20m   v1.19.1   172.24.0.5    <none>        Ubuntu Groovy Gorilla (development branch)   4.19.104-microsoft-standard   containerd://1.4.0
test-worker          NotReady   <none>   19m   v1.19.1   172.24.0.2    <none>        Ubuntu Groovy Gorilla (development branch)   4.19.104-microsoft-standard   containerd://1.4.0
test-worker2         NotReady   <none>   19m   v1.19.1   172.24.0.3    <none>        Ubuntu Groovy Gorilla (development branch)   4.19.104-microsoft-standard   containerd://1.4.0
test-worker3         NotReady   <none>   19m   v1.19.1   172.24.0.4    <none>        Ubuntu Groovy Gorilla (development branch)   4.19.104-microsoft-standard   containerd://1.4.0

The nodes are wiped out.
Every node has STATUS NotReady.

A sense of helplessness sets in at the sight of this disaster.

Why does kind-test-cluster stay NotReady?
And why can I still use the kubectl command even though the nodes are NotReady?

Why does the cluster stay NotReady?

Fortunately, the kubectl command is still alive.
At times like this, kubectl describe is the tool to reach for.

Let's examine the control-plane node.

$ kubectl describe nodes kind-control-plane
Name:               kind-control-plane
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=kind-control-plane
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 23 Nov 2020 12:25:14 +0900
Taints:             node-role.kubernetes.io/master:NoSchedule
                    node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  kind-control-plane
  AcquireTime:     <unset>
  RenewTime:       Mon, 23 Nov 2020 13:04:27 +0900
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 23 Nov 2020 13:00:36 +0900   Mon, 23 Nov 2020 12:25:10 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 23 Nov 2020 13:00:36 +0900   Mon, 23 Nov 2020 12:25:10 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 23 Nov 2020 13:00:36 +0900   Mon, 23 Nov 2020 12:25:10 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Mon, 23 Nov 2020 13:00:36 +0900   Mon, 23 Nov 2020 12:25:10 +0900   KubeletNotReady              runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Addresses:
  InternalIP:  172.24.0.5
  Hostname:    kind-control-plane
Capacity:
  cpu:                8
  ephemeral-storage:  263174212Ki
  hugepages-2Mi:      0
  memory:             13028732Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  263174212Ki
  hugepages-2Mi:      0
  memory:             13028732Ki
  pods:               110
System Info:
  Machine ID:                 2f7eb291060449f49af9640c2ba847e3
  System UUID:                2f7eb291060449f49af9640c2ba847e3
  Boot ID:                    6af2a095-3be7-4337-ae50-00f4736584eb
  Kernel Version:             4.19.104-microsoft-standard
  OS Image:                   Ubuntu Groovy Gorilla (development branch)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.0
  Kubelet Version:            v1.19.1
  Kube-Proxy Version:         v1.19.1
PodCIDR:                      10.244.0.0/24
PodCIDRs:                     10.244.0.0/24
ProviderID:                   kind://docker/kind/kind-control-plane
Non-terminated Pods:          (5 in total)
  Namespace                   Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                          ------------  ----------  ---------------  -------------  ---
  kube-system                 etcd-kind-control-plane                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         39m
  kube-system                 kube-apiserver-kind-control-plane             250m (3%)     0 (0%)      0 (0%)           0 (0%)         39m
  kube-system                 kube-controller-manager-kind-control-plane    200m (2%)     0 (0%)      0 (0%)           0 (0%)         39m
  kube-system                 kube-proxy-lhp2l                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         39m
  kube-system                 kube-scheduler-kind-control-plane             100m (1%)     0 (0%)      0 (0%)           0 (0%)         39m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                550m (6%)  0 (0%)
  memory             0 (0%)     0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
Events:
  Type    Reason                   Age                From                            Message
  ----    ------                   ----               ----                            -------
  Normal  NodeHasSufficientMemory  39m (x6 over 39m)  kubelet, kind-control-plane     Node kind-control-plane status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    39m (x5 over 39m)  kubelet, kind-control-plane     Node kind-control-plane status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     39m (x5 over 39m)  kubelet, kind-control-plane     Node kind-control-plane status is now: NodeHasSufficientPID
  Normal  Starting                 39m                kubelet, kind-control-plane     Starting kubelet.
  Normal  NodeHasSufficientMemory  39m                kubelet, kind-control-plane     Node kind-control-plane status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    39m                kubelet, kind-control-plane     Node kind-control-plane status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     39m                kubelet, kind-control-plane     Node kind-control-plane status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  39m                kubelet, kind-control-plane     Updated Node Allocatable limit across pods
  Normal  Starting                 38m                kube-proxy, kind-control-plane  Starting kube-proxy.

Some rather confusing log entries are mixed in, but the one to focus on is this warning:

runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

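This finally connects to the beginning. The CNI plugin is not initialized (because none is installed). Well, of course: I built Kubernetes without specifying a CNI.

Where does this warning come from?
Let's reason from the Kubernetes architecture diagram.

The key is the kubelet. The kubelet runs on every Kubernetes node and manages containers (starting, stopping, and so on) by working with the container runtime, the environment that actually runs them.

The warning is emitted by the kubelet. However, the kubelet is only relaying the message it received from the container runtime's plugin. To understand why a CNI plugin is required, we have to look at the container runtime that kind uses.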

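kind's container runtime

As explained at the beginning, kind builds Kubernetes using Docker containers as nodes. So when a Pod runs on kind, where do its containers actually run? The answer lies in the base image for kind's node containers.

It shows that kind launches an Ubuntu container on Docker and runs the containerd container runtime inside it.
A Pod deployed to kind therefore runs on containerd, inside an Ubuntu container, on top of Docker.

Incidentally, early in kind's development the nodes ran Docker rather than containerd.
It was later switched to containerd for being "lighter, faster, easier to debug". I do not dislike the ring of "dind", but such is the tide of the times.

Back to the topic: the error saying the CNI plugin is not initialized appears to come from this container runtime.

Searching for the warning text "cni plugin not initialized" leads to go-cni, a library used inside containerd. (A commit in containerd/containerd also turns up, but that is old source code from before the repository was split.)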

var (
	ErrCNINotInitialized = errors.New("cni plugin not initialized")
	ErrInvalidConfig     = errors.New("invalid cni config")
	ErrNotFound          = errors.New("not found")
	ErrRead              = errors.New("failed to read config file")
	ErrInvalidResult     = errors.New("invalid result")
	ErrLoad              = errors.New("failed to load cni config")
)
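Package-level error values like these are usually compared with errors.Is, so a caller can still recognize them after wrapping. The following is a small sketch of that pattern. Only the variable name and its message come from the excerpt above; the status and isNotInitialized functions are hypothetical callers I added for illustration, not go-cni code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCNINotInitialized reuses the name and message from the go-cni excerpt.
var ErrCNINotInitialized = errors.New("cni plugin not initialized")

// status is a hypothetical caller that wraps the sentinel error with context.
func status(initialized bool) error {
	if !initialized {
		return fmt.Errorf("network status check: %w", ErrCNINotInitialized)
	}
	return nil
}

// isNotInitialized recognizes the sentinel even when it has been wrapped.
func isNotInitialized(err error) bool {
	return errors.Is(err, ErrCNINotInitialized)
}

func main() {
	err := status(false)
	fmt.Println(err)                   // network status check: cni plugin not initialized
	fmt.Println(isNotInitialized(err)) // true
}
```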

Where is ErrCNINotInitialized actually used?
Digging further leads to the containerd/cri repository. (Unlike go-cni, this one was merged into the containerd/containerd repository in October 2020, but I will explain using cri.)

	// Check the status of the cni initialization
	if err := c.netPlugin.Status(); err != nil {
		networkCondition.Status = false
		networkCondition.Reason = networkNotReadyReason
		networkCondition.Message = fmt.Sprintf("Network plugin returns error: %v", err)
	}

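What is the purpose of the containerd/cri repository?

The image comes from its README. containerd serves as the container runtime that creates and runs containers, and within it containerd/cri acts as the CRI plugin that connects the kubelet to containerd. The error is returned from the Status() method in that plugin.

When the kubelet checked containerd's status, containerd returned an error, so the node became KubeletNotReady and unusable.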
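The whole chain, from the library error to the message seen in kubectl describe, can be sketched as a toy model. This is not the actual containerd or kubelet code: netPlugin, condition, networkCondition and nodeReadyMessage are names I invented, and the final format string is simply my reconstruction of the message shown in the node conditions.

```go
package main

import (
	"errors"
	"fmt"
)

// errCNINotInitialized mirrors the message of go-cni's ErrCNINotInitialized.
var errCNINotInitialized = errors.New("cni plugin not initialized")

// netPlugin is a hypothetical stand-in for the CNI library handle held by the CRI plugin.
type netPlugin struct{ configLoaded bool }

// Status fails until a CNI configuration has been loaded.
func (p netPlugin) Status() error {
	if !p.configLoaded {
		return errCNINotInitialized
	}
	return nil
}

// condition is a simplified runtime condition as reported to the kubelet.
type condition struct {
	Type    string
	Status  bool
	Reason  string
	Message string
}

// networkCondition follows the shape of the containerd/cri excerpt above.
func networkCondition(p netPlugin) condition {
	c := condition{Type: "NetworkReady", Status: true}
	if err := p.Status(); err != nil {
		c.Status = false
		c.Reason = "NetworkPluginNotReady"
		c.Message = fmt.Sprintf("Network plugin returns error: %v", err)
	}
	return c
}

// nodeReadyMessage is a toy version of the kubelet side; an empty string means Ready.
func nodeReadyMessage(c condition) string {
	if c.Status {
		return ""
	}
	return fmt.Sprintf("runtime network not ready: %s=%t reason:%s message:%s",
		c.Type, c.Status, c.Reason, c.Message)
}

func main() {
	// No CNI configuration is loaded, as in the cluster built without a CNI.
	fmt.Println(nodeReadyMessage(networkCondition(netPlugin{configLoaded: false})))
}
```

Run as is, the sketch prints the same text as the Ready condition message in the describe output.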

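Now we know where the error log comes from. Still, some questions remain.

Why does cri require the CNI plugin to be initialized? And why does a failed CNI initialization also put containerd's status into an error state? (Why not just ignore networking and start anyway? My gut says that would be bad, but...)

The answer was in the CNI interface specification.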

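The CNI interface specification

As explained at the beginning, CNI is "one of the three major interfaces of Kubernetes*, which abstracts container networking and carves it out as an interface specification". The specification itself can be found at the location below.

Kubernetes network plugins are built by following this specification.
Anything that follows the specification works, so a plugin can apparently even be written in bash.

An easy-to-follow explanatory article by @hichihara => CNCF CNI プラグイン (CNCF CNI plugins)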

Kubernetes Networking: How to Write Your Own CNI Plug-in with Bash
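To give a feel for how small the contract is, here is a sketch of a plugin skeleton in Go rather than bash. It relies on my understanding of the specification: the runtime passes the operation in the CNI_COMMAND environment variable and the network configuration as JSON on stdin, and the plugin answers with JSON on stdout. The handle function, the sample configuration and the version numbers are my own assumptions. The sketch configures no interfaces, returns only a placeholder result, and reports errors on stderr instead of the JSON error object a real plugin would print.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// netConf holds the fields every CNI network configuration carries.
type netConf struct {
	CNIVersion string `json:"cniVersion"`
	Name       string `json:"name"`
	Type       string `json:"type"`
}

// handle dispatches on the CNI_COMMAND value and returns what the plugin
// would print on stdout. The ADD result is a placeholder, not a real result.
func handle(command string, conf []byte) (string, error) {
	switch command {
	case "VERSION":
		return `{"cniVersion":"0.4.0","supportedVersions":["0.3.1","0.4.0"]}`, nil
	case "ADD", "DEL":
		var nc netConf
		if err := json.Unmarshal(conf, &nc); err != nil {
			return "", fmt.Errorf("failed to load cni config: %w", err)
		}
		if command == "DEL" {
			return "", nil // a successful DEL prints nothing
		}
		out, err := json.Marshal(map[string]string{"cniVersion": nc.CNIVersion})
		return string(out), err
	default:
		return "", fmt.Errorf("unsupported CNI_COMMAND %q", command)
	}
}

func main() {
	command := os.Getenv("CNI_COMMAND")
	// Sample configuration (hypothetical) used when the sketch is run by hand.
	conf := []byte(`{"cniVersion":"0.4.0","name":"demo","type":"sketch"}`)
	if command == "" {
		command = "ADD" // demo default so the sketch runs outside a container runtime
	} else if command != "VERSION" {
		var err error
		if conf, err = io.ReadAll(os.Stdin); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	out, err := handle(command, conf)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if out != "" {
		fmt.Println(out)
	}
}
```

A real plugin would create and configure an interface inside the namespace named by CNI_NETNS before printing its result, which is the part the linked Bash article walks through.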

At the start of this specification, the General considerations section describes how the container runtime should behave.

The container runtime must create a new network namespace for the container before invoking any plugins.

The runtime must then determine which networks this container should belong to, and for each network, which plugins must be executed.

The container runtime must add the container to each network by executing the corresponding plugins for each network sequentially.

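containerd was designed to follow these CNI considerations. The idea that it could just ignore networking and start anyway was badly mistaken: the container runtime and the container network plugin are tightly coupled, and Kubernetes does not work if either is missing.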
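The three considerations quoted from the specification can be sketched as a runtime-side loop. This is only an illustration of the ordering, not containerd's implementation: plugins are modeled as in-process functions instead of executables, the namespace path is hypothetical and nothing is actually created, and the plugin names in main are just labels.

```go
package main

import (
	"errors"
	"fmt"
)

// plugin is a hypothetical in-process stand-in for executing one CNI plugin binary.
type plugin struct {
	Name string
	Add  func(netns string) error
}

// network is one network the container should join, with its ordered plugin chain.
type network struct {
	Name    string
	Plugins []plugin
}

// attach applies the three quoted rules in order and returns the plugins it ran.
func attach(containerID string, networks []network) ([]string, error) {
	// Rule 1: the network namespace exists before any plugin is invoked.
	netns := "/var/run/netns/" + containerID // hypothetical path, nothing is created here
	// Rule 2: decide which networks apply; with none configured there is nothing to run.
	if len(networks) == 0 {
		return nil, errors.New("no network configured for container " + containerID)
	}
	// Rule 3: run the plugins of each network sequentially.
	var ran []string
	for _, nw := range networks {
		for _, p := range nw.Plugins {
			if err := p.Add(netns); err != nil {
				return ran, fmt.Errorf("network %s, plugin %s: %w", nw.Name, p.Name, err)
			}
			ran = append(ran, nw.Name+"/"+p.Name)
		}
	}
	return ran, nil
}

func main() {
	ok := func(string) error { return nil }
	networks := []network{
		{Name: "cluster", Plugins: []plugin{{Name: "bridge", Add: ok}, {Name: "portmap", Add: ok}}},
		{Name: "storage", Plugins: []plugin{{Name: "macvlan", Add: ok}}},
	}
	fmt.Println(attach("abc123", networks))
	// With no network configured, the runtime has nothing it can attach the container to.
	fmt.Println(attach("abc123", nil))
}
```

The second call shows the situation in this post: with no network configured, the runtime has no valid way to complete the steps the specification requires.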

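Multiple plugins can be configured for container networking

One thing that caught my eye while reading: the specification is worded as if multiple container network plugins can be applied, and such environments do exist in practice.

For example, a standard OpenShift configuration ships with two CNIs from the start, OpenShift-SDN and Multus CNI. The latter is the important one: by attaching multiple NICs to a Pod, which the standard Docker-derived container network layout cannot do, it lets you build a multi-network environment on Kubernetes.

"Multusで遊ぶ" (Playing with Multus), a post on Red Hat's tech blog, gives a very clear overview.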

How kubeadm behaves

kubeadm, the tool that helps you set up Kubernetes, also requires a CNI to be set up by default.

Bonus

When Kubernetes is built by kind without a CNI, the nodes are all down, yet the namespaces are alive and well. Apparently the two are unrelated.

$ kubectl get namespaces
NAME                 STATUS   AGE
default              Active   49m
kube-node-lease      Active   49m
kube-public          Active   49m
kube-system          Active   49m
local-path-storage   Active   49m

Also, looking at the Pods, coredns and local-path-provisioner are Pending, but some Pods are running normally. If any kind soul has read this far, please tell me why!! (My guess is that it is because they are static Pods.)

HOST:~/work/Certified-Kubernetes-Security-Specialist/hands-on/00-kind-cluster$ kubectl get pods -A | grep Pending
kube-system          coredns-f9fd979d6-4fjkn                      0/1     Pending   0          25m
kube-system          coredns-f9fd979d6-b46cm                      0/1     Pending   0          25m
local-path-storage   local-path-provisioner-78776bfc44-rtrxf      0/1     Pending   0          25m

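The Pending Pods are in fact Pending for slightly different reasons.

local-path-provisioner is Pending because there is no Ready node, so it has nowhere to go. This is the same reason as the nginx Pod deployed above.

coredns is Pending because, as the component that resolves DNS inside Kubernetes, it requires a CNI to be deployed. But nothing can be deployed without a Ready node either, so perhaps this is a 🐓 and 🥚 situation.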

coredns (or kube-dns) is stuck in the Pending state
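As a rough sketch of the static Pod guess, here is one way to model which Pods can start while the CNI is missing. It rests on assumptions not verified in this post: that the control-plane static Pods and kube-proxy run with hostNetwork: true, that static Pods are started by the kubelet without going through the scheduler, and that kube-proxy tolerates the not-ready taint. The pod type and canStart function are hypothetical.

```go
package main

import "fmt"

// pod is a simplified model; the field values used in main are assumptions.
type pod struct {
	Name              string
	HostNetwork       bool // shares the node's network namespace, so no CNI call is needed
	Static            bool // started by the kubelet from a manifest, bypassing the scheduler
	ToleratesNotReady bool // tolerates the node.kubernetes.io/not-ready taint
}

// canStart reports whether the pod can run on a node whose CNI is not initialized:
// it must be placed on the node somehow, and it must not need a Pod network.
func canStart(p pod) bool {
	placed := p.Static || p.ToleratesNotReady
	return placed && p.HostNetwork
}

func main() {
	pods := []pod{
		{Name: "kube-apiserver-kind-control-plane", HostNetwork: true, Static: true},
		{Name: "kube-proxy-lhp2l", HostNetwork: true, ToleratesNotReady: true},
		{Name: "coredns-f9fd979d6-4fjkn"},
		{Name: "nginx"},
	}
	for _, p := range pods {
		state := "Pending"
		if canStart(p) {
			state = "Running"
		}
		fmt.Printf("%-36s %s\n", p.Name, state)
	}
}
```

Under these assumptions, the Pods that keep running are the ones that neither wait for the scheduler nor need a Pod network, which would be consistent with the kubectl get pods output above.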