Trying Out Kubernetes Cluster Analysis with k8sgpt and LocalAI

Overview

k8sGPT is a project that uses AI to identify and troubleshoot issues in Kubernetes (k8s) clusters. Although it is relatively young, with its first commit in March 2023, it was accepted as a CNCF sandbox project in December 2023 and currently has around 5k GitHub stars.

https://k8sgpt.ai/

One feature of k8sGPT is that it can use not only ChatGPT and AI services from major cloud providers but also locally deployable LLM runtimes such as LocalAI and ollama. This is ideal when public AI services are hard to use for data confidentiality reasons, or when you want to run the analysis with models you have prepared locally. Refer to the documentation for the list of supported AI backends.
In this article, I try it out in combination with a locally built LocalAI.

Environment Setup

Building LocalAI

LocalAI can run on Docker or Kubernetes; for simplicity, I use Docker here.

https://localai.io/docs/getting-started/models/

Official Docker images are available, but the image tags are divided into several types depending on the use case. They are broadly categorized as follows (see https://localai.io/basics/container/ for details):

  • Whether to use GPU or CPU only
  • Whether to use pre-configured models

In this case, since I'll be using pre-configured models and a CPU-only setup, I'll use the localai/localai:latest-aio-cpu image tag.
I'll create a docker-compose.yml by referring to the examples in Usage and GitHub.

docker-compose.yml
services:
  api:
    container_name: localai
    image: localai/localai:latest-aio-cpu
    ports:
      - 8080:8080
    environment:
      LOCALAI_API_KEY: test
    volumes:
      - ./models:/build/models:cached

LocalAI has many configuration options, but here I mostly keep the defaults.
To accept only authenticated requests, I set an API key in LOCALAI_API_KEY. As with any API key, a random, hard-to-guess string is recommended, but for simplicity I use test here.
When you start the container with docker-compose up -d, each model described in https://localai.io/basics/container/#all-in-one-images will be downloaded. You can check the progress from the container logs using docker logs localai.
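
Since a hard-to-guess API key is recommended, one way to generate one is Python's standard secrets module. This is only a minimal sketch: the function name generate_api_key and the 32-byte length are my own choices for illustration, not something LocalAI prescribes.

```python
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random token suitable for use as an API key."""
    # token_urlsafe draws from the OS CSPRNG and base64url-encodes the bytes.
    return secrets.token_urlsafe(num_bytes)


key = generate_api_key()
# Paste the printed value into LOCALAI_API_KEY in docker-compose.yml.
print(key)
```

The same value then has to be used for the Authorization header and for the Kubernetes secret created later.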

Once the download is complete and the message "LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080" appears in the logs, LocalAI starts accepting requests. Verify the operation by calling the API as described in Try it out.
Since an API key is set this time, requests that do not include it in the header are rejected.

$ curl http://192.168.3.204:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "gpt-4", "messages": [{"role": "user", "content": "How are you doing?"}] }'

{"message":"Authorization header missing"}

The request will pass if you specify "Authorization: Bearer [api_key]".

$ curl http://192.168.3.204:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer test" \
    -d '{ "model": "gpt-4", "messages": [{"role": "user", "content": "How are you doing?"}] }'

{
  "created": 1723306961,
  "object": "chat.completion",
  "id": "fc893abe-6ae6-4ff8-856b-dfda6ed41138",
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Hello! I'm doing well, thank you for asking. How about you? How can I assist you today?"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 24,
    "total_tokens": 38
  }
}
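
The same authenticated call can be sketched in Python using only the standard library. The host, port, and key below are the ones used in this article; the function names build_chat_request, extract_reply, and ask are hypothetical helpers, not part of any LocalAI client.

```python
import json
import urllib.request

# Values taken from this article; replace them with your own host and key.
BASE_URL = "http://192.168.3.204:8080/v1"
API_KEY = "test"


def build_chat_request(prompt: str, model: str = "gpt-4") -> urllib.request.Request:
    """Build (but do not send) an authenticated chat completion request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def extract_reply(response_body: str) -> str:
    """Pull the assistant's text out of a chat completion response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send the request; CPU-only inference can take tens of seconds."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(resp.read().decode("utf-8"))
```

Calling ask("How are you doing?") would send the request. The generous timeout is a guess based on the roughly 28 seconds the first request took in the logs below.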

You can check the access logs in the LocalAI container logs. The communication from ip=192.168.3.30 corresponds to the request above.

10:08AM INF Success ip=127.0.0.1 latency="71.495µs" method=GET status=200 url=/readyz
10:08AM WRN Client error ip=192.168.3.30 latency="77.404µs" method=POST status=401 url=/v1/chat/completions
10:08AM INF Trying to load the model 'b5869d55688a529c3738cb044e92c331' with the backend '[llama-cpp llama-ggml gpt4all llama-cpp-fallback rwkv piper stablediffusion whisper huggingface bert-embeddings /build/backend/python/openvoice/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/vllm/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/petals/run.sh /build/backend/python/bark/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/coqui/run.sh /build/backend/python/transformers/run.sh /build/backend/python/rerankers/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/exllama/run.sh /build/backend/python/mamba/run.sh /build/backend/python/diffusers/run.sh]'
10:08AM INF [llama-cpp] Attempting to load
10:08AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend llama-cpp
WARNING: failed to read int from file: open /sys/class/drm/card0/device/numa_node: no such file or directory
WARNING: error parsing the pci address "virtio0"
10:08AM INF [llama-cpp] attempting to load with AVX2 variant
10:08AM INF [llama-cpp] Loads OK
10:09AM INF Success ip=192.168.3.30 latency=28.564574887s method=POST status=200 url=/v1/chat/completions
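
If you want to pick out fields such as the client IP and status code from these access-log lines, a small parser is enough. The sketch below assumes the key=value layout seen in the log above; parse_access_log is a hypothetical helper, and it is only meant for the access-log lines, not for every line LocalAI prints.

```python
import shlex


def parse_access_log(line: str) -> dict:
    """Return the key=value fields of one LocalAI access-log line."""
    fields = {}
    # shlex.split keeps quoted values such as latency="71.495µs" together
    # and drops the surrounding quotes.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens without "=", e.g. the timestamp and level
            fields[key] = value
    return fields


lines = [
    '10:08AM WRN Client error ip=192.168.3.30 latency="77.404µs" method=POST status=401 url=/v1/chat/completions',
    "10:09AM INF Success ip=192.168.3.30 latency=28.564574887s method=POST status=200 url=/v1/chat/completions",
]
for entry in map(parse_access_log, lines):
    print(entry["ip"], entry["status"], entry["url"])
```

Here the 401 entry is the request without the Authorization header, and the 200 entry is the authenticated one.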

Building k8sGPT

k8sGPT can analyze a cluster either by running commands via the CLI or through an operator deployed to the cluster. Here, I use the operator.
The operator can be installed with Helm.

https://docs.k8sgpt.ai/getting-started/in-cluster-operator/

helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install k8sgpt k8sgpt/k8sgpt-operator -n k8sgpt-operator-system --create-namespace

Next, create a custom resource of kind: K8sGPT. This resource serves as the "backend provider" that communicates with the backend (LocalAI in this case) to perform the in-cluster analysis.
First, create a secret holding the LocalAI API key set earlier.

kubectl create secret generic k8sgpt-localai-secret --from-literal=api-key=test -n k8sgpt-operator-system
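
For reference, the command above produces a Secret whose values are stored base64-encoded. The following sketch builds an equivalent manifest in Python; secret_manifest is a hypothetical helper, and the Opaque type is what I understand kubectl create secret generic to use by default.

```python
import base64
import json


def secret_manifest(name: str, namespace: str, literals: dict) -> dict:
    """Build a Secret manifest equivalent to `kubectl create secret generic`."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Kubernetes stores Secret values base64-encoded under `data`.
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in literals.items()
        },
    }


manifest = secret_manifest(
    "k8sgpt-localai-secret", "k8sgpt-operator-system", {"api-key": "test"}
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, the printed output could be piped to kubectl apply -f - instead of running the command above. Note that base64 is an encoding, not encryption.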

To specify LocalAI as the backend, prepare a manifest following the GitHub example. The parts that need modification are as follows:

| Property | Description | Value |
| --- | --- | --- |
| backend | Specify the name of the backend provider to communicate with. | localai |
| model | Specify the name of the model to use for analysis in the destination backend provider. In LocalAI, the text generation model name is gpt-4, so specify that. | gpt-4 |
| baseUrl | Specify the endpoint of the destination backend provider. | http://192.168.3.204:8080/v1 |
| version | Specify the version of k8sGPT. As of now, v0.3.40 is the latest version. | v0.3.40 |

k8sgpt-localai.yml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-localai
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: gpt-4
    secret:
      name: k8sgpt-localai-secret
      key: api-key
    backend: localai
    baseUrl: http://192.168.3.204:8080/v1
  noCache: false
  version: v0.3.40
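
Because the k8sGPT pod runs inside the cluster, baseUrl must be reachable from a pod, which is why the manifest uses the Docker host's LAN address. The sketch below is a simple sanity check for that value; check_base_url and its rules are my own heuristics, not validation performed by the operator.

```python
from urllib.parse import urlparse


def check_base_url(base_url: str) -> list:
    """Return warnings for a baseUrl that the in-cluster pod may not reach."""
    warnings = []
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https"):
        warnings.append("baseUrl should start with http:// or https://")
    # Inside a pod, localhost refers to the pod itself, not the Docker host.
    if parsed.hostname in ("localhost", "127.0.0.1", "::1"):
        warnings.append("localhost resolves to the k8sgpt pod, not to LocalAI")
    # The article's manifest points at the OpenAI-compatible /v1 prefix.
    if not parsed.path.rstrip("/").endswith("/v1"):
        warnings.append("baseUrl does not end with /v1")
    return warnings


print(check_base_url("http://192.168.3.204:8080/v1"))  # no warnings expected
print(check_base_url("http://localhost:8080"))
```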

Once deployed, the operator detects the creation and the k8sGPT pod starts up.

$ k get pod
NAME                                                              READY   STATUS    RESTARTS   AGE
k8s-operator-k8sgpt-operator-controller-manager-69896dc68dmjg2j   2/2     Running   0          69s
k8sgpt-localai-67b4bd6497-jj8wl                                   1/1     Running   0          49s

Analysis

Now that everything is ready, let's deploy the following manifest from the documentation to trigger an error.

apiVersion: v1
kind: Pod
metadata:
  name: broken-pod
  namespace: default
spec:
  containers:
    - name: broken-pod
      image: nginx:1.a.b.c  # The pod won't start because the image tag is invalid
      livenessProbe:
        httpGet:
          path: /
          port: 81
        initialDelaySeconds: 3
        periodSeconds: 3

After waiting a short while after deployment, the analysis will complete, and the results will be created as results.core.k8sgpt.ai resources.

$ k get results.core.k8sgpt.ai
NAME                                KIND          BACKEND
argocdargocdapplicationcontroller   StatefulSet   localai
backstagebackstagefront             Service       localai
backstagesamplenginxsvc             Service       localai
defaultbrokenpod                    Pod           localai

Details of the analysis can be checked in the spec of the YAML. The results indicate that the image pull failed and provide proposed solutions.

$ k get results.core.k8sgpt.ai  defaultbrokenpod -o yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2024-08-11T10:25:45Z"
  generation: 1
  labels:
    k8sgpts.k8sgpt.ai/backend: localai
    k8sgpts.k8sgpt.ai/name: k8sgpt-localai
    k8sgpts.k8sgpt.ai/namespace: k8sgpt-operator-system
  name: defaultbrokenpod
  namespace: k8sgpt-operator-system
  resourceVersion: "8013182"
  uid: ae75161d-8d96-42e3-a80e-c931ccc96352
spec:
  backend: localai
  details: |-
    Error: The Kubernetes container is unable to pull the specified image "nginx:1.a.b.c" due to a network issue or image availability problem.
    Solution: 1) Check your internet connection. 2) Verify the image is available and tagged correctly. 3) Retry the image pull operation. If the problem persists, consider using a different image version.
  error:
  - text: Back-off pulling image "nginx:1.a.b.c"
  kind: Pod
  name: default/broken-pod
  parentObject: ""
status:
  lifecycle: historical
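
To read many results at a glance, the interesting fields can be pulled out of each Result programmatically. The sketch below works on a trimmed copy of the resource above as a Python dict, as you would get from json.loads on the -o json output; summarize is a hypothetical helper.

```python
# Trimmed copy of the Result shown above (metadata and status omitted).
sample = {
    "kind": "Result",
    "spec": {
        "backend": "localai",
        "kind": "Pod",
        "name": "default/broken-pod",
        "details": (
            'Error: The Kubernetes container is unable to pull the specified '
            'image "nginx:1.a.b.c" due to a network issue or image '
            "availability problem."
        ),
        "error": [{"text": 'Back-off pulling image "nginx:1.a.b.c"'}],
    },
}


def summarize(result: dict) -> dict:
    """Collect the resource, raw errors, and AI explanation of one Result."""
    spec = result["spec"]
    return {
        "resource": f'{spec["kind"]} {spec["name"]}',
        "errors": [item["text"] for item in spec.get("error", [])],
        "explanation": spec.get("details", ""),
    }


summary = summarize(sample)
print(summary["resource"])
print(summary["errors"])
```

Keeping errors and explanation separate is useful because the former is the detected condition while the latter is text generated by the model.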

Since the k8sgpt pod is running in server mode, analysis results can also be retrieved from inside or outside the cluster using gRPC.
Check the CLUSTER-IP of the service created along with the k8sgpt pod.

$ k get svc
NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
k8sgpt-localai                                            ClusterIP   10.109.241.92   <none>        8080/TCP   10m

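Running grpcurl against the service's CLUSTER-IP from within the cluster returns the results in JSON format. Note that the IP in the command below differs from the one in the listing above, so substitute the CLUSTER-IP of your own service. The content corresponds to spec.error[].text in the Result resource.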

$ grpcurl -plaintext -d '{"namespace": "default"}' 10.107.21.110:8080 schema.v1.ServerService/Analyze
{
  "status": "ProblemDetected",
  "problems": 1,
  "results": [
    {
      "kind": "Pod",
      "name": "default/broken-pod",
      "error": [
        {
          "text": "Back-off pulling image \"nginx:1.a.b.c\""
        }
      ]
    }
  ]
}

According to the documentation, specifying explain: true should return the details as well, but for some reason doing so makes the pod crash and fall into CrashLoopBackOff.

$ grpcurl -plaintext -d '{"explain": true, "namespace": "default"}' 10.107.21.110:8080 schema.v1.ServerService/Analyze
ERROR:
  Code: Unavailable
  Message: error reading from server: EOF
$ k logs k8sgpt-localai-67b4bd6497-qcmfq k8sgpt
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x261b43b]

goroutine 1 [running]:
github.com/k8sgpt-ai/k8sgpt/cmd/serve.init.func1(0xc0001db000?, {0x2f83ad6?, 0x4?, 0x2f83ada?})
        /workspace/cmd/serve/serve.go:152 +0x5db
github.com/spf13/cobra.(*Command).execute(0x5361b40, {0x53f9000, 0x0, 0x0})
        /go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:989 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0x535d920)
        /go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041
github.com/k8sgpt-ai/k8sgpt/cmd.Execute({0x37e78ec?, 0x0?}, {0x37e78ed?, 0x51693c0?}, {0x37e78ee?, 0xc0000061c0?})
        /workspace/cmd/root.go:59 +0x91
main.main()
        /workspace/main.go:25 +0x3d

Searching for the error on GitHub reveals several issues and PRs, so this might be fixed soon.


This cluster is the same one I used for my earlier article about Backstage, so several ArgoCD and Backstage resources are already deployed. Issues were detected for those as well, so let's take a look.
For ArgoCD, the result points out that a StatefulSet named argocd-application-controller uses a non-existent service, argocd-application-controller.

argocd
- apiVersion: core.k8sgpt.ai/v1alpha1
  kind: Result
  metadata:
    creationTimestamp: "2024-08-11T10:25:45Z"
    generation: 1
    labels:
      k8sgpts.k8sgpt.ai/backend: localai
      k8sgpts.k8sgpt.ai/name: k8sgpt-localai
      k8sgpts.k8sgpt.ai/namespace: k8sgpt-operator-system
    name: argocdargocdapplicationcontroller
    namespace: k8sgpt-operator-system
    resourceVersion: "8013179"
    uid: b5411baf-cb92-4f37-82a1-fefbe2a0a942
  spec:
    backend: localai
    details: |-
      Error: StatefulSet uses a non-existent service argocd/argocd-application-controller.
      Solution: 1. Check if the service name is correct. 2. Verify if the service is deployed. 3. Ensure the service is accessible within the cluster. 4. If the issue persists, recreate the StatefulSet with correct service details.
    error:
    - sensitive:
      - masked: fCJqMndQ
        unmasked: argocd
      - masked: MVQ5Yl5vLTg7d0d5LDJmUHRjQjhmMXxeOVpOM0k=
        unmasked: argocd-application-controller
      text: StatefulSet uses the service argocd/argocd-application-controller which
        does not exist.
    kind: StatefulSet
    name: argocd/argocd-application-controller
    parentObject: ""
  status:
    lifecycle: historical
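
The sensitive entries above pair each original value (unmasked) with a masked one. Decoding the masked strings as base64 suggests they are random replacements of the same length as the original, presumably so that real names are not sent to the backend. This is an observation from the output above, not documented behavior I have verified.

```python
import base64

# (masked, unmasked) pairs copied from the Result shown above.
PAIRS = [
    ("fCJqMndQ", "argocd"),
    ("MVQ5Yl5vLTg7d0d5LDJmUHRjQjhmMXxeOVpOM0k=", "argocd-application-controller"),
]

for masked, unmasked in PAIRS:
    decoded = base64.b64decode(masked)
    # The decoded bytes look random but match the original length.
    print(len(decoded), len(unmasked))
```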

Looking at the actual resources, the StatefulSet exists, but there is indeed no service named argocd-application-controller.
Instead, the service argocd-applicationset-controller is associated with the deployment argocd-applicationset-controller.

$ k get statefulsets.apps
NAME                            READY   AGE
argocd-application-controller   1/1     14d

$ k get svc
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
argocd-applicationset-controller          ClusterIP   10.99.253.179    <none>        7000/TCP,8080/TCP            14d
argocd-dex-server                         ClusterIP   10.111.75.121    <none>        5556/TCP,5557/TCP,5558/TCP   14d
argocd-metrics                            ClusterIP   10.97.97.124     <none>        8082/TCP                     14d
argocd-notifications-controller-metrics   ClusterIP   10.111.78.10     <none>        9001/TCP                     14d
argocd-redis                              ClusterIP   10.111.104.228   <none>        6379/TCP                     14d
argocd-repo-server                        ClusterIP   10.96.70.22      <none>        8081/TCP,8084/TCP            14d
argocd-server                             ClusterIP   10.110.184.220   <none>        80/TCP,443/TCP               14d
argocd-server-metrics                     ClusterIP   10.102.135.119   <none>        8083/TCP                     14d

$ k describe svc argocd-applicationset-controller
Name:              argocd-applicationset-controller
Namespace:         argocd
Labels:            app.kubernetes.io/component=applicationset-controller
                   app.kubernetes.io/name=argocd-applicationset-controller
                   app.kubernetes.io/part-of=argocd
Annotations:       <none>
Selector:          app.kubernetes.io/name=argocd-applicationset-controller
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.99.253.179
IPs:               10.99.253.179
Port:              webhook  7000/TCP
TargetPort:        webhook/TCP
Endpoints:         10.244.1.20:7000
Port:              metrics  8080/TCP
TargetPort:        metrics/TCP
Endpoints:         10.244.1.20:8080
Session Affinity:  None
Events:            <none>

$ k get pod -o wide argocd-application-controller-0 argocd-applicationset-controller-8485455fd5-vhrj7
NAME                                                READY   STATUS    RESTARTS       AGE   IP            NODE     NOMINATED NODE   READINESS GATES
argocd-application-controller-0                     1/1     Running   1 (4d1h ago)   13d   10.244.1.24   k8s-w1   <none>           <none>
argocd-applicationset-controller-8485455fd5-vhrj7   1/1     Running   1 (4d1h ago)   14d   10.244.1.20   k8s-w1   <none>           <none>

However, the above resources were deployed following the standard ArgoCD installation procedure. Nothing on the pod side shows an associated service, so I could not tell how it determined that the StatefulSet was using argocd-application-controller. One plausible source is the StatefulSet's spec.serviceName field, which names its governing service, although I did not verify this here.
As generative AI output can contain mistakes, this finding should be treated with some caution.
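
One way to investigate such a finding is to compare the StatefulSet's spec.serviceName with the services that actually exist. The sketch below does that comparison on plain Python data; the serviceName value in the example is assumed from the finding rather than read from the cluster, and missing_governing_service is a hypothetical helper.

```python
def missing_governing_service(statefulset: dict, service_names: set):
    """Return the StatefulSet's serviceName if no such Service exists."""
    name = statefulset.get("spec", {}).get("serviceName")
    if name and name not in service_names:
        return name
    return None


# Service names from the `k get svc` output above.
services = {
    "argocd-applicationset-controller",
    "argocd-dex-server",
    "argocd-metrics",
    "argocd-notifications-controller-metrics",
    "argocd-redis",
    "argocd-repo-server",
    "argocd-server",
    "argocd-server-metrics",
}
# Hypothetical excerpt: this serviceName is assumed, not verified.
sts = {"spec": {"serviceName": "argocd-application-controller"}}

print(missing_governing_service(sts, services))
```

On a real cluster, the value could be read with kubectl get statefulset argocd-application-controller -n argocd -o jsonpath='{.spec.serviceName}'.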

Regarding the Backstage-related issue, it states that no target pods are set for the service backstage-front.

backstage
- apiVersion: core.k8sgpt.ai/v1alpha1
  kind: Result
  metadata:
    creationTimestamp: "2024-08-11T10:25:45Z"
    generation: 1
    labels:
      k8sgpts.k8sgpt.ai/backend: localai
      k8sgpts.k8sgpt.ai/name: k8sgpt-localai
      k8sgpts.k8sgpt.ai/namespace: k8sgpt-operator-system
    name: backstagebackstagefront
    namespace: k8sgpt-operator-system
    resourceVersion: "8013181"
    uid: b1d551d3-f078-4570-b82c-bc4ca92a57bb
  spec:
    backend: localai
    details: |-
      Error: The service with label app.kubernetes.io/name=backstage has no available endpoints.
      Solution: Check the service and endpoint configuration, ensure the deployment is running, and verify the service is correctly associated with the desired endpoints. If needed, update or recreate the service and endpoints accordingly.
    error:
    - sensitive:
      - masked: eVF1L3wsISNKUTBPVmJUWnkyLHN3fA==
        unmasked: app.kubernetes.io/name
      - masked: NmxoLDJ7UnRW
        unmasked: backstage
      text: Service has no endpoints, expected label app.kubernetes.io/name=backstage
    kind: Service
    name: backstage/backstage-front
    parentObject: ""
  status:
    lifecycle: historical

Looking at the resources, the Endpoints are indeed not set, so this is confirmed to be a valid finding.

$ k describe svc backstage-front
Name:              backstage-front
Namespace:         backstage
Labels:            <none>
Annotations:       <none>
Selector:          app.kubernetes.io/name=backstage
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.110.88.138
IPs:               10.110.88.138
Port:              <unset>  80/TCP
TargetPort:        7007/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>
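
A Service gets endpoints only for pods whose labels contain every key/value pair in its selector. The sketch below shows that matching rule on plain dictionaries; the pod names and labels are hypothetical, since the article does not show the Backstage pods, and matching_pods is an illustrative helper.

```python
def matching_pods(selector: dict, pods: dict) -> list:
    """Return names of pods whose labels contain every selector pair."""
    return [
        name
        for name, labels in pods.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]


# Selector from the `k describe svc backstage-front` output above.
selector = {"app.kubernetes.io/name": "backstage"}
# Hypothetical pod labels for illustration only.
pods = {
    "backstage-abc": {"app": "backstage"},
    "sample-nginx": {"app.kubernetes.io/name": "sample-nginx"},
}

print(matching_pods(selector, pods))  # [] -> the Service would have no endpoints
```

In a real cluster, matching pods also have to be ready before they appear as endpoints, which this sketch does not model.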

Since the explanatory text is generated by the backend provider's AI service (LocalAI in this case), the results are heavily influenced by the model. Still, k8sGPT can detect not only resources in error states but also resources with configuration flaws like those above.
The components that inspect each resource type in k8sGPT are called "Analyzers," and a list of supported Analyzers is available on GitHub.

Operator Architecture

While it is a very concise diagram, the operator's architecture is as follows.


Operator architecture diagram. Quoted from the documentation.

The K8sGPT deployment in the diagram corresponds to the K8sGPT pod, which communicates with the external LLM specified as the backend provider to analyze resources.
Then, the k8sGPT operator communicates with the K8sGPT pod and publishes the results as result custom resources based on the analysis findings.

The operator pod logs contain the line Creating new client for 10.108.64.236:8080, which corresponds to the address of the k8sgpt container within the target k8sGPT pod.
Based on the analysis results from the k8sGPT pod, the operator creates a Result for each identified issue, such as argocdargocdapplicationcontroller.
Even after a Result is created, the operator keeps reconciling to periodically check whether the identified issue has been resolved.

Operator pod logs
2024-08-11T10:19:02Z    INFO    Starting workers        {"controller": "k8sgpt", "controllerGroup": "core.k8sgpt.ai", "controllerKind": "K8sGPT", "worker count": 1}
Finished Reconciling k8sGPT
Finished Reconciling k8sGPT
Creating new client for 10.108.64.236:8080
Connection established between 10.108.64.236:8080 and localhost with time out of 1 seconds.
Remote Address : 10.108.64.236:8080
K8sGPT address: 10.108.64.236:8080
Created result argocdargocdapplicationcontroller
Created result backstagesamplenginxsvc
Created result backstagebackstagefront
Created result defaultbrokenpod
Finished Reconciling k8sGPT

Creating new client for 10.108.64.236:8080
Connection established between 10.108.64.236:8080 and localhost with time out of 1 seconds.
Remote Address : 10.108.64.236:8080
K8sGPT address: 10.108.64.236:8080
Checking if argocdargocdapplicationcontroller is still relevant
Checking if backstagesamplenginxsvc is still relevant
Checking if backstagebackstagefront is still relevant
Checking if defaultbrokenpod is still relevant
Finished Reconciling k8sGPT

When an identified issue is fixed, no log line explicitly says it was resolved, but you can tell because the line "Checking if [resource] is still relevant" no longer appears for that resource.

Operator pod logs
# When defaultbrokenpod is fixed

Creating new client for 10.108.64.236:8080
Connection established between 10.108.64.236:8080 and localhost with time out of 1 seconds.
Remote Address : 10.108.64.236:8080
K8sGPT address: 10.108.64.236:8080
Checking if backstagebackstagefront is still relevant
Checking if argocdargocdapplicationcontroller is still relevant
Checking if backstagesamplenginxsvc is still relevant
Finished Reconciling k8sGPT
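
Because resolution has to be inferred from a missing log line, comparing two reconcile cycles makes it explicit. The sketch below extracts the checked result names from each cycle and reports the difference; the line format is taken from the logs above, and checked_results is a hypothetical helper.

```python
PREFIX = "Checking if "
SUFFIX = " is still relevant"


def checked_results(cycle_lines: list) -> set:
    """Return the result names checked during one reconcile cycle."""
    names = set()
    for line in cycle_lines:
        line = line.strip()
        if line.startswith(PREFIX) and line.endswith(SUFFIX):
            names.add(line[len(PREFIX):-len(SUFFIX)])
    return names


before = [
    "Checking if argocdargocdapplicationcontroller is still relevant",
    "Checking if backstagesamplenginxsvc is still relevant",
    "Checking if backstagebackstagefront is still relevant",
    "Checking if defaultbrokenpod is still relevant",
    "Finished Reconciling k8sGPT",
]
after = [
    "Checking if backstagebackstagefront is still relevant",
    "Checking if argocdargocdapplicationcontroller is still relevant",
    "Checking if backstagesamplenginxsvc is still relevant",
    "Finished Reconciling k8sGPT",
]

resolved = checked_results(before) - checked_results(after)
print(sorted(resolved))  # ['defaultbrokenpod']
```

Using sets means the changing order of the lines between cycles does not matter.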

Conclusion

k8sGPT is a relatively young project, so the documentation is still brief and configuration options are few, but given the recent LLM boom, further enhancements can be expected.
As one of its concepts puts it, "Codified SRE knowledge knows what to search for": the big advantage is being able to use AI for SRE tasks that require knowledge and experience, such as Kubernetes troubleshooting, triage, and vulnerability assessment.
The CNCF blog also has a post on the topic.

https://www.cncf.io/blog/2024/07/11/now-what-kubernetes-troubleshooting-with-ai/