Trying Out Kubernetes Cluster Analysis with k8sgpt and LocalAI
Overview
k8sGPT is a project for using AI to identify issues and troubleshoot k8s clusters. Although it's a relatively young project with its first commit in March 2023, it was accepted as a CNCF sandbox project in December 2023 and currently has around 5k GitHub stars.
One of the features of k8sGPT is that it can use not only ChatGPT and AI services from major cloud providers but also locally deployable LLMs such as LocalAI and ollama. This is ideal for cases where using public AI services is difficult from a data confidentiality perspective, or when you want to perform analysis using models prepared locally. Refer to the documentation for supported AIs.
In this article, I will try using it in combination with LocalAI built locally.
Environment Setup
Building LocalAI
LocalAI can be built on Docker or Kubernetes, but here I will use Docker, focusing on simplicity.
Official Docker images are available, but the image tags are divided into several types depending on the use case. They are broadly categorized as follows (see https://localai.io/basics/container/ for details):
- Whether to use GPU or CPU only
- Whether to use pre-configured models
In this case, since I'll be using pre-configured models and a CPU-only setup, I'll use the localai/localai:latest-aio-cpu image tag.
I'll create a docker-compose.yml by referring to the examples in Usage and GitHub.
services:
  api:
    container_name: localai
    image: localai/localai:latest-aio-cpu
    ports:
      - 8080:8080
    environment:
      LOCALAI_API_KEY: test
    volumes:
      - ./models:/build/models:cached
LocalAI has various configuration items, but here I'll basically use the default settings.
Also, to accept only authenticated requests, I'll set an API key in LOCALAI_API_KEY. As with normal API keys, a random string that is hard to guess is recommended, but for simplicity, I've specified test.
When you start the container with docker-compose up -d, each model described in https://localai.io/basics/container/#all-in-one-images will be downloaded. You can check the progress from the container logs using docker logs localai.
Once the download is complete and the message "LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080" appears in the logs, LocalAI starts accepting requests. Verify the operation by calling the API as described in Try it out.
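Instead of watching the logs by hand, the wait can be scripted. The following is a minimal sketch, not part of the original setup: it polls the /readyz path that shows up in the LocalAI container logs in this article, and it assumes that path can be reached without the API key. The helper names, retry count, and delay are hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/readyz answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(check, attempts: int = 60, delay: float = 5.0) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False


# Example (performs network calls, so it is left commented out):
# wait_until_ready(lambda: is_ready("http://192.168.3.204:8080"))
```

The check is passed in as a callable so that the retry loop can be exercised without a running server.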
Since an API key is set this time, requests that do not include it in the header will be rejected.
$ curl http://192.168.3.204:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "gpt-4", "messages": [{"role": "user", "content": "How are you doing?"}] }'
{"message":"Authorization header missing"}
The request will pass if you specify "Authorization: Bearer [api_key]".
$ curl http://192.168.3.204:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer test" \
-d '{ "model": "gpt-4", "messages": [{"role": "user", "content": "How are you doing?"}] }'
{
"created": 1723306961,
"object": "chat.completion",
"id": "fc893abe-6ae6-4ff8-856b-dfda6ed41138",
"model": "gpt-4",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": "Hello! I'm doing well, thank you for asking. How about you? How can I assist you today?"
}
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 24,
"total_tokens": 38
}
}
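The same authenticated request can be issued from a script. This is a minimal sketch using only the Python standard library; the function names are hypothetical, while the endpoint, model name, and Bearer header mirror the curl call above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, prompt: str,
                       model: str = "gpt-4") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Without this header LocalAI answers 401 when LOCALAI_API_KEY is set.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def ask(base_url: str, api_key: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, api_key, prompt)
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (performs a network call, so it is left commented out):
# print(ask("http://192.168.3.204:8080", "test", "How are you doing?"))
```

A generous timeout is used because the first request also has to load the model, which took close to 30 seconds in the run shown in this article.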
You can check the access logs in the LocalAI container logs. The communication from ip=192.168.3.30 corresponds to the request above.
10:08AM INF Success ip=127.0.0.1 latency="71.495µs" method=GET status=200 url=/readyz
10:08AM WRN Client error ip=192.168.3.30 latency="77.404µs" method=POST status=401 url=/v1/chat/completions
10:08AM INF Trying to load the model 'b5869d55688a529c3738cb044e92c331' with the backend '[llama-cpp llama-ggml gpt4all llama-cpp-fallback rwkv piper stablediffusion whisper huggingface bert-embeddings /build/backend/python/openvoice/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/vllm/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/petals/run.sh /build/backend/python/bark/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/coqui/run.sh /build/backend/python/transformers/run.sh /build/backend/python/rerankers/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/exllama/run.sh /build/backend/python/mamba/run.sh /build/backend/python/diffusers/run.sh]'
10:08AM INF [llama-cpp] Attempting to load
10:08AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend llama-cpp
WARNING: failed to read int from file: open /sys/class/drm/card0/device/numa_node: no such file or directory
WARNING: error parsing the pci address "virtio0"
10:08AM INF [llama-cpp] attempting to load with AVX2 variant
10:08AM INF [llama-cpp] Loads OK
10:09AM INF Success ip=192.168.3.30 latency=28.564574887s method=POST status=200 url=/v1/chat/completions
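To pick rejected requests out of a longer log, the key=value fields can be parsed mechanically. The sketch below assumes the access-log lines keep the shape shown above; the helper names are hypothetical.

```python
import re

# key=value pairs; values are either double-quoted or run up to the next space.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_fields(line: str) -> dict:
    """Extract key=value fields from one LocalAI access-log line."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def rejected_requests(lines) -> list:
    """Return the parsed fields of every line whose status is 4xx."""
    rejected = []
    for line in lines:
        fields = parse_fields(line)
        if fields.get("status", "").startswith("4"):
            rejected.append(fields)
    return rejected
```

Fed with the three access-log lines above, this returns only the 401 entry from 192.168.3.30.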
Building k8sGPT
k8sGPT can analyze a cluster by either running commands via CLI or by deploying an operator to the cluster. Here, we will use the Operator.
The Operator can be installed with Helm.
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install k8sgpt k8sgpt/k8sgpt-operator -n k8sgpt-operator-system --create-namespace
Next, create a K8sGPT custom resource (kind: K8sGPT). This resource tells the operator which backend provider (LocalAI in this case) to communicate with when performing the in-cluster analysis.
Create a secret for the LocalAI API key created earlier.
kubectl create secret generic k8sgpt-localai-secret --from-literal=api-key=test -n k8sgpt-operator-system
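Kubernetes stores Secret values Base64-encoded in the data field, so the literal test ends up encoded in the cluster. As a small illustration (the function name is hypothetical), the encoding can be reproduced like this:

```python
import base64


def secret_data(literals: dict) -> dict:
    """Encode literal values the way they appear under `data` in a Secret."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in literals.items()
    }


print(secret_data({"api-key": "test"}))  # {'api-key': 'dGVzdA=='}
```

Base64 is an encoding, not encryption, which is one more reason to replace test with a hard-to-guess key outside of experiments.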
To specify LocalAI as the backend, prepare a manifest following the GitHub example. The parts that need modification are as follows:
| Property | Description | Value |
|---|---|---|
| backend | Specify the name of the backend provider to communicate with. | localai |
| model | Specify the name of the model to use for analysis in the destination backend provider. In LocalAI, the text generation model name is gpt-4, so specify that. | gpt-4 |
| baseUrl | Specify the endpoint of the destination backend provider. | http://192.168.3.204:8080/v1 |
| version | Specify the version of k8sGPT. As of now, v0.3.40 is the latest version. | v0.3.40 |
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-localai
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: gpt-4
    secret:
      name: k8sgpt-localai-secret
      key: api-key
    backend: localai
    baseUrl: http://192.168.3.204:8080/v1
  noCache: false
  version: v0.3.40
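If the same resource has to be created for several clusters or endpoints, the manifest can be generated rather than edited by hand. This is a sketch under the assumption that the field layout matches the manifest above; the function name and its parameters are hypothetical.

```python
import json


def k8sgpt_manifest(name: str, namespace: str, base_url: str, model: str,
                    secret_name: str, secret_key: str, version: str) -> dict:
    """Build a K8sGPT custom resource that points at a LocalAI backend."""
    return {
        "apiVersion": "core.k8sgpt.ai/v1alpha1",
        "kind": "K8sGPT",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "ai": {
                "enabled": True,
                "model": model,
                "backend": "localai",
                "baseUrl": base_url,
                "secret": {"name": secret_name, "key": secret_key},
            },
            "noCache": False,
            "version": version,
        },
    }


manifest = k8sgpt_manifest(
    "k8sgpt-localai", "k8sgpt-operator-system",
    "http://192.168.3.204:8080/v1", "gpt-4",
    "k8sgpt-localai-secret", "api-key", "v0.3.40",
)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be piped to kubectl apply -f -.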
Once deployed, the operator detects the creation and the k8sGPT pod starts up.
$ k get pod
NAME READY STATUS RESTARTS AGE
k8s-operator-k8sgpt-operator-controller-manager-69896dc68dmjg2j 2/2 Running 0 69s
k8sgpt-localai-67b4bd6497-jj8wl 1/1 Running 0 49s
Analysis
Now that everything is ready, let's deploy the following manifest from the documentation to trigger an error.
apiVersion: v1
kind: Pod
metadata:
  name: broken-pod
  namespace: default
spec:
  containers:
    - name: broken-pod
      image: nginx:1.a.b.c # The pod won't start because the image tag is invalid
      livenessProbe:
        httpGet:
          path: /
          port: 81
        initialDelaySeconds: 3
        periodSeconds: 3
Shortly after deployment, the analysis completes and the results are created as results.core.k8sgpt.ai resources.
$ k get results.core.k8sgpt.ai
NAME KIND BACKEND
argocdargocdapplicationcontroller StatefulSet localai
backstagebackstagefront Service localai
backstagesamplenginxsvc Service localai
defaultbrokenpod Pod localai
Details of the analysis can be checked in the spec of the YAML. The results indicate that the image pull failed and provide proposed solutions.
$ k get results.core.k8sgpt.ai defaultbrokenpod -o yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2024-08-11T10:25:45Z"
  generation: 1
  labels:
    k8sgpts.k8sgpt.ai/backend: localai
    k8sgpts.k8sgpt.ai/name: k8sgpt-localai
    k8sgpts.k8sgpt.ai/namespace: k8sgpt-operator-system
  name: defaultbrokenpod
  namespace: k8sgpt-operator-system
  resourceVersion: "8013182"
  uid: ae75161d-8d96-42e3-a80e-c931ccc96352
spec:
  backend: localai
  details: |-
    Error: The Kubernetes container is unable to pull the specified image "nginx:1.a.b.c" due to a network issue or image availability problem.
    Solution: 1) Check your internet connection. 2) Verify the image is available and tagged correctly. 3) Retry the image pull operation. If the problem persists, consider using a different image version.
  error:
  - text: Back-off pulling image "nginx:1.a.b.c"
  kind: Pod
  name: default/broken-pod
  parentObject: ""
status:
  lifecycle: historical
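When there are many results, a compact summary is easier to scan than full YAML. The sketch below assumes the list has been exported with kubectl get results.core.k8sgpt.ai -o json and that each item has the spec fields shown above; the function name is hypothetical.

```python
import json


def summarize(result_list: dict) -> list:
    """Return (kind, name, error texts) for every Result in a kubectl JSON list."""
    rows = []
    for item in result_list.get("items", []):
        spec = item.get("spec", {})
        texts = [err.get("text", "") for err in spec.get("error", [])]
        rows.append((spec.get("kind", ""), spec.get("name", ""), "; ".join(texts)))
    return rows


# Example: summarize(json.load(open("results.json")))
```

The AI-generated explanation lives in spec.details and could be added to each row in the same way.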
Since the k8sgpt pod is running in server mode, analysis results can also be retrieved from inside or outside the cluster using gRPC.
Check the CLUSTER-IP of the service created along with the k8sgpt pod.
$ k get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
k8sgpt-localai ClusterIP 10.109.241.92 <none> 8080/TCP 10m
Running grpcurl against the above IP from within the cluster retrieves the results in JSON format. The content corresponds to spec.error[].text from the result resource.
$ grpcurl -plaintext -d '{"namespace": "default"}' 10.107.21.110:8080 schema.v1.ServerService/Analyze
{
"status": "ProblemDetected",
"problems": 1,
"results": [
{
"kind": "Pod",
"name": "default/broken-pod",
"error": [
{
"text": "Back-off pulling image \"nginx:1.a.b.c\""
}
]
}
]
}
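The JSON printed by grpcurl is easy to post-process, for example to turn it into one line per problem. This is a sketch that assumes the response shape shown above; the function name is hypothetical.

```python
import json


def problem_lines(raw: str) -> list:
    """Flatten an Analyze response into 'Kind name: text' lines."""
    response = json.loads(raw)
    lines = []
    for result in response.get("results", []):
        for err in result.get("error", []):
            lines.append(f'{result["kind"]} {result["name"]}: {err["text"]}')
    return lines
```

An empty list means no problems were reported for the requested namespace.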
According to the documentation, specifying explain: true should also return the details, but for some reason doing so makes the pod crash and fall into CrashLoopBackOff.
$ grpcurl -plaintext -d '{"explain": true, "namespace": "default"}' 10.107.21.110:8080 schema.v1.ServerService/Analyze
ERROR:
Code: Unavailable
Message: error reading from server: EOF
$ k logs k8sgpt-localai-67b4bd6497-qcmfq k8sgpt
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x261b43b]
goroutine 1 [running]:
github.com/k8sgpt-ai/k8sgpt/cmd/serve.init.func1(0xc0001db000?, {0x2f83ad6?, 0x4?, 0x2f83ada?})
/workspace/cmd/serve/serve.go:152 +0x5db
github.com/spf13/cobra.(*Command).execute(0x5361b40, {0x53f9000, 0x0, 0x0})
/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:989 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0x535d920)
/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041
github.com/k8sgpt-ai/k8sgpt/cmd.Execute({0x37e78ec?, 0x0?}, {0x37e78ed?, 0x51693c0?}, {0x37e78ee?, 0xc0000061c0?})
/workspace/cmd/root.go:59 +0x91
main.main()
/workspace/main.go:25 +0x3d
Searching for the error on GitHub reveals several issues and PRs, so this might be fixed soon.
This cluster is the same one I used when writing my article about Backstage, so several resources related to ArgoCD and Backstage are already deployed. Issues have been detected for those as well, so let's take a look.
Regarding the issue with ArgoCD, it points out that a StatefulSet named argocd-application-controller is using a non-existent service argocd-application-controller.
- apiVersion: core.k8sgpt.ai/v1alpha1
  kind: Result
  metadata:
    creationTimestamp: "2024-08-11T10:25:45Z"
    generation: 1
    labels:
      k8sgpts.k8sgpt.ai/backend: localai
      k8sgpts.k8sgpt.ai/name: k8sgpt-localai
      k8sgpts.k8sgpt.ai/namespace: k8sgpt-operator-system
    name: argocdargocdapplicationcontroller
    namespace: k8sgpt-operator-system
    resourceVersion: "8013179"
    uid: b5411baf-cb92-4f37-82a1-fefbe2a0a942
  spec:
    backend: localai
    details: |-
      Error: StatefulSet uses a non-existent service argocd/argocd-application-controller.
      Solution: 1. Check if the service name is correct. 2. Verify if the service is deployed. 3. Ensure the service is accessible within the cluster. 4. If the issue persists, recreate the StatefulSet with correct service details.
    error:
    - sensitive:
      - masked: fCJqMndQ
        unmasked: argocd
      - masked: MVQ5Yl5vLTg7d0d5LDJmUHRjQjhmMXxeOVpOM0k=
        unmasked: argocd-application-controller
      text: StatefulSet uses the service argocd/argocd-application-controller which
        does not exist.
    kind: StatefulSet
    name: argocd/argocd-application-controller
    parentObject: ""
  status:
    lifecycle: historical
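The masked values under sensitive look like Base64 strings. Decoding them yields byte strings of exactly the same length as the unmasked names, which is consistent with the names being swapped for random same-length placeholders before the text is sent to the backend. That interpretation is an assumption drawn from these values, not something confirmed here. A quick check, with a hypothetical helper name:

```python
import base64


def masked_length_matches(masked: str, unmasked: str) -> bool:
    """True if the Base64-decoded mask is as long as the original string."""
    return len(base64.b64decode(masked)) == len(unmasked)


# Pairs copied from the result above.
pairs = [
    ("fCJqMndQ", "argocd"),
    ("MVQ5Yl5vLTg7d0d5LDJmUHRjQjhmMXxeOVpOM0k=", "argocd-application-controller"),
]
for masked, unmasked in pairs:
    print(unmasked, masked_length_matches(masked, unmasked))
```

Both pairs pass the check: the decoded masks are 6 and 29 bytes long, matching the two names.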
Looking at the actual resources, the StatefulSet exists, but there is indeed no service named argocd-application-controller.
Instead, the service argocd-applicationset-controller is associated with the deployment argocd-applicationset-controller.
$ k get statefulsets.apps
NAME READY AGE
argocd-application-controller 1/1 14d
$ k get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
argocd-applicationset-controller ClusterIP 10.99.253.179 <none> 7000/TCP,8080/TCP 14d
argocd-dex-server ClusterIP 10.111.75.121 <none> 5556/TCP,5557/TCP,5558/TCP 14d
argocd-metrics ClusterIP 10.97.97.124 <none> 8082/TCP 14d
argocd-notifications-controller-metrics ClusterIP 10.111.78.10 <none> 9001/TCP 14d
argocd-redis ClusterIP 10.111.104.228 <none> 6379/TCP 14d
argocd-repo-server ClusterIP 10.96.70.22 <none> 8081/TCP,8084/TCP 14d
argocd-server ClusterIP 10.110.184.220 <none> 80/TCP,443/TCP 14d
argocd-server-metrics ClusterIP 10.102.135.119 <none> 8083/TCP 14d
$ k describe svc argocd-applicationset-controller
Name: argocd-applicationset-controller
Namespace: argocd
Labels: app.kubernetes.io/component=applicationset-controller
app.kubernetes.io/name=argocd-applicationset-controller
app.kubernetes.io/part-of=argocd
Annotations: <none>
Selector: app.kubernetes.io/name=argocd-applicationset-controller
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.99.253.179
IPs: 10.99.253.179
Port: webhook 7000/TCP
TargetPort: webhook/TCP
Endpoints: 10.244.1.20:7000
Port: metrics 8080/TCP
TargetPort: metrics/TCP
Endpoints: 10.244.1.20:8080
Session Affinity: None
Events: <none>
$ k get pod -o wide argocd-application-controller-0 argocd-applicationset-controller-8485455fd5-vhrj7
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
argocd-application-controller-0 1/1 Running 1 (4d1h ago) 13d 10.244.1.24 k8s-w1 <none> <none>
argocd-applicationset-controller-8485455fd5-vhrj7 1/1 Running 1 (4d1h ago) 14d 10.244.1.20 k8s-w1 <none> <none>
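However, the above resources were deployed following the standard ArgoCD installation procedure, and a Pod carries no reference to the Services that select it, so at first glance it is unclear how the StatefulSet was found to be using argocd-application-controller.
A StatefulSet does, however, have a spec.serviceName field naming its governing Service, and the text under spec.error comes from k8sGPT itself rather than from the LLM (it is returned even without explain). The finding may therefore simply reflect that field pointing to a Service that does not exist, rather than a generative-AI mistake; checking spec.serviceName on the StatefulSet would confirm this.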
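One possible source for this finding is the StatefulSet's spec.serviceName field, which names the Service that governs the set. The sketch below is a hypothetical re-creation of such a check, not k8sGPT's actual code, and the serviceName value in the example is an assumption that was not verified on this cluster.

```python
def missing_governing_service(statefulset: dict, service_names: list):
    """Return spec.serviceName if no Service with that name exists, else None."""
    name = statefulset.get("spec", {}).get("serviceName")
    if not name:
        return None
    return None if name in service_names else name


# Hypothetical StatefulSet excerpt; the serviceName value is assumed.
sts = {"spec": {"serviceName": "argocd-application-controller"}}
# Service names taken from the `k get svc` output above.
services = [
    "argocd-applicationset-controller", "argocd-dex-server", "argocd-metrics",
    "argocd-notifications-controller-metrics", "argocd-redis",
    "argocd-repo-server", "argocd-server", "argocd-server-metrics",
]
print(missing_governing_service(sts, services))
```

The real value could be read with kubectl get statefulset argocd-application-controller -n argocd -o jsonpath='{.spec.serviceName}'.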
Regarding the Backstage-related issue, it states that no target pods are set for the service backstage-front.
- apiVersion: core.k8sgpt.ai/v1alpha1
  kind: Result
  metadata:
    creationTimestamp: "2024-08-11T10:25:45Z"
    generation: 1
    labels:
      k8sgpts.k8sgpt.ai/backend: localai
      k8sgpts.k8sgpt.ai/name: k8sgpt-localai
      k8sgpts.k8sgpt.ai/namespace: k8sgpt-operator-system
    name: backstagebackstagefront
    namespace: k8sgpt-operator-system
    resourceVersion: "8013181"
    uid: b1d551d3-f078-4570-b82c-bc4ca92a57bb
  spec:
    backend: localai
    details: |-
      Error: The service with label app.kubernetes.io/name=backstage has no available endpoints.
      Solution: Check the service and endpoint configuration, ensure the deployment is running, and verify the service is correctly associated with the desired endpoints. If needed, update or recreate the service and endpoints accordingly.
    error:
    - sensitive:
      - masked: eVF1L3wsISNKUTBPVmJUWnkyLHN3fA==
        unmasked: app.kubernetes.io/name
      - masked: NmxoLDJ7UnRW
        unmasked: backstage
      text: Service has no endpoints, expected label app.kubernetes.io/name=backstage
    kind: Service
    name: backstage/backstage-front
    parentObject: ""
  status:
    lifecycle: historical
Looking at the resources, the Endpoints are indeed not set, so this is confirmed to be a valid finding.
$ k describe svc backstage-front
Name: backstage-front
Namespace: backstage
Labels: <none>
Annotations: <none>
Selector: app.kubernetes.io/name=backstage
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.110.88.138
IPs: 10.110.88.138
Port: <unset> 80/TCP
TargetPort: 7007/TCP
Endpoints: <none>
Session Affinity: None
Events: <none>
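A Service gets endpoints only when its selector matches the labels of at least one pod. As a minimal sketch of that rule (the pod names and labels in the example are hypothetical, and readiness is ignored):

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True if every key/value pair of the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def pods_for_service(selector: dict, pods: dict) -> list:
    """Return the names of pods whose labels satisfy the Service selector."""
    if not selector:
        # A Service without a selector does not select any pods by itself.
        return []
    return [name for name, labels in pods.items() if selector_matches(selector, labels)]


# Hypothetical pods: none carries app.kubernetes.io/name=backstage.
pods = {
    "backstage-abc": {"app": "backstage"},
    "nginx-1": {"app.kubernetes.io/name": "nginx"},
}
print(pods_for_service({"app.kubernetes.io/name": "backstage"}, pods))  # []
```

An empty result corresponds to the Endpoints: <none> line in the describe output.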
Since the explanations and proposed solutions are generated by the backend provider's AI service (LocalAI in this case), their quality is heavily influenced by the model. Even so, k8sGPT is capable of detecting not only resources in error states but also resources with configuration flaws like those mentioned above.
Resources supported for analysis in k8sGPT are called "Analyzers," and a list of supported Analyzers is available on GitHub.
Operator Architecture
While it is a very concise diagram, the operator's architecture is as follows.

Operator architecture diagram. Quoted from the documentation.
The K8sGPT deployment in the diagram corresponds to the K8sGPT pod, which communicates with the external LLM specified as the backend provider to analyze resources.
Then, the k8sGPT operator communicates with the K8sGPT pod and publishes the results as result custom resources based on the analysis findings.
Checking the logs of the operator pod, there is a line Creating new client for 10.108.64.236:8080, which corresponds to the IP address of the k8sgpt container within the target k8sGPT pod.
We can see that it creates identified issues such as argocdargocdapplicationcontroller as results based on the analysis results from the k8sGPT pod.
Even after a result is created, the operator keeps reconciling, periodically checking whether the identified issues have been resolved.
2024-08-11T10:19:02Z INFO Starting workers {"controller": "k8sgpt", "controllerGroup": "core.k8sgpt.ai", "controllerKind": "K8sGPT", "worker count": 1}
Finished Reconciling k8sGPT
Finished Reconciling k8sGPT
Creating new client for 10.108.64.236:8080
Connection established between 10.108.64.236:8080 and localhost with time out of 1 seconds.
Remote Address : 10.108.64.236:8080
K8sGPT address: 10.108.64.236:8080
Created result argocdargocdapplicationcontroller
Created result backstagesamplenginxsvc
Created result backstagebackstagefront
Created result defaultbrokenpod
Finished Reconciling k8sGPT
Creating new client for 10.108.64.236:8080
Connection established between 10.108.64.236:8080 and localhost with time out of 1 seconds.
Remote Address : 10.108.64.236:8080
K8sGPT address: 10.108.64.236:8080
Checking if argocdargocdapplicationcontroller is still relevant
Checking if backstagesamplenginxsvc is still relevant
Checking if backstagebackstagefront is still relevant
Checking if defaultbrokenpod is still relevant
Finished Reconciling k8sGPT
When an identified issue is fixed, no specific log stating it's resolved is output, but you can tell because Checking if [resource] is still relevant no longer appears.
# When defaultbrokenpod is fixed
Creating new client for 10.108.64.236:8080
Connection established between 10.108.64.236:8080 and localhost with time out of 1 seconds.
Remote Address : 10.108.64.236:8080
K8sGPT address: 10.108.64.236:8080
Checking if backstagebackstagefront is still relevant
Checking if argocdargocdapplicationcontroller is still relevant
Checking if backstagesamplenginxsvc is still relevant
Finished Reconciling k8sGPT
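Because a resolved issue is visible only as a missing log line, comparing two reconcile cycles makes the change explicit. The sketch below assumes the log wording shown above; the function names are hypothetical.

```python
import re

CHECK = re.compile(r"Checking if (\S+) is still relevant")


def checked_results(log_text: str) -> set:
    """Collect the result names that one reconcile cycle re-checked."""
    return set(CHECK.findall(log_text))


def resolved_between(before: str, after: str) -> set:
    """Results checked in the earlier cycle but absent from the later one."""
    return checked_results(before) - checked_results(after)
```

Applied to the two cycles shown above, this returns only defaultbrokenpod.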
Conclusion
k8sGPT is a relatively young project, so the documentation is quite concise and there are still few configuration options, but given the recent LLM boom, we can expect future functional enhancements.
One of the project's stated concepts is "Codified SRE knowledge knows what to search for." The big advantage is being able to use AI for SRE tasks that require knowledge and experience, such as Kubernetes troubleshooting, triage, and vulnerability assessment.
By the way, there is also a blog post on CNCF.