K8s Cluster Tracing with OpenTelemetry and Jaeger
Overview
The three pillars of Observability consist of the following elements:
- Logging
- Metrics
- Tracing
In previous articles, I covered Logging and Metrics.
This time, I will focus on the third one, Tracing, and implement a method to collect tracing data from components running on a k8s cluster.
Tracing often seems less widely adopted than Logging or Metrics (a purely subjective impression), but it has become increasingly important for operating the microservice architectures that are mainstream today.
As described in the Elasticsearch documentation, introducing Tracing into an existing system offers the following benefits:
Latency tracking
A single user request or transaction travels through various services in different runtime environments. Understanding the latency of each service for a specific request is essential for understanding the performance characteristics of the system as a whole and for gaining insight into potential improvements.
Root cause analysis
Root cause analysis is an even greater challenge for applications built on large ecosystems of microservices. You never know what problem will occur, in which service, or when. Distributed tracing is crucial when debugging problems in such systems.
When building a Tracing collection infrastructure with OSS products, a configuration using OpenTelemetry collector (hereafter referred to as otel collector) and Jaeger for collection and visualization is mainstream. Therefore, we will install these components into a k8s cluster this time.
Otel collector
To collect and process tracing data from applications on the cluster, we use the otel collector.
Among the several methods described in the documentation, we will install the collector using the method that uses the Operator.
First, we will create a tracing namespace to isolate the series of components installed in this task.
kubectl create namespace tracing
The documentation describes applying the otel operator manifest published in the GitHub releases to the cluster. However, by default, it is configured to create a namespace called opentelemetry-operator-system and define CRDs etc. there. To change the installation destination to the tracing namespace, download the manifest locally first, replace occurrences of namespace: opentelemetry-operator-system with tracing using an editor, and then apply it.
# Download locally
wget https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml -O opentelemetry-operator.yaml
# Edit with an editor etc.
# Apply
kubectl apply -f opentelemetry-operator.yaml
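If you prefer to script the replacement instead of editing by hand, a sed substitution works. The snippet below demonstrates it on a small stand-in file (a sketch: run the same command against the downloaded opentelemetry-operator.yaml, and review the resulting diff before applying, since the namespace string also appears in RBAC and webhook definitions that must stay consistent).

```shell
# Demonstrate the namespace substitution on a small stand-in manifest
cat > /tmp/otel-operator-sample.yaml <<'EOF'
metadata:
  name: opentelemetry-operator-controller-manager
  namespace: opentelemetry-operator-system
EOF

# Rewrite every occurrence of the default namespace to "tracing"
sed -i 's/namespace: opentelemetry-operator-system/namespace: tracing/g' \
  /tmp/otel-operator-sample.yaml

cat /tmp/otel-operator-sample.yaml
```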
You can deploy the otel collector pod to the cluster by creating an otel collector instance, a CR of kind: OpenTelemetryCollector.
An example manifest is provided in the documentation mentioned above, but basically, you describe the settings for the collector's receivers, processors, and exporters under spec.config. This becomes the config file used inside the otel collector, and it is created as a configmap and mounted to the pod during deployment.
Also, Jaeger, which is specified as the tracing destination, will be created in the next step.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: tracing
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      batch:
        send_batch_size: 10000
        timeout: 10s
    exporters:
      # NOTE: Prior to v0.86.0 use `logging` instead of `debug`.
      logging:
        verbosity: detailed
      otlp:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true
      otlphttp:
        endpoint: http://jaeger-collector:4318
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, otlp, otlphttp]
When deployed, the pods and services corresponding to the collector are created.
NAME READY STATUS RESTARTS AGE
pod/otel-collector-collector-768f67fcfc-f2whp 1/1 Running 0 3d7h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/otel-collector-collector ClusterIP 10.105.119.177 <none> 4317/TCP,4318/TCP 10d
service/otel-collector-collector-headless ClusterIP None <none> 4317/TCP,4318/TCP 10d
service/otel-collector-collector-monitoring ClusterIP 10.103.168.241 <none> 8888/TCP 10d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/otel-collector-collector 1/1 1 1 10d
NAME DESIRED CURRENT READY AGE
replicaset.apps/otel-collector-collector-768f67fcfc 1 1 1 10d
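To verify that the receiver works before wiring up any applications, you can hand-craft a minimal OTLP/JSON payload and POST it to the collector's HTTP port. The payload below is a sketch with placeholder trace and span IDs, and the curl target assumes the service name and namespace created above.

```shell
# Minimal OTLP/JSON trace payload (placeholder hex IDs, not a real recording)
cat > /tmp/test-span.json <<'EOF'
{
  "resourceSpans": [{
    "resource": {
      "attributes": [{
        "key": "service.name",
        "value": { "stringValue": "smoke-test" }
      }]
    },
    "scopeSpans": [{
      "spans": [{
        "traceId": "5b8aa5a2d2c872e8321cf37308d69df2",
        "spanId": "051581bf3cb55c13",
        "name": "smoke-test-span",
        "kind": 1,
        "startTimeUnixNano": "1700000000000000000",
        "endTimeUnixNano": "1700000001000000000"
      }]
    }]
  }]
}
EOF

# From a pod in the cluster (or via
#   kubectl -n tracing port-forward svc/otel-collector-collector 4318:4318 ):
#
#   curl -X POST -H 'Content-Type: application/json' \
#     --data @/tmp/test-span.json \
#     http://otel-collector-collector.tracing:4318/v1/traces
```

With the logging exporter enabled, the span should then appear in the collector pod's logs (kubectl -n tracing logs deploy/otel-collector-collector).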
Jaeger
Since the otel collector only receives and processes data sent from applications and forwards it to other backends, it cannot visualize tracing data on its own.
There are several backends for visualizing tracing, and here we will use Jaeger.
Installing the Operator
Similar to the otel collector, we first install the operator and then install the Jaeger instance itself.
For installation, follow the Installing the Operator on Kubernetes section in the documentation and apply the latest manifest found in the GitHub releases to the cluster.
Creating an Instance
After installing the operator, creating a Jaeger instance resource of kind: Jaeger brings up the components necessary for Jaeger's operation.
A Jaeger instance has a property called Deployment strategy, and the available ones are as follows:
- AllInOne: Runs all components in a single pod. Collected tracing data is stored within the pod and is not persisted.
- Production: Persists data by storing it in an external storage backend.
- Streaming: In addition to the Production setup, places streaming capabilities like Kafka between Jaeger and the storage backend. Effective for large-scale environments.
AllInOne is only suitable for development purposes as the data is not persisted.
Here, we will create the instance with the production strategy so that the collected tracing data is stored in an external Elasticsearch.
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: tracing
spec:
  strategy: production
  ingress:
    enabled: false
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elastic.centre.com:9201
        index-prefix: tracing
Once the installation is complete, services like the following are created:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
jaeger-agent ClusterIP None <none> 5775/UDP,5778/TCP,6831/UDP,6832/UDP,14271/TCP 3d
jaeger-collector ClusterIP 10.105.212.124 <none> 9411/TCP,14250/TCP,14267/TCP,14268/TCP,14269/TCP,4317/TCP,4318/TCP 3d
jaeger-collector-headless ClusterIP None <none> 9411/TCP,14250/TCP,14267/TCP,14268/TCP,14269/TCP,4317/TCP,4318/TCP 3d
jaeger-query ClusterIP 10.98.137.111 <none> 16686/TCP,16685/TCP,16687/TCP 3d
Each service serves as an endpoint for one of the components running inside the pods.
- jaeger-agent: This is deprecated and overlaps in functionality with the otel-collector, so it is generally not used.
- jaeger-collector: Used as the endpoint when the otel collector sends the tracing data it has collected and processed.
- jaeger-query: Used when searching stored data via the web UI, etc.
Additionally, you can view the Jaeger web UI by accessing jaeger-query port 16686. Here, you can view the traces received by the Jaeger collector and stored in the backend storage.
Integration with Grafana
Since Grafana supports Jaeger by default, you can refer to Jaeger traces from the Grafana UI by specifying the jaeger-query URL as a datasource.
The data you can check is almost the same as in Jaeger, so if you prefer the Grafana UI, it might be good to view it there.

Viewing a trace in Grafana. Similar to the Jaeger UI, you can search by service, operation, and tag, and check traces and spans.
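If you manage Grafana declaratively, the Jaeger datasource can also be added via a provisioning file rather than through the UI. A minimal sketch, assuming Grafana runs in the same cluster and can resolve the jaeger-query service:

```yaml
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger-query.tracing.svc.cluster.local:16686
```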
Collecting Tracing
Now that the infrastructure for collecting and visualizing tracing is in place, let's collect tracing data from applications.
For custom applications, you can send tracing data to the otel collector primarily using the following methods:
- Write code within the application to generate tracing data using the OpenTelemetry SDK.
- Use the Auto-instrumentation feature of the otel operator.
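For the second method, the otel operator we installed earlier provides an Instrumentation CR: you define the export destination once, and workloads opt in via a pod annotation. A minimal sketch (the endpoint assumes the collector service created above; the annotation key depends on the application's language, e.g. inject-java, inject-python):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: tracing
spec:
  exporter:
    endpoint: http://otel-collector-collector.tracing:4317
```

A workload then opts in with an annotation on its pod template, for example instrumentation.opentelemetry.io/inject-java: "tracing/auto-instrumentation".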
On the other hand, when considering collecting tracing data from products other than your own (such as OSS products), the component must first support tracing.
If it is supported, instructions for sending tracing data to the otel collector or other backends are usually provided, so you would proceed with the configuration accordingly.
As an example, we will collect and visualize tracing data from several OSS components used in previous articles to see what kind of data can be retrieved.
Thanos
Thanos was used in combination with Prometheus in the article Collecting k8s cluster metrics. Thanos supports tracing data collection, allowing you to visualize internal API execution history when running queries to collect metrics.
To enable tracing, specify the settings for the otel collector as the tracing data destination in the runtime arguments for thanos query and thanos store. Here, we will add settings to the deployment manifests for store and query to send data to the otel collector running on the same cluster and then redeploy.
spec:
  template:
    spec:
      containers:
      - args:
        - --http-address=0.0.0.0:19192
        ...
+       - |
+         --tracing.config=type: OTLP
+         config:
+           client_type: grpc
+           service_name: thanos-query
+           project_id: myproject
+           sample_factor: 16
+           insecure: true
+           endpoint: otel-collector-collector.tracing:4317
spec:
  template:
    spec:
      containers:
      - args:
        - store
        ...
+       - |
+         --tracing.config=type: OTLP
+         config:
+           client_type: grpc
+           service_name: thanos-store
+           project_id: myproject
+           sample_factor: 16
+           insecure: true
+           endpoint: otel-collector-collector.tracing:4317
After deployment, various tracing data related to thanos-query and thanos-store is recorded in Jaeger via the otel collector. In the previous article, we visualized cluster metrics using Grafana; behind the scenes, thanos query executes PromQL against sidecars and stores to retrieve the metrics, so you can inspect those execution records and how long each query took. For example, when displaying specific metrics over a given time range, the query_range API is executed, and its record can be checked in the Jaeger UI.

Execution record of query_range. You can check the time taken for execution, etc.
Additionally, you can check the content of the query actually executed from the span details.

The content of the executed promQL can be verified.
Tekton
Tekton, a cloud-native CI/CD platform, was used in the article Building and testing a simple CI/CD environment in a local environment - Tekton edition.
Tekton added support for tracing configuration in v0.52.0, released in September 2023.
🔨 Add configmap for tracing config (#6897)
Tracing endpoint configuration is now moved from environment variable to the configmap config-tracing. Tracing can be now configured dynamically without needing to restart the controller. Refer the example configuration provided as part of the ConfigMap for the configuration options and format.
With Tekton tracing, you can visualize data such as how much time each task's processing takes when a task or pipeline is executed.
Tracing Configuration
The tracing settings are described on GitHub, but they do not yet appear to be summarized in the documentation. Currently, Tekton only supports sending to the Jaeger collector in Thrift format; it cannot send to the otel collector via OTLP.
When you install Tekton pipelines, a ConfigMap named config-tracing is created. Adding the tracing destination endpoint to this will enable it.
Here, we specify the thrift endpoint of the Jaeger collector on the same cluster (port 14268).
apiVersion: v1
data:
+  enabled: "true"
+  endpoint: http://jaeger-collector.tracing.svc.cluster.local:14268/api/traces
kind: ConfigMap
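For reference, the full ConfigMap after the change would look like the following (a sketch; with a default installation, config-tracing lives in the tekton-pipelines namespace):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-tracing
  namespace: tekton-pipelines
data:
  enabled: "true"
  endpoint: http://jaeger-collector.tracing.svc.cluster.local:14268/api/traces
```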
After configuration, tracing data will be collected under the following service names:
- taskrun-reconciler
- pipelinerun-reconciler
Verification
There are various operations you could trace, but this time we will look at what happens when executing the pipeline created previously.
The pipeline created in the previous article consists of the following two tasks:
- Pull source code containing a Dockerfile from GitLab.
- Build an image from the Dockerfile and push it to the registry.
The time taken when executing this pipeline can also be checked with the Tekton CLI. In this execution, it took 4 minutes and 29 seconds.
$ tkn pipelinerun list -A
NAMESPACE NAME STARTED DURATION STATUS
workspace-tekton build-image 1 hour ago 4m29s Succeeded
On the other hand, in the Jaeger UI, you can check the API execution records within this pipeline as spans. The pipeline duration confirmed here is 4 minutes and 31 seconds, which differs slightly from the above but is roughly the same.

Looking into the trace, you can find two createTaskRun records. These correspond to the APIs for creating TaskRun objects that execute each task in the pipeline, and you can check the execution time at the "Start time".
The first task starts 20.8 msec after the pipeline execution is triggered, and the second one starts at 10.61 sec.


From this, we can see that the first task completed in about 10 seconds. Meanwhile, since the total execution time of the pipeline was 4 minutes and 31 seconds, we can determine that the second task took about 4 minutes to complete. We can tell from the tracing data that the first task completed quickly as it just involved git clone, but building the image in the second task took more time.
For recent execution records, you can also check from pod execution logs or events. However, tracing data can be utilized to identify which parts of the internal API are taking time, or to compare past execution records and verify which parts are increasing in execution time when the overall execution time grows over the course of operation.
About the Architecture
Finally, let's take a look at the architecture using the otel collector and Jaeger.
The architecture for collecting tracing data in a k8s cluster is shown in a clear diagram in the Jaeger documentation.

Cited from https://www.jaegertracing.io/docs/1.50/architecture/#with-opentelemetry-collector
The difference between these two architectures is that the one on the left is a configuration where the otel collector is injected as a sidecar into the same pod as the application, while the one on the right is a configuration where the otel collector and Jaeger collector are separated from the application. In this article, since we have deployed the otel collector and Jaeger collector as separate pods via operators, it follows the architecture on the right. The pros and cons of each configuration are described in the documentation mentioned above.
Conclusion
We have built a tracing data collection infrastructure for a k8s cluster by installing the otel collector and Jaeger. While tracing may not be immediately beneficial upon implementation, it becomes essential for long-term operations—especially in environments with numerous distributed components like microservice architectures—for identifying and resolving bottlenecks.