Distributed Tracing: Prerequisite Knowledge

Overview

This article summarizes the background knowledge you need before working with distributed tracing.

What is Distributed Tracing?

Distributed tracing is a technique used mainly in distributed architectures such as microservices.
In such an architecture, serving a single request often means calling several microservices behind the scenes before a response is returned. This makes it hard to identify the root cause of a failure and hard to analyze the system as a whole.
With distributed tracing, each request is assigned a unique trace ID. The trace ID makes it possible to visualize information (traces) such as which microservices handled the request and where processing time was spent.

Architecture of Distributed Tracing

Distributed tracing requires a mechanism by which each microservice sends logs and those logs are visualized. This mechanism is the distributed tracing provider.
It consists of the components listed below.
The exact configuration differs from provider to provider, so check each provider's official documentation.

  • Reporter that sends logs
    • Depending on the distributed tracing provider, the reporter may be provided as an SDK for each language
    • In an Anthos environment, istio-proxy is responsible for this
  • Storage that saves logs
    • In the case of Cloud Trace, Cloud Logging is used
  • Interface that visualizes traces (GUI)
    • In the case of Cloud Trace, this refers to the Cloud Trace console screen

Logs flow from the reporter to the storage, and the trace information built from them is displayed through the visualization interface.
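The relationship between the three components can be sketched in a few lines of Python. This is only an illustrative in-memory model, not how any real provider is implemented: the names LogRecord, InMemoryStorage, Reporter, and render are all hypothetical, and the trace ID "abc123" is made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LogRecord:
    trace_id: str       # ID shared by every log that belongs to one request
    service: str        # microservice that emitted the log
    duration_ms: float  # time spent handling the call


class InMemoryStorage:
    """Stands in for the storage component; groups logs by trace ID."""

    def __init__(self) -> None:
        self._by_trace = defaultdict(list)

    def save(self, record: LogRecord) -> None:
        self._by_trace[record.trace_id].append(record)

    def find(self, trace_id: str) -> list:
        return list(self._by_trace.get(trace_id, []))


class Reporter:
    """Stands in for the reporter (an SDK or a sidecar proxy)."""

    def __init__(self, service: str, storage: InMemoryStorage) -> None:
        self.service = service
        self.storage = storage

    def report(self, trace_id: str, duration_ms: float) -> None:
        self.storage.save(LogRecord(trace_id, self.service, duration_ms))


def render(storage: InMemoryStorage, trace_id: str) -> str:
    """Stands in for the GUI: one text line per log in the trace."""
    return "\n".join(
        f"{r.service}: {r.duration_ms:.1f} ms" for r in storage.find(trace_id)
    )


storage = InMemoryStorage()
Reporter("polaris", storage).report("abc123", 42.0)
Reporter("aldebaran", storage).report("abc123", 17.5)
print(render(storage, "abc123"))
```

The point of the sketch is that the storage is keyed by trace ID, so the interface can pull together logs that were reported independently by different services.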

Distributed Tracing Providers

Providers for achieving distributed tracing include Zipkin, Jaeger, Cloud Trace, and Grafana Tempo.

https://landscape.cncf.io/card-mode?category=tracing&grouping=category

Components of Trace Information in Distributed Tracing

This section describes what information a single trace in distributed tracing consists of.

Trace

In distributed tracing, an entire single request is defined as a trace.
One trace is generated for each request.

Span

Each individual call made internally while serving a single request is defined as a span.
Each span corresponds to one log, and these logs are collected from the reporter of each application.

Consider the following concrete example.
In this sample application, polaris acts as a gateway and calls aldebaran, aquarius, and http://google.com in series.
aldebaran in turn calls http://google.com, and aquarius calls sadalmelik behind it.
Here, the request to polaris.polaris.svc.cluster.local:8080/* (the endpoint the user reaches via Ingress) sits at the top of the call hierarchy and corresponds to a single trace, tied to one request.
The logs for each backend microservice called from polaris are the spans.

Note: Some spans are partially omitted.
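The call hierarchy described in this example can be modeled as a tree of spans under one trace. The sketch below is an assumption-laden illustration: the Span and Trace classes and the render_tree helper are hypothetical, the trace ID is a placeholder, and real spans also carry timestamps, span IDs, and parent IDs that are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str  # service or destination handling one internal call
    children: list = field(default_factory=list)  # calls made from this span


@dataclass
class Trace:
    trace_id: str  # shared by every span of the request
    root: Span     # the call made by the user


def render_tree(span: Span, depth: int = 0) -> list:
    """Return one indented line per span, parents before children."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render_tree(child, depth + 1))
    return lines


# The sample application from the text: polaris is the gateway.
trace = Trace(
    trace_id="hypothetical-trace-id",
    root=Span("polaris", [
        Span("aldebaran", [Span("google.com")]),
        Span("aquarius", [Span("sadalmelik")]),
        Span("google.com"),
    ]),
)
print("\n".join(render_tree(trace.root)))
```

Every span in the tree belongs to the same trace, which is why a shared trace ID is enough to reassemble the hierarchy later.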

Assignment of Trace IDs in Distributed Tracing

In distributed tracing, to manage which request each microservice's access log is associated with, a trace ID must be linked to each log. Therefore, the reporter for each distributed tracing provider needs to issue trace IDs.
The following describes the assignment of trace IDs in istio-proxy, which acts as the reporter in Anthos Service Mesh.
If a trace ID is not present when istio-proxy receives a request, it assigns a trace ID. If a trace ID is already present, it uses that trace ID as is. After assigning the trace ID, istio-proxy sends the log to the storage.
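The "assign if missing, otherwise reuse" rule can be sketched as follows. This is not istio-proxy's actual code: the header name x-trace-id is only a placeholder, ensure_trace_id is a hypothetical helper, and the 32-character hexadecimal ID format is an assumption chosen for illustration.

```python
import secrets

TRACE_HEADER = "x-trace-id"  # placeholder; each provider defines its own header


def ensure_trace_id(headers: dict) -> dict:
    """Return a copy of the headers that is guaranteed to carry a trace ID."""
    result = {key.lower(): value for key, value in headers.items()}
    if not result.get(TRACE_HEADER):
        # No trace ID arrived with the request, so this hop starts a new trace.
        result[TRACE_HEADER] = secrets.token_hex(16)  # 32 hex characters
    return result


print(ensure_trace_id({}))                     # a new ID is generated
print(ensure_trace_id({"X-Trace-Id": "abc"}))  # the existing ID is kept
```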

Responsibilities of Distributed Tracing Providers and Applications

Responsibility of the Distributed Tracing Provider

The distributed tracing provider is responsible for assigning new trace IDs, storing logs and trace information emitted by reporters, and visualizing them.
It visualizes traces associated with a specific request based on the trace ID.
The header that carries the trace ID assigned by the reporter (e.g., x-trace-id) differs from provider to provider, so use the identifier defined by the provider you have chosen and refer to its documentation.
In the case of Cloud Trace, x-cloud-trace-context is used as the identifier. [1]
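As an illustration of what such a header contains, the sketch below parses an x-cloud-trace-context value. It assumes the format TRACE_ID/SPAN_ID;o=OPTIONS, with a 32-character hexadecimal trace ID, a decimal span ID, and o=1 meaning the request is traced, which is how I understand the Google Cloud documentation; verify it there before relying on it. The function and class names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed format: TRACE_ID/SPAN_ID;o=OPTIONS (span ID and options are optional).
_PATTERN = re.compile(r"^([0-9a-fA-F]{32})(?:/(\d+))?(?:;o=(\d))?$")


class CloudTraceContext(NamedTuple):
    trace_id: str
    span_id: Optional[int]
    sampled: Optional[bool]


def parse_cloud_trace_context(value: str) -> Optional[CloudTraceContext]:
    """Parse the header value, returning None when it does not match."""
    match = _PATTERN.match(value.strip())
    if match is None:
        return None
    trace_id, span_id, option = match.groups()
    return CloudTraceContext(
        trace_id=trace_id.lower(),
        span_id=int(span_id) if span_id is not None else None,
        sampled=(option == "1") if option is not None else None,
    )


print(parse_cloud_trace_context("105445aa7843bc8bf206b12000100000/1;o=1"))
```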

Responsibility of the Application

The reporter cannot correlate the incoming connection request to an application with the outgoing requests issued from that application.
When an application calls another application, it must transfer the trace ID from the incoming request into the headers of the outgoing call to propagate the trace ID. If this propagation is not performed, a new trace ID will be issued at the destination application, and the data will be registered as a separate trace.
Since the actual header identifier to be transferred varies depending on the distributed tracing provider, you must transfer the header used by the specific provider during implementation.
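A minimal sketch of this propagation step is shown below. The header list reflects my recollection of the headers the Istio documentation asks applications to forward; treat it as an assumption and confirm it against your provider's documentation. The helper name outgoing_trace_headers is hypothetical.

```python
# Headers to copy from the incoming request to outgoing calls.
# Assumed list; trim it to the headers your provider actually uses.
PROPAGATED_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-cloud-trace-context",
    "traceparent",
)


def outgoing_trace_headers(incoming: dict) -> dict:
    """Pick only the trace-related headers out of the incoming request."""
    lowered = {key.lower(): value for key, value in incoming.items()}
    return {name: lowered[name] for name in PROPAGATED_HEADERS if name in lowered}


incoming = {"X-Cloud-Trace-Context": "abc/1;o=1", "Cookie": "session=secret"}
# Pass the result as the headers of every call made while serving the request.
print(outgoing_trace_headers(incoming))
```

Copying an explicit allow-list, rather than forwarding every incoming header, avoids leaking unrelated headers such as cookies to downstream services.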

Others

Sampling Algorithms

Many distributed tracing solutions are equipped with a sampling function to visualize only a certain percentage of requests as traces.
istio-proxy (Envoy) performs head-based sampling.
With head-based sampling, the sampling decision is made once, at the first hop (the head) of the request. If a request already carries trace information when Envoy receives it, Envoy does not make its own sampling decision and instead respects the decision made by the caller.
This prevents a trace that was started at the source from being cut off midway because a downstream service decided not to sample it.
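The decision rule can be sketched as follows. This is a simplified model rather than Envoy's implementation: should_trace is a hypothetical function, upstream_decision stands for whatever sampling flag arrived with the request (None when there is none), and sample_rate is an assumed fraction between 0 and 1.

```python
import random
from typing import Callable, Optional


def should_trace(
    upstream_decision: Optional[bool],
    sample_rate: float,
    roll: Callable[[], float] = random.random,
) -> bool:
    """Head-based sampling: only the first hop rolls the dice."""
    if upstream_decision is not None:
        # A caller already decided, so follow that decision as is.
        return upstream_decision
    # No decision arrived with the request: this hop is the head.
    return roll() < sample_rate


print(should_trace(True, 0.01))   # traced upstream, so still traced here
print(should_trace(False, 1.0))   # not traced upstream, so not traced here
print(should_trace(None, 0.01))   # head of the request: sampled at about 1%
```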

How to Use Grafana Tempo as a Distributed Tracing Provider

Grafana Tempo can be used as a distributed tracing provider for Anthos by setting the istio-proxy tracing option to Zipkin.

https://grafana.com/blog/2021/08/31/how-istio-tempo-and-loki-speed-up-debugging-for-microservices/

  • Note: Since the trace ID identifiers used by distributed tracing providers differ between Cloud Trace and Grafana Tempo, the middleware implementation on the application side also needs to be switched.

  • Note: Multiple distributed tracing providers cannot be used simultaneously.

Footnotes
  1. https://istio.io/latest/docs/tasks/observability/distributed-tracing/overview/#trace-context-propagation
