
Visualizing Metrics and Trace Data with Cloud Run Sidecars


Introduction

Recently, sidecar container support was added to Cloud Run.
https://cloud.google.com/blog/products/serverless/cloud-run-now-supports-multi-container-deployments?hl=en

Previously, Cloud Run could only run a single container per service, but now you can deploy multiple containers.
This makes it possible to run Envoy as a sidecar, or to place an OpenTelemetry Collector alongside the application and offload distributed tracing configuration from the application code.

Goal

Using this mechanism, this article walks through sending Prometheus-compatible metrics and trace information from a Cloud Run application through a sidecar OpenTelemetry Collector, visualizing them in Victoria Metrics / Grafana Tempo locally and in Cloud Monitoring / Cloud Trace on Cloud Run.

The final architecture is expected to look like this:

The code used for this verification is available in the following repository:
https://github.com/tetsuya28/cloud-run-multiple-containers-observability

Setup

Preparation

First, we add the implementation on the application side to expose Prometheus metrics and emit trace information.

I am using a sample application built with Echo.
https://echo.labstack.com/

Since the Echo implementation itself is out of scope this time, I will only provide a brief excerpt.

Metrics

Echo provides middleware for exposing Prometheus metrics, so we will use that.
https://echo.labstack.com/docs/middleware/prometheus

import (
	"github.com/labstack/echo-contrib/echoprometheus"
	"github.com/labstack/echo/v4"
)

// DefaultComponentName is used as the metrics subsystem name.
const DefaultComponentName = "app"

func main() {
	e := echo.New()
	e.Use(echoprometheus.NewMiddleware(DefaultComponentName))
	e.GET("/metrics", echoprometheus.NewHandler())
	e.Logger.Fatal(e.Start(":8080"))
}

By adding the implementation above, you can output HTTP-related metrics.

# HELP app_request_duration_seconds The HTTP request latencies in seconds.
# TYPE app_request_duration_seconds histogram
app_request_duration_seconds_bucket{code="200",host="app:8080",method="GET",url="/metrics",le="0.005"} 0
app_request_duration_seconds_bucket{code="200",host="app:8080",method="GET",url="/metrics",le="0.01"} 4
app_request_duration_seconds_bucket{code="200",host="app:8080",method="GET",url="/metrics",le="0.025"} 35
app_request_duration_seconds_bucket{code="200",host="app:8080",method="GET",url="/metrics",le="0.05"} 36
app_request_duration_seconds_bucket{code="200",host="app:8080",method="GET",url="/metrics",le="0.1"} 37
app_request_duration_seconds_bucket{code="200",host="app:8080",method="GET",url="/metrics",le="0.25"} 37
app_request_duration_seconds_bucket{code="200",host="app:8080",method="GET",url="/metrics",le="0.5"} 37
app_request_duration_seconds_bucket{code="200",host="app:8080",method="GET",url="/metrics",le="1"} 37
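These buckets are cumulative (le means "less than or equal"), so the count for a single latency band is the difference between adjacent buckets. A quick sketch of that arithmetic using the values above:

```go
package main

import "fmt"

// bandCounts converts cumulative histogram bucket counts into per-band
// counts by subtracting each bucket from the next one.
func bandCounts(cumulative []int) []int {
	out := make([]int, len(cumulative))
	prev := 0
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

func main() {
	// Cumulative counts for le = 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1
	// taken from the /metrics output above.
	cumulative := []int{0, 4, 35, 36, 37, 37, 37, 37}
	fmt.Println(bandCounts(cumulative)) // prints [0 4 31 1 1 0 0 0]
}
```

For example, 35 - 4 = 31 requests completed in the (10ms, 25ms] band.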

Trace Information

Tracing is implemented with OpenTelemetry.
In this setup, the exporter sends spans to the OpenTelemetry Collector running as a sidecar.
The OpenTelemetry Collector can receive requests over both HTTP and gRPC; here we use gRPC.

Create an exporter as follows:

func NewExporter(ctx context.Context, cfg *config.Config) (sdktrace.SpanExporter, error) {
	client := otlptracegrpc.NewClient(
		otlptracegrpc.WithInsecure(),
		// Specify the OpenTelemetry Collector endpoint here
		// When using docker-compose locally, it is otel:4317
		otlptracegrpc.WithEndpoint(cfg.OtelCollectorEndpoint),
		otlptracegrpc.WithDialOption(grpc.WithBlock()),
	)

	exporter, err := otlptrace.New(ctx, client)
	if err != nil {
		return nil, err
	}

	return exporter, nil
}

Create a trace provider using the created exporter.

	r := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String(DefaultComponentName),
	)

	traceProvider := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
		sdktrace.WithResource(r),
	)

	otel.SetTracerProvider(traceProvider)

Obtain a tracer from the registered provider and call it inside any handler to generate spans.

// tracer is obtained from the global provider set above, e.g.:
// var tracer = otel.Tracer(DefaultComponentName)
func home(c echo.Context) error {
	_, span := tracer.Start(c.Request().Context(), "home")
	defer span.End()
	return c.JSON(http.StatusOK, nil)
}

OpenTelemetry Collector

We will build the OpenTelemetry Collector as a component to receive metrics and trace information.
https://opentelemetry.io/docs/collector/

Local

This describes the OpenTelemetry Collector configuration for running locally.

config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
  prometheus:
    config:
      global:
        external_labels: {}
      scrape_configs:
        - job_name: cloud-run-otel
          scrape_interval: 10s
          static_configs:
            - targets:
                - localhost:8888
        - job_name: cloud-run
          scrape_interval: 10s
          metrics_path: /metrics
          static_configs:
            - targets:
                - app:8080
exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write
  otlp:
    endpoint: http://tempo:4317
    tls:
      insecure: true
service:
  telemetry:
    logs:
      level: WARN
      encoding: json
  extensions:
    - health_check
  pipelines:
    metrics:
      receivers:
        - prometheus
      exporters:
        - prometheusremotewrite
    traces:
      receivers:
        - otlp
      exporters:
        - otlp
extensions:
  health_check: null

Cloud Run

This describes the OpenTelemetry Collector configuration for running on Cloud Run.

config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
  prometheus:
    config:
      global:
        external_labels:
          service: ${K_SERVICE}
          revision: ${K_REVISION}
      scrape_configs:
        - job_name: cloud-run-otel
          scrape_interval: 10s
          static_configs:
            - targets:
                - localhost:8888
        - job_name: cloud-run
          scrape_interval: 10s
          metrics_path: /metrics
          static_configs:
            - targets:
                - localhost:8080
exporters:
  googlemanagedprometheus:
    project: ${PROJECT_ID}
  googlecloud:
    trace:
      endpoint: cloudtrace.googleapis.com:443
service:
  telemetry:
    logs:
      level: WARN
      encoding: json
  extensions:
    - health_check
  pipelines:
    metrics:
      receivers:
        - prometheus
      processors:
        - batch
        - resourcedetection
        - resource
      exporters:
        - googlemanagedprometheus
    traces:
      receivers:
        - otlp
      exporters:
        - googlecloud
processors:
  batch:
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 5s
  resourcedetection:
    detectors:
      - env
      - gcp
  resource:
    attributes:
      - key: service.name
        value: ${K_SERVICE}
        action: upsert
      - key: service.instance.id
        from_attribute: faas.id
        action: insert
extensions:
  health_check: null

Configuration Details

First, we describe the receiver configuration for the OpenTelemetry Collector to receive metrics and tracing information. As mentioned earlier, we define settings to receive tracing information via gRPC.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

For metrics, we use a configuration similar to Prometheus settings to scrape /metrics from the application container. To make monitoring easier during service operation, we add the Cloud Run reserved environment variables K_SERVICE and K_REVISION as labels to all metrics using global.external_labels, allowing us to distinguish metrics by Cloud Run revision.

receivers:
  prometheus:
    config:
      global:
        external_labels:
          # Adding Cloud Run revision information as metric labels
          service: ${K_SERVICE}
          revision: ${K_REVISION}
      scrape_configs:
        - job_name: cloud-run-otel
          scrape_interval: 10s
          static_configs:
            - targets:
                - localhost:8888
        - job_name: cloud-run
          scrape_interval: 10s
          metrics_path: /metrics
          static_configs:
            - targets:
                - app:8080

When running on Cloud Run, the application container and the OpenTelemetry Collector share the same network namespace and communicate over localhost, so static_configs must be changed as follows.

          static_configs:
            - targets:
                - localhost:8080

Next, we describe the exporter settings for sending the metrics and trace information received by the OpenTelemetry Collector to the actual data stores.
Locally, we use Victoria Metrics as the data store for metrics.
https://victoriametrics.com/

exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write

When running on Cloud Run, we use Google-managed Prometheus, so we use the following configuration for the exporter.
*Note: PROJECT_ID is set as an environment variable in the OpenTelemetry Collector container on Cloud Run.

exporters:
  googlemanagedprometheus:
    project: ${PROJECT_ID}

For local trace information, we use Grafana Tempo.
https://grafana.com/oss/tempo/

exporters:
  otlp:
    endpoint: http://tempo:4317
    tls:
      insecure: true

When running on Cloud Run, we use Cloud Trace, so we use the following configuration for the exporter.

exporters:
  googlecloud:
    trace:
      endpoint: cloudtrace.googleapis.com:443

Then, we describe the local service configuration that wires these receivers and exporters into pipelines.

service:
  telemetry:
    logs:
      level: WARN
      encoding: json
  extensions:
    - health_check
  pipelines:
    metrics:
      receivers:
        - prometheus
      exporters:
        - prometheusremotewrite
    traces:
      receivers:
        - otlp
      exporters:
        - otlp
extensions:
  health_check: null

For Cloud Run, we use the following configuration.

service:
  telemetry:
    logs:
      level: WARN
      encoding: json
  extensions:
    - health_check
  pipelines:
    metrics:
      receivers:
        - prometheus
      processors:
        - batch
        - resourcedetection
        - resource
      exporters:
        - googlemanagedprometheus
    traces:
      receivers:
        - otlp
      exporters:
        - googlecloud

Additionally, for Cloud Run, we add processor settings that automatically attach the resource attributes required by Google-managed Prometheus.
For resourcedetection, see the following page:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/resourcedetectionprocessor/README.md#google-cloud-run-services-metadata

processors:
  batch:
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 5s
  resourcedetection:
    detectors:
      - env
      - gcp
  resource:
    attributes:
      - key: service.name
        value: ${K_SERVICE}
        action: upsert
      - key: service.instance.id
        from_attribute: faas.id
        action: insert

Data Stores

We won't go into the details of Victoria Metrics or Tempo configurations as they are out of scope for this article, but the settings for running them locally are provided in the GitHub repository if you are interested.

Verification

Verification Locally

I have created a Docker Compose environment so that everything described above can be run locally. Let's start it up and verify the behavior.

Use the following command to start the environment.

docker compose up -d

Once it is up, access Grafana at localhost:3000.
The default login credentials are admin for both the username and password.

After logging in, verify the metrics and trace information in Explore.

For metrics, with the data source set to VictoriaMetrics, you can confirm that Echo's metrics are being visualized by running a query like the one below.

promhttp_metric_handler_requests_total

For trace information, you can check the trace information emitted from the application by setting the data source to Tempo.

Verification on Cloud Run

Now that we have completed the operational check locally, I will describe the settings and configuration for actually running it on Cloud Run.

The Cloud Run settings use a Knative manifest.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: cloud-run-mco
  annotations:
    run.googleapis.com/launch-stage: BETA
    run.googleapis.com/ingress: all
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '1'
    spec:
      containerConcurrency: 1
      timeoutSeconds: 300
      serviceAccountName: "" # Specify a Google Service Account with roles/cloudtrace.agent and roles/monitoring.metricWriter permissions
      containers:
      - name: app
        image: "" # Specify the built application container
        env:
        - name: OTEL_COLLECTOR_ENDPOINT
          value: localhost:4317
        ports:
        - name: http1
          containerPort: 8080
        resources:
          limits:
            cpu: 500m
            memory: 256Mi
        startupProbe:
          timeoutSeconds: 5
          periodSeconds: 5
          failureThreshold: 3
          httpGet:
            path: /
            port: 8080
      - image: "" # Specify the built OpenTelemetry Collector container
        env:
        - name: PROJECT_ID
          value: "" # Specify the GCP project you are running in
        resources:
          limits:
            cpu: 200m
            memory: 128Mi
        startupProbe:
          initialDelaySeconds: 10
          timeoutSeconds: 10
          periodSeconds: 30
          failureThreshold: 3
          httpGet:
            path: /
            port: 13133 # default port of the collector's health_check extension
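
Assuming the manifest above is saved as service.yaml, it can be deployed with the gcloud run services replace command (the region below is an example; adjust it for your environment):

```shell
# Replace (or create) the Cloud Run service from the Knative manifest.
# The BETA launch-stage annotation in the manifest enables the
# multi-container feature.
gcloud run services replace service.yaml --region asia-northeast1
```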

You can confirm that the metrics and trace information are visualized in Cloud Monitoring and Cloud Trace, respectively.


Summary

By delegating the export and collection of metrics and trace information to the OpenTelemetry Collector instead of the application itself, the same monitoring setup can be reproduced not only on Cloud Run but in any environment.

The OpenTelemetry Collector settings are managed in YAML, and the configuration often differs between local and Cloud Run environments. While not covered in this article, using a tool such as CUE to keep the common parts shared and apply per-environment differences can make the configuration files much easier to manage.
If you are interested, please check the GitHub repository.

If you have any feedback or suggestions, I look forward to hearing from you in the comments or on X (Twitter).
