
Notes on EKS Container Resource Optimization


Introduction

In a recent project, I had the opportunity to optimize container resources (CPU and memory) deployed on EKS.
Most of the content is based on my experience (and assumptions) without extensive technical backing, but I would like to leave this as a memo for future reference so I can adjust my understanding as needed. 🐣

Prerequisites

  • Amazon EKS is being operated with self-managed nodes.
    • Worker nodes are managed by Auto Scaling Groups.
  • Using a combination of Spot instances, On-Demand instances, and Fargate.
    • Stateless containers (Web applications) run on Spot instances.
    • Stateful containers (job management servers, etc.) run on On-Demand instances.
    • Batch processing runs on Fargate.
  • Datadog is used as the monitoring tool.
  • It has been 5 years since operations began, but cluster-wide resource optimization has never been performed before.
    • Memory is operated with requests and limits set to the same value.
    • CPU mostly has a requests < limits relationship, with some places remaining unspecified.
    • Localized spec upgrades have been performed due to reasons such as OOM occurrences.

Basic Concepts

Resource adjustment in Kubernetes roughly revolves around the following axes, and trade-offs occur with the increase or decrease of each:

  • Container-side adjustments
    • Number of replicas
      • Decrease: Load ⬆️, Responsiveness ⬇️, Cost ⬇️
      • Increase: Load ⬇️, Responsiveness ⬆️, Cost ⬆️
    • CPU requests adjustment
      • Decrease: Bin-packing efficiency ⬆️, Node-induced throttling risk ⬆️, Cost ⬇️
      • Increase: Bin-packing efficiency ⬇️, Node-induced throttling risk ⬇️, Cost ⬆️
    • CPU limits adjustment
      • Decrease: Container-induced throttling risk ⬆️, Cost ⬇️
      • Increase: Container-induced throttling risk ⬇️, Cost ⬆️
    • Memory requests & limits adjustment
      • Decrease: Bin-packing efficiency ⬆️, OOM risk ⬆️, Cost ⬇️
      • Increase: Bin-packing efficiency ⬇️, OOM risk ⬇️, Cost ⬆️
  • Node-side adjustments
    • Number of instances
      • Decrease: Load ⬆️, Cost ⬇️
      • Increase: Load ⬇️, Cost ⬆️
    • Instance specs (vCPU / Memory)
      • Decrease: Bin-packing efficiency ⬇️, Risk during downtime ⬇️, Cost ⬆️
      • Increase: Bin-packing efficiency ⬆️, Risk during downtime ⬆️, Cost ⬇️
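For reference, the container-side knobs above map onto the resources block of a container spec (the values here are illustrative):

```yaml
# Illustrative values; each field corresponds to one of the knobs above
resources:
  requests:
    cpu: 250m        # scheduling / bin-packing input
    memory: 512Mi    # scheduling input; OOM risk if set too low
  limits:
    cpu: "1"         # CFS throttling threshold
    memory: 512Mi    # OOM-kill threshold (kept equal to requests here)
```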

Given these relationships, resource optimization involves:

  • Within the range that does not compromise original system quality:
    • If CPU throttling occurs, the failure rate of the container's LivenessProbe / ReadinessProbe increases, leading to restarts.
      • Especially for CPU, since trends often differ between startup and post-startup, over-optimizing for the post-startup state increases the risk of hindering startup operations.
    • If OOM occurs, processes are forcibly terminated, leading to restarts.
    • Over-optimizing bin-packing efficiency compromises stability during Spot instance interruptions or EC2 failures.
      • Especially at startup, usage of CPU and disk tends to spike temporarily, so if containers without limits are concentrated on a specific node, they easily become Noisy Neighbors.
  • Adjusting container resources by:
    • Reducing requests to increase container density per node.
    • Setting or reducing limits to minimize the impact on other containers due to overcommitting.
  • Adjusting node resources by:
    • Changing the instance type to a ratio close to the container's Memory / vCPU ratio.
    • Minimizing the number of instances through automatic adjustments by Auto Scaling Groups or manual changes.

Workflow

Looking at industry case studies, there are instances where spec application is automated, especially in large-scale environments. However, as this was the first optimization for a cluster that had been left unoptimized for a long time, we handled it primarily through manual tasks.

We will proceed in the following order:

  • Monitoring
  • Container-level optimization
  • Monitoring
  • Node-level optimization
  • Monitoring

Monitoring

Make the following items observable using dashboards or other tools. It is recommended to observe the state before and after applying changes to the environment.

Note that the metrics listed are for Datadog; please adapt them if using other monitoring platforms.

  • Container
    • Health
      • Number of restarts: kubernetes.containers.restarts
      • Number of evictions: kubernetes.kubelet.evictions
    • Memory amount / Number of vCPUs
      • Allocation-based: kubernetes.memory.requests / kubernetes.cpu.requests
      • Usage-based: kubernetes.memory.usage / kubernetes.cpu.usage.total
    • CPU
      • requests / limits values
      • Usage: kubernetes.cpu.usage.total
      • CPU throttling status
        • Throttled CPU time: kubernetes.cpu.cfs.throttled.seconds
        • Occurrence rate: (kubernetes.cpu.cfs.throttled.periods / kubernetes.cpu.cfs.periods) * 100
    • Memory
      • requests / limits values
      • Usage: kubernetes.memory.usage / Utilization: kubernetes.memory.usage_pct
      • OOM status
        • kubernetes.containers.last_state.terminated filtered by reason:oomkilled
        • It is also recommended to check Datadog events with the filter -status:info, which excludes informational events so warnings and errors stand out.
    • Disk
      • ephemeral-storage limits value
      • Disk usage: kubernetes.ephemeral_storage.usage
      • Disk read bytes: kubernetes.io.read_bytes
      • Disk write bytes: kubernetes.io.write_bytes
    • Network I/O
      • Transmission
        • Byte count: kubernetes.network.tx_bytes
        • Packet drop count: kubernetes.network.tx_dropped
        • Error count: kubernetes.network.tx_errors
      • Reception
        • Byte count: kubernetes.network.rx_bytes
        • Packet drop count: kubernetes.network.rx_dropped
        • Error count: kubernetes.network.rx_errors
  • Worker Node
    • Aggregate metrics used for containers by units such as instance-type or host as appropriate.
    • Health
      • Count and status: kubernetes_state.node.status
    • CPU
      • EC2 utilization: 100 - system.cpu.idle
    • Memory
      • EC2 utilization: 1 - system.mem.pct_usable
  • Cost
    • Use Cost Explorer to view EC2 Instances (Elastic Compute Cloud - Compute) by Usage Type, etc.
      • If you are using Auto Scaling Groups, you should be able to filter by tags.
    • You can view EKS-specific costs via Cost and Usage Reports (CUR), so you can use that for deeper individual analysis.
      • Which namespace's resources are incurring costs
      • The cost ratio between jobs and applications
  • Application (meaning viewing your usual monitoring metrics as well)
    • Request volume
    • Error rate
    • Latency

https://aws.amazon.com/jp/about-aws/whats-new/2024/04/aws-split-cost-allocation-data-amazon-eks/

Container-Level Optimization

First, it is necessary to decide what new requests / limits to specify for each container.
It is best to extract CPU and memory usage for a certain period and set values that include a certain margin over the max or p99 values.

For example, there is an OSS called Goldilocks that provides resource recommendations by automatically creating VPA in Off mode for containers under a specific namespace.
Note that recommendations take roughly a week after deployment to stabilize, so plan accordingly when using it.
Values can be obtained via the unofficial API or by using commands like kubectl get vpa -A -ojson.

https://github.com/FairwindsOps/goldilocks

https://kubernetes.io/docs/concepts/workloads/autoscaling/#scaling-workloads-vertically
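For reference, a minimal Python sketch of pulling the targets out of that JSON: the sample object below is fabricated, but the field names follow the VPA status.recommendation schema.

```python
import json

# Sample shaped like `kubectl get vpa -A -ojson` output; the object and
# its values are fabricated for illustration, but the field names follow
# the Vertical Pod Autoscaler's status.recommendation schema.
vpa_list_json = """
{
  "items": [
    {
      "metadata": {"name": "example-app", "namespace": "default"},
      "status": {
        "recommendation": {
          "containerRecommendations": [
            {
              "containerName": "app",
              "lowerBound": {"cpu": "100m", "memory": "256Mi"},
              "target": {"cpu": "250m", "memory": "512Mi"},
              "upperBound": {"cpu": "1", "memory": "1Gi"}
            }
          ]
        }
      }
    }
  ]
}
"""

def extract_targets(vpa_list: dict) -> list:
    """Flatten VPA objects into per-container recommendation rows."""
    rows = []
    for vpa in vpa_list.get("items", []):
        meta = vpa["metadata"]
        recs = vpa.get("status", {}).get("recommendation", {})
        for rec in recs.get("containerRecommendations", []):
            rows.append({
                "namespace": meta["namespace"],
                "vpa": meta["name"],
                "container": rec["containerName"],
                "cpu": rec["target"]["cpu"],
                "memory": rec["target"]["memory"],
            })
    return rows

rows = extract_targets(json.loads(vpa_list_json))
for r in rows:
    print(f'{r["namespace"]}/{r["vpa"]} {r["container"]}: cpu={r["cpu"]} memory={r["memory"]}')
```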

Alternatively, monitoring tools like Datadog may provide API Clients for various languages, which you can use to retrieve data in bulk.

https://github.com/DataDog/datadog-api-client-python

An example query is shown below.
After running this for a certain period (e.g., the last two weeks), you can select values such as max or the 99th percentile from the trends of each container and set values with an added margin as the recommended specifications.
The appropriate period to target depends on the characteristics of the system you are handling, such as whether it is a system that experiences peak traffic.

# Get average CPU values every 10 minutes
avg:kubernetes.cpu.usage.total{cluster_name:xxxxx} by {kube_namespace,kube_deployment,kube_stateful_set,kube_daemon_set,kube_container_name,kube_job}.rollup(avg, 600)

# Get maximum memory values every 10 minutes
max:kubernetes.memory.usage{cluster_name:xxxxx} by {kube_namespace,kube_deployment,kube_stateful_set,kube_daemon_set,kube_container_name,kube_job}.rollup(max, 600)
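The sizing step can be sketched as follows; the 99th percentile and the 20% margin are arbitrary illustrative choices, not values prescribed by any tool.

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile; enough for a rough sizing estimate."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def recommend(samples: list, p: float = 99.0, margin: float = 1.2) -> float:
    """Recommend the p-th percentile usage plus a safety margin."""
    return percentile(samples, p) * margin

# e.g. 10-minute max memory samples in MiB over the observation window
memory_samples = [310, 295, 402, 388, 415, 390, 370, 360, 405, 398]
print(round(recommend(memory_samples, p=99, margin=1.2), 1))
```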

However, since the values mentioned above are just snapshots in time, for business-critical components, it is better to investigate and consider them individually to set appropriate resource specs.
There were cases where the recommended values from Goldilocks (VPA) and Datadog differed significantly.

Also, when deploying OSS software, system requirements are often defined by the software itself, so specs cannot always be calculated solely from usage. For example, Rundeck has a minimum requirement of 2 CPUs and 8 GB RAM, so it is necessary to ensure that values lower than these are not configured.

https://docs.rundeck.com/docs/administration/install/system-requirements.html

As a method for embedding the information that "specific resources must have certain specs" directly into the resource manifests, conftest can be effectively utilized.

https://zenn.dev/yktakaha4/articles/policy_check_with_conftest

When recording that Rundeck is deployed with specific specs in annotations:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rundeck
  namespace: rundeck
  labels:
    app: rundeck
  annotations:
    # The Server Profile (Minimum) requires the following:
    # - 2 CPUs per instance
    # - 8GB RAM (4GB JVM Heap)
    # https://docs.rundeck.com/docs/administration/install/system-requirements.html
    "repo-name/explicit-cpu-request-rundeck": "2000m"
    "repo-name/explicit-cpu-limit-rundeck": "2000m"
    "repo-name/explicit-memory-request-rundeck": "8000M"
    "repo-name/explicit-memory-limit-rundeck": "8000M"
# (Omitted)

Such a policy can then raise an error when a container's requests / limits do not match the explicitly stated specs. An example Rego policy for requests.cpu:

package pod

target_resource_types := {"DaemonSet", "Deployment", "StatefulSet", "Job", "CronJob", "ReplicaSet", "ReplicationController"}

deny_explicit_resource_spec_not_set_cpu_request[msg] {
	target_resource_types[input.kind]
	container := input.spec.template.spec.containers[_]

	explicit_resource_spec := get_explicit_resource_spec("cpu-request", container.name, input)

	explicit_resource_spec
	not container.resources.requests.cpu

	msg := sprintf("%s: CPU request for container %s is not specified against the explicit spec %s", [input.metadata.name, container.name, explicit_resource_spec])
}

deny_explicit_resource_spec_not_match_cpu_request[msg] {
	target_resource_types[input.kind]
	container := input.spec.template.spec.containers[_]

	container_resource_spec := container.resources.requests.cpu
	explicit_resource_spec := get_explicit_resource_spec("cpu-request", container.name, input)

	container_resource_spec != explicit_resource_spec

	msg := sprintf("%s: CPU request %s for container %s does not match the explicit spec %s", [input.metadata.name, container_resource_spec, container.name, explicit_resource_spec])
}

get_explicit_resource_spec(resource_kind, container_name, resource) = explicit_resource_spec {
	annotations := resource.metadata.annotations
	resource_spec_key := sprintf("repo-name/explicit-%s-%s", [resource_kind, container_name])
	explicit_resource_spec := annotations[resource_spec_key]
}

Regarding the calculation of recommended specs, while it depends on the number of components deployed in the environment, it is desirable to semi-automate the process using scripts or similar tools.

Node-Level Optimization

The following points should be carefully observed immediately after applying container resource optimization to the environment:

  • Is the cluster stable?
    • Based on the criteria explained in the Monitoring section, check for Pods experiencing frequent restarts. Common issues include:
      • Insufficient CPU per container due to excessive spec reduction or increased container density > CPU throttling occurs or node CPU utilization maxes out > LivenessProbe fails > Restarts increase.
      • Memory shortage due to excessive spec reduction > OOM occurs > Restarts increase.
        • Especially since containers that haven't been adjusted, such as those in kube-system, may be affected if they don't have memory limits, you should check each failing Pod for potential issues.
      • Increased container density reduces available ephemeral storage per container > Disk exhaustion > Eviction occurs > Restarts increase.
  • Is the cluster optimized?
    • The following should change before and after application:
      • Total CPU Requests and Memory Requests across the cluster decrease.
        • If they don't, there might have been a mistake in the applied values.
      • The {CPU, Memory} allocation rate per instance type increases.
        • The higher this increases, the more node utilization has improved (meaning bin-packing efficiency has increased).
        • It's possible for one to increase while the other stays the same or decreases.
          • This indicates that the Memory/vCPU ratio of the containers deviates from the EC2 instance types being used.
      • The Memory/vCPU ratio changes.
        • You should select instance types that are appropriate for this new ratio.

If you are using autoscaling with Cluster Autoscaler, worker nodes may be automatically reduced just by optimizing at the container level, potentially lowering costs. However, since the appropriate instance type for the environment may have changed along with the shift in the Memory/vCPU ratio, you should continue with worker node optimization.

The following diagram shows the total Memory Usage / CPU Usage of containers deployed on a specific worker node.
While it was 4GB / 1vCPU before the adjustment, it became 8GB / 1vCPU after optimization. Therefore, selecting an instance with specs like 32GB / 4vCPU is expected to increase bin-packing efficiency and make it easier to reduce costs.


Image of the observed ratios before and after spec changes
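The instance selection step can be sketched as a small calculation. The observed 8GB / 1vCPU figure mirrors the example above, while the candidate shapes are illustrative stand-ins, not specific instance types.

```python
# Observed per-node container totals after optimization (from the example above)
observed_memory_gb = 8.0
observed_vcpu = 1.0
target_ratio = observed_memory_gb / observed_vcpu  # 8 GB per vCPU

# Hypothetical candidate shapes: (name, vCPU, memory GB)
candidates = [
    ("4GB/1vCPU-class", 1, 4),
    ("32GB/4vCPU-class", 4, 32),
    ("16GB/8vCPU-class", 8, 16),
]

# Pick the candidate whose memory/vCPU ratio is closest to the observed ratio
best = min(candidates, key=lambda c: abs(c[2] / c[1] - target_ratio))
print(best[0])
```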

However, simply selecting instances based on this criteria alone increases the risk of throttling if CPU is overcommitted. Considering that M-family instances, which are suitable for standard workloads, have a ratio of 4GB / 1vCPU, it is likely better to avoid choosing instance types with extreme ratios.

Additionally, selecting instance types requires significant prerequisite knowledge, so it may be safer to understand their characteristics beforehand.

https://www.youtube.com/watch?v=6zLr3LF9GYA

Once you have selected an instance type, apply it to a staging environment and test the applications.
It is recommended to check the following:

  • Whether Pod restarts occur during high-load periods, such as initial processing during deployment or specific application operations.
    • In cases where limits are not set, pay close attention as this can affect not only the high-load Pod but also other components running on the same worker node.
    • Monitoring components like the Datadog Agent might also be affected, so it may be better to check the cluster status directly using commands like kubectl get pod -A -owide.

Also, while it may seem that a configuration like one 64GB / 16core instance would be cheaper than two 32GB / 8core instances, if you are running applications on Spot instances, using a single instance increases the risk of service downtime due to Spot interruptions or AZ failures.
You will need to refine this based on resource usage and error status after container optimization.

Furthermore, it is desirable to configure components for distributed placement as a preceding step to resource optimization.

https://creators.oisixradaichi.co.jp/entry/2023/01/12/150101

https://cstoku.dev/posts/2018/k8sdojo-18/

The following outlines the points to check when selecting instances.

An easy way to grasp this is to open the AWS Auto Scaling Group console (with Read-Only permissions) and look at Specify instance attributes under Instance type requirements.


Instance type requirements screen

Below are the points to note for each component.

  • vCPU
    • Since the number of cores determines the maximum number of containers that can run in parallel, it seems preferable to have at least 4 cores.
      • If CPU is configured with requests < limits, a low absolute number of cores makes it easier for the CPU to become exhausted.
  • Memory
    • Increasing memory per instance too much to improve bin-packing efficiency increases the risk of downtime during Spot instance interruptions.
      • This is because it takes a few minutes for a new Spot instance to spin up.
    • Conversely, reducing memory too much increases the proportion of resources that run per node (such as DaemonSets), leading to poor cost efficiency.
    • When operating a self-managed node group with Cluster Autoscaler for scaling, instance weighting cannot be configured (identical values must be specified), so it is desirable to keep the memory size consistent across instance types.
      • Assigning different values between instance types causes a discrepancy between the number of instances Cluster Autoscaler expects and the actual count, which can lead to issues.
      • While it is also ideal for the vCPU count to be consistent, doing so makes it difficult to meet the requirement of having 20 or more Spot pools (described below). Since many cases explicitly define requests / limits and the total cluster requirement for memory is easier to estimate, I believe it is better to fix the memory side (though I'm not entirely certain of the "correct" answer).
        • If the Auto Scaling Group is set to the price-capacity-optimized allocation strategy, it should basically pick the cheaper, lower-core instances if the memory is the same.
  • Number of instance types (≈ Number of Spot pools)
    • As discussed in this case study, it is generally said that the risk of interruption is high if there are fewer than 20 pools, so we aim for at least 7 instance types * 3 AZs = 21 Spot pools.
      • To configure all stateless servers with Spot instances, DeNA ultimately increased the number of Spot pools used to 20. For example, at one point, for five Availability Zones in the Northern Virginia region, they defined four instance types: c5.2xlarge, c5.4xlarge, c5d.4xlarge, and c5.9xlarge. The pool count is calculated as 5 x 4 = 20 pools.

https://aws.amazon.com/blogs/aws/how-dena-succesfully-applied-ec2-spot-on-production-and-reference-architecture-using-containers/
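The pool arithmetic above is simply the number of instance types multiplied by the number of AZs; as a quick sanity check:

```python
def spot_pools(instance_types: int, azs: int) -> int:
    """Spot pool count: one pool per (instance type, AZ) pair."""
    return instance_types * azs

print(spot_pools(7, 3))  # the 7 types * 3 AZs target in this article
print(spot_pools(4, 5))  # the DeNA example: 4 types across 5 AZs
```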

Once you have estimated the instance types, check the cost per instance from the pricing tables.
As a prerequisite, note that Spot instance prices are not proportional to specs and are updated frequently, so this information should be treated as a reference only.


Price List

You can calculate the projected cost after implementation by comparing the specs and quantity of currently running instances with the specs and quantity of instances planned for the future.
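As a back-of-the-envelope sketch of that comparison; all prices and counts below are placeholders, not actual quotes, and Spot prices fluctuate as noted above.

```python
# Hypothetical hourly prices and fleet sizes; real Spot prices fluctuate
# and must be looked up in the current price list.
current = {"price_per_hour": 0.40, "count": 10}   # existing instance type
planned = {"price_per_hour": 0.70, "count": 5}    # larger, better-packed type

def monthly_cost(fleet: dict, hours: float = 730.0) -> float:
    """Projected monthly cost for a fleet at a flat hourly price."""
    return fleet["price_per_hour"] * fleet["count"] * hours

saving = monthly_cost(current) - monthly_cost(planned)
print(f"current={monthly_cost(current):.0f} planned={monthly_cost(planned):.0f} saving={saving:.0f}")
```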

References

https://www.datadoghq.com/ja/blog/kubernetes-cpu-requests-limits/

https://docs.aws.amazon.com/ja_jp/eks/latest/best-practices/cost-opt-compute.html

https://developers.freee.co.jp/entry/Approach-to-increasing-cost-of-computing-resources-in-EKS-environment

https://speakerdeck.com/sanposhiho/merukariniokerupuratutohuomuzhu-dao-nokubernetesrisosuzui-shi-hua-tosokonisheng-mareta-noke-neng-xing
