Translated by AI
Notes on EKS Container Resource Optimization
Introduction
In a recent project, I had the opportunity to optimize container resources (CPU and memory) deployed on EKS.
Most of the content is based on my experience (and assumptions) without extensive technical backing, but I would like to leave this as a memo for future reference so I can adjust my understanding as needed. 🐣
Prerequisites
- Amazon EKS is being operated with self-managed nodes.
- Worker nodes are managed by Auto Scaling Groups.
- Using a combination of Spot instances, On-Demand instances, and Fargate.
  - Stateless containers (web applications) run on Spot instances.
  - Stateful containers (job management servers, etc.) run on On-Demand instances.
  - Batch processing runs on Fargate.
- Compute Savings Plans are purchased based on usage.
- Datadog is used as the monitoring tool.
- It has been 5 years since operations began, but cluster-wide resource optimization has never been performed before.
  - Memory is operated with `requests` and `limits` set to the same value.
  - CPU mostly has a `requests < limits` relationship, with some containers left unspecified.
  - Localized spec upgrades have been performed for reasons such as OOM occurrences.
Basic Concepts
Resource adjustment in Kubernetes roughly revolves around the following axes, and trade-offs occur with the increase or decrease of each:
- Container-side adjustments
  - Number of `replicas`
    - Decrease: Load ⬆️, Responsiveness ⬇️, Cost ⬇️
    - Increase: Load ⬇️, Responsiveness ⬆️, Cost ⬆️
  - CPU `requests` adjustment
    - Decrease: Bin-packing efficiency ⬆️, Node-induced throttling risk ⬆️, Cost ⬇️
    - Increase: Bin-packing efficiency ⬇️, Node-induced throttling risk ⬇️, Cost ⬆️
  - CPU `limits` adjustment
    - Decrease: Container-induced throttling risk ⬆️, Cost ⬇️
    - Increase: Container-induced throttling risk ⬇️, Cost ⬆️
  - Memory `requests` & `limits` adjustment
    - Decrease: Bin-packing efficiency ⬆️, OOM risk ⬆️, Cost ⬇️
    - Increase: Bin-packing efficiency ⬇️, OOM risk ⬇️, Cost ⬆️
- Node-side adjustments
  - Number of instances
    - Decrease: Load ⬆️, Cost ⬇️
    - Increase: Load ⬇️, Cost ⬆️
  - Instance specs (vCPU / memory)
    - Increase: Bin-packing efficiency ⬆️, Impact when a node goes down ⬆️, Cost ⬇️
    - Decrease: Bin-packing efficiency ⬇️, Impact when a node goes down ⬇️, Cost ⬆️
Given these relationships, resource optimization involves:
- Staying within the range that does not compromise the system's original quality:
  - If CPU throttling occurs, the failure rate of the container's LivenessProbe / ReadinessProbe increases, leading to restarts.
    - Especially for CPU, usage trends often differ between startup and steady state, so over-optimizing for the steady state increases the risk of hindering startup.
  - If OOM occurs, processes are forcibly terminated, leading to restarts.
  - Over-optimizing bin-packing efficiency compromises stability during Spot instance interruptions or EC2 failures.
    - Especially at startup, CPU and disk usage tend to spike temporarily, so if containers without limits are concentrated on a specific node, they easily become noisy neighbors.
- Adjusting container resources by:
  - Reducing `requests` to increase container density per node.
  - Setting or reducing `limits` to minimize the impact of overcommitting on other containers.
- Adjusting node resources by:
  - Choosing instance types whose Memory / vCPU ratio is close to the containers' ratio.
  - Minimizing the number of instances through automatic adjustment by Auto Scaling Groups or manual changes.
Workflow
Looking at industry case studies, there are instances where spec application is automated, especially in large-scale environments. However, as this was the first optimization for a cluster that had been left unoptimized for a long time, we handled it primarily through manual tasks.
We will proceed in the following order:
- Monitoring
- Container-level optimization
- Monitoring
- Node-level optimization
- Monitoring
Monitoring
Make the following items observable using dashboards or other tools. It is recommended to observe the state before and after applying changes to the environment.
Note that the metrics listed are for Datadog; please adapt them if using other monitoring platforms.
- Container
  - Health
    - Number of restarts: `kubernetes.containers.restarts`
    - Number of evictions: `kubernetes.kubelet.evictions`
  - Memory amount / number of vCPUs
    - Allocation-based: `kubernetes.memory.requests` / `kubernetes.cpu.requests`
    - Usage-based: `kubernetes.memory.usage` / `kubernetes.cpu.usage.total`
  - CPU
    - `requests` / `limits` values
    - Usage: `kubernetes.cpu.usage.total`
    - CPU throttling status
      - Throttled CPU time: `kubernetes.cpu.cfs.throttled.seconds`
      - Occurrence rate: `(kubernetes.cpu.cfs.throttled.periods / kubernetes.cpu.cfs.periods) * 100`
  - Memory
    - `requests` / `limits` values
    - Usage: `kubernetes.memory.usage` / Utilization: `kubernetes.memory.usage_pct`
    - OOM status
      - `kubernetes.containers.last_state.terminated` filtered by `reason:oomkilled`
      - It is also recommended to check Datadog events filtered to everything except `status:info`.
  - Disk
    - `ephemeral-storage` `limits` value
    - Disk usage: `kubernetes.ephemeral_storage.usage`
    - Disk read bytes: `kubernetes.io.read_bytes`
    - Disk write bytes: `kubernetes.io.write_bytes`
  - Network I/O
    - Transmission
      - Byte count: `kubernetes.network.tx_bytes`
      - Packet drop count: `kubernetes.network.tx_dropped`
      - Error count: `kubernetes.network.tx_errors`
    - Reception
      - Byte count: `kubernetes.network.rx_bytes`
      - Packet drop count: `kubernetes.network.rx_dropped`
      - Error count: `kubernetes.network.rx_errors`
- Worker node
  - Aggregate the container metrics above by units such as `instance-type` or `host` as appropriate.
  - Health
    - Count and status: `kubernetes_state.node.status`
  - CPU
    - EC2 utilization: `100 - system.cpu.idle`
  - Memory
    - EC2 utilization: `1 - system.mem.pct_usable`
- Cost
  - Use Cost Explorer to view `EC2 Instances (Elastic Compute Cloud - Compute)` by `Usage Type`, etc.
    - If you are using Auto Scaling Groups, you should be able to filter by tags.
  - EKS-specific costs can be viewed via Cost and Usage Reports (CUR), which is useful for deeper per-item analysis, such as:
    - Which namespace's resources are incurring costs
    - The cost ratio between jobs and applications
- Application (i.e., keep an eye on your usual monitoring metrics as well)
  - Request volume
  - Error rate
  - Latency
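The throttling occurrence-rate formula listed above is simple enough to compute in a one-off script when a dashboard widget is not handy. A minimal Python sketch of the same calculation (the function name is illustrative):

```python
def throttle_rate(throttled_periods: float, total_periods: float) -> float:
    """Throttling occurrence rate in %, mirroring
    (kubernetes.cpu.cfs.throttled.periods / kubernetes.cpu.cfs.periods) * 100."""
    if total_periods == 0:
        return 0.0
    return throttled_periods / total_periods * 100


# e.g. 25 throttled periods out of 100 observed CFS periods
print(throttle_rate(25, 100))  # 25.0
```

The zero-period guard matters when a container was idle for the whole window and reported no CFS periods at all.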
Container-Level Optimization
First, it is necessary to decide what new requests / limits to specify for each container.
It is best to extract CPU and memory usage for a certain period and set values that include a certain margin over the max or p99 values.
For example, there is an OSS called Goldilocks that provides resource recommendations by automatically creating VPA in Off mode for containers under a specific namespace.
Note that its recommendations take about a week after deployment to stabilize.
Values can be obtained via the unofficial API or with commands like `kubectl get vpa -A -ojson`.
Alternatively, monitoring tools like Datadog may provide API Clients for various languages, which you can use to retrieve data in bulk.
An example query is shown below.
After running this for a certain period (e.g., the last two weeks), you can select values such as max or the 99th percentile from the trends of each container and set values with an added margin as the recommended specifications.
The appropriate period to target depends on the characteristics of the system you are handling, such as whether it is a system that experiences peak traffic.
```
# Get average CPU values every 10 minutes
avg:kubernetes.cpu.usage.total{cluster_name:xxxxx} by {kube_namespace,kube_deployment,kube_stateful_set,kube_daemon_set,kube_container_name,kube_job}.rollup(avg, 600)

# Get maximum memory values every 10 minutes
max:kubernetes.memory.usage{cluster_name:xxxxx} by {kube_namespace,kube_deployment,kube_stateful_set,kube_daemon_set,kube_container_name,kube_job}.rollup(max, 600)
```
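Once per-container usage series have been pulled (for example via a Datadog API client or a CSV export), the "max or p99 plus margin" selection can be scripted. A minimal sketch, assuming the samples are already in memory; the nearest-rank percentile and the 25% margin are illustrative choices, not values prescribed here:

```python
import math


def recommend(samples: list[float], percentile: float = 0.99, margin: float = 0.25) -> float:
    """Pick the nearest-rank percentile of observed usage and add a safety margin."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(percentile * len(ordered)) - 1)
    return ordered[rank] * (1 + margin)


# 100 flat samples of 500m CPU -> p99 is 500m; with a 25% margin the
# recommended request becomes 625m
print(recommend([500.0] * 100))  # 625.0
```

The same helper works for memory series; only the units differ.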
However, since the values mentioned above are just snapshots in time, for business-critical components, it is better to investigate and consider them individually to set appropriate resource specs.
There were cases where the recommended values from Goldilocks (VPA) and Datadog differed significantly.
Also, when deploying OSS, the software often defines its own system requirements, so specs cannot always be derived from usage alone. For example, Rundeck has a minimum requirement of 2 CPUs and 8 GB RAM, so you must ensure values lower than these are never configured.
As a method for embedding the information that "specific resources must have certain specs" directly into the resource manifests, conftest can be effectively utilized.
When recording that Rundeck is deployed with specific specs in annotations:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rundeck
  namespace: rundeck
  labels:
    app: rundeck
  annotations:
    # The Server Profile (Minimum) requires the following:
    # - 2 CPUs per instance
    # - 8GB RAM (4GB JVM Heap)
    # https://docs.rundeck.com/docs/administration/install/system-requirements.html
    "repo-name/explicit-cpu-request-rundeck": "2000m"
    "repo-name/explicit-cpu-limit-rundeck": "2000m"
    "repo-name/explicit-memory-request-rundeck": "8000M"
    "repo-name/explicit-memory-limit-rundeck": "8000M"
# (Omitted)
```
With the following policy, you can trigger an error if the container's requests / limits do not match the explicitly stated specs. An example Rego policy for requests.cpu is provided below:
```rego
package pod

target_resource_types := {"DaemonSet", "Deployment", "StatefulSet", "Job", "CronJob", "ReplicaSet", "ReplicationController"}

deny_explicit_resource_spec_not_set_cpu_request[msg] {
    target_resource_types[input.kind]
    container := input.spec.template.spec.containers[_]
    explicit_resource_spec := get_explicit_resource_spec("cpu-request", container.name, input)
    explicit_resource_spec
    not container.resources.requests.cpu
    msg := sprintf("%s: CPU request for container %s is not specified against the explicit spec %s", [input.metadata.name, container.name, explicit_resource_spec])
}

deny_explicit_resource_spec_not_match_cpu_request[msg] {
    target_resource_types[input.kind]
    container := input.spec.template.spec.containers[_]
    container_resource_spec := container.resources.requests.cpu
    explicit_resource_spec := get_explicit_resource_spec("cpu-request", container.name, input)
    container_resource_spec != explicit_resource_spec
    msg := sprintf("%s: CPU request %s for container %s does not match the explicit spec %s", [input.metadata.name, container_resource_spec, container.name, explicit_resource_spec])
}

get_explicit_resource_spec(resource_kind, container_name, resource) = explicit_resource_spec {
    annotations := resource.metadata.annotations
    resource_spec_key := sprintf("repo-name/explicit-%s-%s", [resource_kind, container_name])
    explicit_resource_spec := annotations[resource_spec_key]
}
```
Regarding the calculation of recommended specs, while it depends on the number of components deployed in the environment, it is desirable to semi-automate the process using scripts or similar tools.
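As one way to semi-automate this, the recommended values can be rendered into the `repo-name/explicit-*` annotation keys that the conftest policy above checks. A hypothetical sketch (the key naming follows the earlier manifest example; the helper itself is not from the original article):

```python
def explicit_spec_annotations(container: str, cpu_m: int, mem_mb: int) -> dict[str, str]:
    """Emit the repo-name/explicit-* annotation keys checked by the conftest
    policy, from recommended CPU (millicores) and memory (MB) values."""
    return {
        f"repo-name/explicit-cpu-request-{container}": f"{cpu_m}m",
        f"repo-name/explicit-cpu-limit-{container}": f"{cpu_m}m",
        f"repo-name/explicit-memory-request-{container}": f"{mem_mb}M",
        f"repo-name/explicit-memory-limit-{container}": f"{mem_mb}M",
    }


annotations = explicit_spec_annotations("rundeck", 2000, 8000)
print(annotations["repo-name/explicit-cpu-request-rundeck"])  # 2000m
```

A script like this can merge the generated keys into each manifest's `metadata.annotations` so the policy and the specs never drift apart.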
Node-Level Optimization
The following points should be carefully observed immediately after applying container resource optimization to the environment:
- Is the cluster stable?
  - Based on the criteria explained in the Monitoring section, check for Pods experiencing frequent restarts. Common issues include:
    - Insufficient CPU per container due to excessive spec reduction or increased container density → CPU throttling, or node CPU utilization maxing out → LivenessProbe fails → restarts increase.
    - Memory shortage due to excessive spec reduction → OOM → restarts increase.
      - In particular, unadjusted containers such as those in `kube-system` may be affected if they lack memory limits, so check each failing Pod for potential issues.
    - Increased container density reduces available ephemeral storage per container → disk exhaustion → eviction → restarts increase.
- Is the cluster optimized?
  - The following should change before and after application:
    - Total CPU requests and memory requests across the cluster decrease.
      - If they don't, there may have been a mistake in the applied values.
    - The {CPU, memory} allocation rate per instance type increases.
      - The higher it rises, the better the node utilization (i.e., bin-packing efficiency has improved).
      - One can increase while the other stays flat or decreases, which indicates that the containers' Memory/vCPU ratio deviates from that of the EC2 instance types in use.
    - The Memory/vCPU ratio changes.
      - Select instance types appropriate for the new ratio.
If you are using autoscaling with Cluster Autoscaler, worker nodes may be automatically reduced just by optimizing at the container level, potentially lowering costs. However, since the appropriate instance type for the environment may have changed along with the shift in the Memory/vCPU ratio, you should continue with worker node optimization.
The following diagram shows the total Memory Usage / CPU Usage of containers deployed on a specific worker node.
While it was 4GB / 1vCPU before the adjustment, it became 8GB / 1vCPU after optimization. Therefore, selecting an instance with specs like 32GB / 4vCPU is expected to increase bin-packing efficiency and make it easier to reduce costs.
(Figure: image of the observed ratios before and after the spec change)
However, simply selecting instances based on this criteria alone increases the risk of throttling if CPU is overcommitted. Considering that M-family instances, which are suitable for standard workloads, have a ratio of 4GB / 1vCPU, it is likely better to avoid choosing instance types with extreme ratios.
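The ratio check described above can be sketched as a small filter over candidate instance types. The candidate specs and the ±25% tolerance below are illustrative assumptions, not values from the article:

```python
def mem_per_vcpu(mem_gb: float, vcpu: int) -> float:
    """GB of memory per vCPU for a given instance spec."""
    return mem_gb / vcpu


def candidates(target_ratio: float, types: dict[str, tuple[float, int]],
               tolerance: float = 0.25) -> list[str]:
    """Keep instance types whose GB-per-vCPU ratio is within +-tolerance of target."""
    return [
        name for name, (mem_gb, vcpu) in types.items()
        if abs(mem_per_vcpu(mem_gb, vcpu) - target_ratio) <= target_ratio * tolerance
    ]


# Illustrative specs: m-family is ~4 GB/vCPU, r-family ~8 GB/vCPU, c-family ~2 GB/vCPU
fleet = {"m5.xlarge": (16.0, 4), "r5.xlarge": (32.0, 4), "c5.xlarge": (8.0, 4)}
print(candidates(8.0, fleet))  # ['r5.xlarge']
```

In practice the M-family caveat above still applies: an 8 GB/vCPU fit is only worth taking if CPU is not heavily overcommitted.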
Additionally, selecting instance types requires significant prerequisite knowledge, so it may be safer to understand their characteristics beforehand.
Once you have selected an instance type, apply it to a staging environment and test the applications.
It is recommended to check the following:
- Whether Pod restarts occur during high-load periods, such as initial processing during deployment or specific application operations.
  - Where limits are not set, pay close attention: a high-load Pod can affect not only itself but also other components running on the same worker node.
  - Monitoring components like the Datadog Agent might also be affected, so it may be better to check the cluster status directly with commands like `kubectl get pod -A -owide`.
Also, while it may seem that a configuration like one 64GB / 16core instance would be cheaper than two 32GB / 8core instances, if you are running applications on Spot instances, using a single instance increases the risk of service downtime due to Spot interruptions or AZ failures.
You will need to refine this based on resource usage and error status after container optimization.
Furthermore, it is desirable to configure components for distributed placement as a preceding step to resource optimization.
The following outlines the points to check when selecting instances.
It is easy to understand by opening the AWS Auto Scaling Group screen with Read-Only permissions and looking at Specify instance attributes under Instance type requirements.
(Figure: Instance type requirements screen)
Below are the points to note for each component.
- vCPU
  - Since the number of cores determines the maximum number of containers that can run in parallel, having at least 4 cores seems preferable.
  - If CPU is configured with `requests < limits`, a low absolute core count makes CPU exhaustion more likely.
- Memory
  - Increasing memory per instance too much to improve bin-packing efficiency increases the risk of downtime during Spot instance interruptions.
    - This is because it takes a few minutes for a replacement Spot instance to spin up.
  - Conversely, reducing memory too much increases the proportion of per-node resources (such as DaemonSets), leading to poor cost efficiency.
  - When operating a self-managed node group scaled by Cluster Autoscaler, instance weighting cannot be configured (identical values must be specified), so it is desirable to keep the memory size consistent across instance types.
    - Assigning different values between instance types causes a discrepancy between the number of instances Cluster Autoscaler expects and the actual count, which can lead to issues.
    - While a consistent vCPU count would also be ideal, that makes it difficult to meet the requirement of 20 or more Spot pools (described below). Since `requests` / `limits` are explicitly defined in many cases and the cluster's total memory requirement is easier to estimate, I believe it is better to fix the memory side (though I'm not entirely certain of the "correct" answer).
    - If the Auto Scaling Group uses the price-capacity-optimized allocation strategy, it should generally pick the cheaper, lower-core instances when memory is the same.
- Number of instance types (≈ number of Spot pools)
  - As discussed in this case study, it is generally said that interruption risk is high with fewer than 20 pools, so we aim for at least `7 instance types * 3 AZs = 21 Spot pools`.
    > To configure all stateless servers with Spot instances, DeNA ultimately increased the number of Spot pools used to 20. For example, at one point, for five Availability Zones in the Northern Virginia region, they defined four instance types: c5.2xlarge, c5.4xlarge, c5d.4xlarge, and c5.9xlarge. The pool count is calculated as 5 x 4 = 20 pools.
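The pool arithmetic is simply instance types × Availability Zones; as a trivial sanity check:

```python
def spot_pools(instance_types: int, azs: int) -> int:
    """Number of Spot capacity pools = instance types x Availability Zones."""
    return instance_types * azs


# The target above: 7 types x 3 AZs = 21 pools, clearing the 20-pool guideline
print(spot_pools(7, 3))  # 21
```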
Once you have estimated the instance types, check the cost per instance from the pricing tables.
As a prerequisite, note that Spot instance prices are not proportional to specs and are updated frequently, so this information should be treated as a reference only.
- Spot Instances
- On-Demand Instances
(Figure: price list)
You can calculate the projected cost after implementation by comparing the specs and quantity of currently running instances with the specs and quantity of instances planned for the future.
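That comparison can be sketched with placeholder hourly prices. The fleets and rates below are hypothetical, not real AWS prices:

```python
def monthly_cost(fleet: dict[str, tuple[int, float]], hours: int = 730) -> float:
    """Total monthly cost of a fleet described as {instance_type: (count, hourly_usd)}."""
    return sum(count * price * hours for count, price in fleet.values())


# Hypothetical rates for illustration only
current = {"m5.2xlarge": (10, 0.50)}
planned = {"r5.xlarge": (8, 0.25)}
print(monthly_cost(current) - monthly_cost(planned))  # 2190.0
```

Since Spot prices fluctuate, treat any such projection as a rough bound rather than a forecast.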