
A Summary of 'Production Load Testing as a Guardrail for SLOs'


This is the 9th day of the Translation of Overseas SRE-related Sessions in the Qiita Advent Calendar 2025. Although it's a solo Advent Calendar, I'll do my best to complete it.

In this Advent Calendar, I translate overseas SRE-related sessions and add my own opinions, questions, and supplementary information. Please point out any sections where my misunderstanding has led to incorrect explanations in the comments.

Previously, I posted the Translation of Corporate Open Source Anti-Patterns: A Decade Later. It discussed common anti-patterns that companies owning OSS tend to fall into, from the perspective of social contracts.

In this Advent Calendar, I will distinguish my opinions and comments from the summaries as much as possible using the notation below. However, I ask for your understanding that it's not always strictly possible, and my own thoughts might inadvertently mix into the summaries. (Please point it out if you notice it.)

About the Session Introduced

https://www.youtube.com/watch?v=Y20K1mJB6tk

Session details: Production load testing as a guardrail for SLOs | SLOconf 2021

This session is about load testing in a production environment. As mentioned in the session, the phrase "SLO-driven load testing in production" felt very fresh to me.

Summary

Background: Complex and Frequently Changing Production Environments

The system managed by the new SRE team I supported was very complex, being modern yet intertwined with third-party integrations and legacy requirements. The development team had a culture of rapid delivery, with frequent production releases being the norm. As a result, the entire stack was prone to instability, leading to chronic on-call fatigue.

What was most serious was that "metrics were being collected, but no one knew how the user experience was deteriorating." It was clear that CPU and memory were under strain. However, it was unclear how this affected the user experience or which endpoints were truly problematic. Consequently, many of the alerts ringing in the middle of the night were "noise that didn't actually impact users."

Introducing SLOs to Regain Focus on "What's Important"

What we introduced at that point was SLOs. First, we defined SLOs for the user experience, then broke them down into objectives for individual services and components.

For example, previously an alert would fire simply because "Service X's container restarted 10 times in 15 minutes." However, after introducing SLOs and redesigning alerts based on user impact, situations where "restarts are fine from an SLO perspective" became visible, and unnecessary alerts disappeared. The need to wake up in the middle of the night diminished, and there were more situations where things could be checked the next morning. We were able to concentrate only on what was truly important.
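The shift described above, from cause-based alerts (container restarts) to user-impact alerts, is often implemented as an error-budget burn-rate check. As a rough illustration (my own addition, not from the session; all thresholds are hypothetical):

```python
# Hypothetical sketch: page on SLO error-budget burn rate instead of raw
# symptoms like container restarts. Names and thresholds are illustrative.

def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1.0 means the budget would be exactly used up over
    the SLO window; higher means faster.
    """
    budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

def should_page(error_rate: float, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when user-facing impact is burning the budget fast.

    14.4 is a commonly cited fast-burn threshold (1-hour window on a
    30-day SLO). A restart storm with no error-rate impact never pages.
    """
    return burn_rate(error_rate, slo_target) >= threshold

# A service that restarted 10 times but still serves requests fine:
print(should_page(error_rate=0.0005, slo_target=0.999))  # low burn -> False
# A real outage with 5% of requests failing:
print(should_page(error_rate=0.05, slo_target=0.999))    # fast burn -> True
```

With a rule like this, "restarts are fine from an SLO perspective" stops being a judgment call made at 3 a.m. and becomes something the alerting itself encodes.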

However, getting to that point was the most challenging part. Spreading the understanding of SLOs to development, product, and management teams, improving metric quality, and establishing a review culture—this was a massive undertaking, both technically and socially.

SLOs Are Not the End: How to Guide "Evolution"

SLOs were defined, internalized by the team, and operational mechanisms were in place. So, what was next? The question that arose here was, "How do we guide the evolution of complex systems over time?"

It was here that I encountered the concept of fitness functions from evolutionary architecture. It makes sense for software to also have functions that indicate desired directions of evolution. So, what is a fitness function in the context of SLOs?

The answer was this:

Use production load testing as a fitness function for performance-related SLOs

Load test in prod: Why production?

The idea of conducting load tests in production is often taboo in many organizations. However, my argument is clear:

  • Systems inevitably degrade and slowly drift away from SLOs.
  • It's safer to intentionally apply small stress to anticipate problems before they lead to major failures.
  • The error budget is an asset to be used for risky changes, and using it for load testing is reasonable.
  • This also aligns with the concept of anti-fragility: adapting to small shocks makes one resilient to large shocks.

In other words, load testing is not a destructive act, but rather a guardrail to protect SLOs from future major failures.
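To make the fitness-function idea concrete (my own sketch, not from the session), a performance fitness function can be as simple as a boolean check that the system still meets its latency SLO under the applied load. The function name and thresholds below are illustrative:

```python
# Hypothetical fitness function for a latency SLO of the form
# "99% of requests complete within 300 ms". All names are illustrative.

def latency_slo_fitness(latencies_ms: list[float],
                        threshold_ms: float = 300.0,
                        target_ratio: float = 0.99) -> bool:
    """Return True if the observed latencies satisfy the SLO."""
    if not latencies_ms:
        return True  # no traffic, nothing violated
    within = sum(1 for lat in latencies_ms if lat <= threshold_ms)
    return within / len(latencies_ms) >= target_ratio

# Evaluated after each production load-test run, this tells us whether
# the system is still evolving in the desired direction.
print(latency_slo_fitness([120.0] * 99 + [450.0]))       # exactly 99% -> True
print(latency_slo_fitness([120.0] * 95 + [450.0] * 5))   # only 95% -> False
```

Running this check against measurements taken during a load test is what turns the test from a one-off experiment into a repeatable guardrail.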

Even in a world where "shift left" (running tests earlier in the development process) is right for unit tests and E2E tests, load testing is an exception: too many phenomena can only be observed in production. In that sense, the approach resembles chaos engineering.

Start small: SLO-driven production load testing

Minimizing risks, they started as follows:

  • Ran production load tests once a week
  • Applied load up to 25% above the peak traffic of the past 7 days
  • Performed no manual pre-adjustments whatsoever
  • Controlled to not consume more than X% of the SLO error budget
  • Made tests stoppable within 30 seconds

Furthermore, they tested endpoints that could affect the highest-level SLOs first.
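The guardrails listed above can be sketched as a small test controller: ramp load to 25% above the 7-day peak, check error-budget spend frequently, and abort well within 30 seconds if the cap is exceeded. This is my own illustration; the three hook functions and the 2% cap are assumptions, not the session's actual implementation:

```python
# Hypothetical sketch of the guardrails above. The hooks
# (get_peak_rps_last_7_days, run_load_step, error_budget_spent_pct)
# are assumed integration points, not a real API.

MAX_BUDGET_SPEND_PCT = 2.0   # the session's "X%"; this value is illustrative
CHECK_INTERVAL_S = 10        # frequent checks keep abort latency under 30 s

def run_weekly_load_test(get_peak_rps_last_7_days, run_load_step,
                         error_budget_spent_pct, duration_s=600):
    """Drive load at 1.25x the 7-day peak, aborting on budget overspend."""
    target_rps = get_peak_rps_last_7_days() * 1.25   # 25% above recent peak
    spent_at_start = error_budget_spent_pct()
    elapsed = 0
    while elapsed < duration_s:
        run_load_step(target_rps, CHECK_INTERVAL_S)  # apply load for one step
        elapsed += CHECK_INTERVAL_S
        # No manual pre-adjustments: the only knob is the automatic abort.
        if error_budget_spent_pct() - spent_at_start > MAX_BUDGET_SPEND_PCT:
            return "aborted"      # stops within one check interval (< 30 s)
    return "completed"
```

Starting with the endpoints behind the highest-level SLOs, as the session describes, is then just a matter of which endpoints `run_load_step` targets first.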

The results were astonishing:

  • In many cases, SLOs remained unaffected even with increased load
  • It was confirmed that safety margins still remained, even with significant system changes
  • Other teams approached them, wanting to conduct similar load tests
  • I told them, "Then, let's define SLOs first."
    → This accelerated the adoption of SLOs across the entire organization

Production load testing did not induce new failures; rather, it became a mechanism to foster a culture of improving reliability within the organization.

Conclusion: Production load testing as a fitness function

In summary:

  • Once SLOs are set, safety margins must be created to protect them.
  • The concept of a fitness function is extremely powerful for this purpose.
  • Production load testing can be an excellent function for measuring the fitness of SLOs.
  • The important thing is to “just start.” Small, carefully, but surely.

Production load testing is scary. However, this session quietly convinces us that for complex systems, it is the most realistic and powerful means to protect SLOs into the future.

Overall Comments

“Load testing in prod” is something I am personally very interested in. I listened to Abema’s case study session at SRE NEXT 2025, talked with them afterward, and even recently met with them again to discuss it further. It's a topic that genuinely fascinates me.

This session is from 2021, but it was far ahead of my current understanding. The point about starting production load testing with one team, and then other teams wanting to emulate the effort, which in turn accelerated SLO adoption, was incredibly insightful.

Next time, I will be translating "99.99% of Your Traces Are (Probably) Trash," a session about tracing data sampling strategies.
