
A Free Translation of 'The Keys to SRE' at SRECon14


This is the first entry in the "Interpreting Overseas SRE Sessions" Qiita Advent Calendar 2025. It's a solo Advent Calendar, but I'll do my best to see it through to the end.

In this Advent Calendar, I will translate SRE-related sessions from overseas, adding my own opinions, questions, and supplementary information. If there are any parts that are incorrectly explained due to my misunderstanding, please point them out in the comments.

In this Advent Calendar, I will distinguish my opinions and comments from the summaries as much as possible using the notation below. However, please excuse me if my thoughts sometimes blend into the summaries, as I cannot be completely strict about this. (Please point it out if you notice it)

About the Session Introduced

The first session is "Keys to SRE" from SREcon14, which could be called the starting point of SRE itself.

https://www.youtube.com/watch?v=n4Wf14e2jxQ

Session Details: Keys to SRE | USENIX

Ben Treynor is the founder of Google SRE, a leader and manager who grew Google's operations team from 7 to over 1,200 people. When discussing the definition and principles of SRE, many sources cite this session.
For example, the SRE Workbook (O'Reilly Japan) quotes it at the beginning.

Summary

The Starting Point of SRE: "The most important thing is that the product is working properly"

  • The most crucial thing for business is that "the product works."
  • Management tends to prioritize reliability only after a major incident causes an outage.
  • However, by then, structural problems have accumulated to the point where it takes several months to a year to fix them.
  • This is why an "independent team" responsible for reliability, i.e., SRE, is considered necessary.

Traditional Structural Conflict of Dev vs Ops

Traditional Ops team:

  • Goal: Not to break things (stable operations)
  • Rule of thumb: The most common time for things to break is "when changes are made."
    • → Tends to lead to the approach of "if we stop making changes, it won't break."

Meanwhile, Dev team:

  • Goal: Release new features quickly
  • If changes are halted, the business cannot move forward

This gap leads to a typical "gate culture":

  • Launch review & deep dive
    • Based on past incidents, the gates keep growing: "these three things weren't done, so an incident occurred; let's add them to the checklist."
    • Detailed analysis of the product leads to an extensive checklist of "points that could break."
    • A TPM is needed to manage this, and checklist items continue to grow.
  • Overly cautious canary releases
    • Attempts to reduce risk by any means necessary, such as 1% canary for a month, then region by region for several weeks.

However, doing this leads to:

  • Devs trying to bypass the gates by saying things like:
    • "This isn't a release, just flipping a flag ON/OFF."
    • "It's a new feature, not an existing one, so it doesn't apply."
    • This leads to them creating "paths to slip through the gates."
  • And then, Ops, who understand the code less than Devs, ends up judging the safety of a release, creating a twisted situation.

This is not a people problem but an "incentive design problem in the structure."

Self-regulation through SLO / Error Budgets

Google SRE stopped using the method of "stopping and safeguarding releases with checklist reviews." Instead, they use SLOs and error budgets.
First, an SLO (e.g., 99.9% availability) is decided for a service.

  • The error budget is 1 - SLO.
  • Example: If the SLO is 99.9%, the remaining 0.1% is the "acceptable margin for breaking." If there are 1 billion requests per month, approximately 1 million errors are considered "within budget."
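The arithmetic above can be sketched in a few lines of Python (a minimal illustration of my own; the function name is not from the talk):

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Allowed number of failed requests for a period, given an availability SLO."""
    return int(round((1 - slo) * total_requests))

# 99.9% SLO over 1 billion monthly requests leaves ~1 million errors "within budget"
print(error_budget(0.999, 1_000_000_000))  # 1000000
```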

The release policy is very simple.

  • If within the error budget, releases are allowed.
  • As long as the SLO is met, developers are considered to be "doing a good job," and SRE trusts them.
  • If a major incident causes the SLO to be breached and the error budget is exhausted:
    • New feature releases are halted, and focus shifts to improving reliability (bug fixes, enhanced testing, design improvements).
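As a sketch, the whole policy reduces to a single comparison (hypothetical names; the talk does not present any code):

```python
def release_allowed(errors_so_far: int, slo: float, total_requests: int) -> bool:
    """Allow releases while the error budget for the period is not exhausted."""
    budget = (1 - slo) * total_requests
    return errors_so_far < budget

# Within budget: releases proceed
print(release_allowed(200_000, 0.999, 1_000_000_000))    # True
# Budget exhausted: halt features, shift to reliability work
print(release_allowed(1_200_000, 0.999, 1_000_000_000))  # False
```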

Key points so far:

  1. Acknowledge information asymmetry
    • The development team knows the code best.
    • Therefore, within the scope of SLA/SLO, maximize the discretion of the development side.
  2. Self-adjustment mechanism
    • If messy changes exhaust the error budget, their own development pace will slow down.
    • As a result, developers themselves begin to "regulate their own releases."
  3. SLOs are a business lever
    • If users complain heavily or management is in a "hair on fire" state, the SLO can be raised.
    • This is a product decision problem, not a technical one.

→ This eliminates the Dev vs Ops conflict itself, allowing conversation through common metrics (SLO/error budget), which is one of the "Keys."

Operation Overload and "Six Guidelines"

Error budgets reduce conflict, but another problem arises.

SREs become "just an operations team" due to too much operational work (manual tasks, toil).

To avoid this, Treynor presents roughly the following six guidelines.

  1. SRE and Dev share a common personnel pool (One more SRE = One less Dev)

If 50 new engineers are hired, the decision of how many to assign to SRE and how many to Dev is considered within a single pool.
This means that adding one SRE implies that features that person would have developed as a Dev will not exist.

Conversely, management is made to understand that the more operational work is reduced, the more resources (features) can be allocated to development.

  2. SRE hires "software engineers only"

Not mere script writers, but engineers who can design systems and automate. We don't want people who are content to "do the same manual task 30 times"; we want those who get bored after the second time and think, "I can't keep doing this without automation." In short, we gather people who "enjoy automating toil."

  3. Time spent on Ops is capped at 50% (ideally around 30%)

A maximum limit is set for the team's time that can be spent on operations (on-call, ticket handling, manual work). In reality, maintaining it around 30% is considered "quite good."
Exceeding this is treated as an alert for "operational overload."
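The cap is effectively a threshold check on the team's time allocation. A tiny sketch (my own illustration, not from the talk):

```python
def ops_overloaded(ops_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """Treat spending more than the cap (50% of team time) on ops as an overload alert."""
    return ops_hours / total_hours > cap

print(ops_overloaded(30, 100))  # False: ~30% is considered "quite good"
print(ops_overloaded(60, 100))  # True: time to overflow work to Dev
```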

  4. Development teams also participate in on-call/operations (minimum 5%)

Devs are made responsible for at least 5% of the operational workload, especially on-call. This keeps the reality of operations constantly flowing back to the Dev side (keeping them "warmed up"). It also smooths the handover when excess operational work is later overflowed to Dev.

If there are compliance requirements such as SOX, it's necessary to negotiate with management to design the rules.

  5. If operational tasks exceed 50%, overflow them to Dev

If SRE's Ops ratio exceeds 50%, those operational tasks are reassigned to the Dev team.
If SRE is overwhelmed by operations, development speed must be reduced so reliability can be improved; otherwise, SLO attainment will eventually be at risk.
As a result,

  • Devs dedicate time to operational work → Fewer changes → Lower frequency of new incidents
  • Further code modifications reduce operational work → Increased stability → Increased error budget, allowing releases again
  6. SRE Portability (SREs are portable)

SREs should be free to move to other projects, other organizations, or even to Dev if they wish.
They are not kept bound to "broken operational environments" or "environments full of toil with no learning or growth."
If a situation is irredeemably bad, the SRE team itself may be disbanded and moved to other healthy teams.
This also sends a strong message to management: "If things continue like this, the ticket queue will keep growing, which will be a loss for management too."

The implication is that they are prepared to escalate all the way to the top (Larry) if necessary, and that support will definitely be provided there.

Keys to Incident Response: Minimizing Impact and Preventing Recurrence

No matter how hard you try, incidents will never be zero.
The two important things are:

  • Shorten the time to detect and fix (MTTR)
  • Reduce the scope of impact (number of affected people)
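MTTR here is just the mean of detection-to-fix durations across incidents. A minimal sketch (my own illustration; the talk shows no code):

```python
from datetime import datetime, timedelta

def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to repair: average of (fixed - detected) over all incidents."""
    total = sum((fixed - detected for detected, fixed in incidents), timedelta())
    return total / len(incidents)

incidents = [
    (datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 10, 30)),  # 30 min
    (datetime(2025, 1, 2, 9, 0), datetime(2025, 1, 2, 10, 30)),   # 90 min
]
print(mttr(incidents))  # 1:00:00
```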

Key points mentioned for this include:

  • No NOC
    • Do not rely on first responders (NOC) who can only follow procedures.
    • Send alerts directly to engineers with the skills to solve the problem.
    • This reduces the number of escalation steps and time.
  • Good Alert Design
    • Alerts like "Something seems wrong in this cluster" are not helpful.
    • Aim for alerts that show "who should do what."
  • Practice, practice, practice (Wheel of Misfortune)
    • Actual post-mortems often show that on-site response times are "3 times the ideal."
    • "Skilled engineers should find the correct solution instantly" is a myth.
    • Therefore, they conduct training called the Wheel of Misfortune:
      • Role-play divided into GM (Game Master) and responders.
      • Participants gather information by asking questions and respond based on operational and design knowledge.
      • Other members mostly observe (only assisting when asked).
      • Like Dungeons & Dragons, the GM provides situational updates like "This is what the dashboard looks like now," followed by a debrief.

Post-mortem Culture

Key points for reflecting after an incident:

  • Focus on process and technology (not blaming people)
  • Create a timeline and collect facts
  • All follow-up tasks are created as bugs or tickets

Honestly share mistakes,
and discuss "how to change the system so that the same mistake is less likely to lead to an incident."

"I cannot emphasize this point enough."
This strongly promotes a blameless learning culture.

Overall Comments

This video is 11 years old, and I regret putting off watching it for so long when I could have seen it at any time. That's how much I had misunderstood some SRE practices and, in a way, underestimated SLOs and error budgets. I came to think that error budgets work well when they fully replace cumbersome release checklists, but that their effectiveness may be limited if they cannot.

In environments where Dev and Ops are not separated (DevOps-native, for example), I've noticed that central organizations (not Ops) tend to maintain release checklists in the name of governance. In such cases, I realized that I would need to attempt an incentive redesign with them.

Day two of the Advent Calendar will cover "SLO Math" from SLOconf 2021.
