
Chapter 5: Failure Design and Operational Patterns in RML-2 — Exceptions, Observability, and Governance


"The Worlds of Distributed Systems" Chapter 5


"Should we just catch all exceptions and show the user,
'An error occurred, please try again'?"

Doing this in a distributed system will set the world on fire.

In Chapter 1, we categorized rollback strategies into three worlds:

  • RML-1 — Closed World
  • RML-2 — Dialog World
  • RML-3 — History World

In this chapter, we focus on RML-2 (Dialog World), which is the one where teams most frequently trip up. We will cover:

  • Classifying exceptions and failures based on "which world they belong to."
  • Deciding "where to catch them and how far to propagate them" based on that classification.

We will summarize these as design patterns.


1. Organizing the "World of Failure" First

If we map the RML concepts to "failure," the landscape looks roughly like this:

| World | Type of Failure | Scope of Impact | Example |
|-------|-----------------|-----------------|---------|
| RML-1 | Failure confined to a single process | Local memory / temporary files | Out of memory, internal validation failure |
| RML-2 | Failure during dialog with another service/user | Self + neighbor services + partial users | Downstream service timeout, saga failure mid-flow |
| RML-3 | Failure that should be treated as "history" | Reality shared with the org or society | Double payment, misdirected email, compliance breach |

This chapter deals with failures occurring in the RML-2 world.

In RML-2, when an exception occurs, you need to decide on a pattern for:

  • "Can I resolve this within myself?"
  • "Should I roll back this conversation (saga)?"
  • "Should I escalate this to RML-3?"

If these are left ambiguous, you end up with a critical mismatch where:

You treat an RML-3 level incident as if it were resolved by an RML-1 local fix.

This is the mismatch that forces SREs to pull all-nighters.


2. RML-2's Three-Layer Model: Local / Dialog / History-bound

When looking at exceptions in the RML-2 world, it is easiest to categorize them into these three layers:

  1. Local Failure
  2. Dialog Failure
  3. History-bound Failure

2.1 Local Failure — Still Confined to the Closed World

  • Failures at a stage where nothing has leaked to the "outside world" yet.
  • Input validation, internal state inconsistency, or a single item error within a batch.

These are handled the same as RML-1:

  • Roll back locally (retry or abort).
  • Log the issue.
  • Return a "standard error" to the caller.

Treating failures at this level as "major incidents" will quickly exhaust your operations team.

2.2 Dialog Failure — Failure Occurring Within a Conversation

  • RPC to a downstream service timed out.
  • Failure occurred at the 3rd step of a saga.
  • Tried to publish an event, but the broker was down.

These failures must be handled as part of the dialogue.

  • "What should the saga do as a whole?"
  • "Should I trigger a compensation transaction?"
  • "Is a simple retry on the spot enough?"

Decisions like these are required. If you just:

catch (Exception e) and return 500

the "conversation" breaks midway, and the saga's consistency is destroyed.

2.3 History-bound Failure — Failure Destined for History

Even if you designed for RML-2, you may encounter cases where:

  • Payment has already been confirmed externally.
  • Email or notifications have already reached the user.
  • Legal/contractual obligations mandate that "the cancellation itself must be recorded."

This level should be sent to RML-3.

  • Since you cannot "erase the past," you must accumulate "correction events" such as:

    • Refunds
    • Corrections
    • Notifications (explaining to the user)

What you see in your RML-2 code might be an "exception," but it is crucial to recognize that in terms of handling, it is already in the RML-3 world.
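The three layers can be told apart by asking what has already left the process. The following is a minimal sketch, not a definitive rule: the `EffectState` flags and the `classifyFailure` name are hypothetical and assume the calling code tracks what it has done so far.

```typescript
// Hypothetical bookkeeping about what a request has done so far.
// Neither field comes from the chapter; both are assumptions for illustration.
type EffectState = {
  externalCallStarted: boolean;         // have we talked to another service or user yet?
  irreversibleEffectCommitted: boolean; // e.g. payment settled, email delivered
};

type FailureLayer = "local" | "dialog" | "history-bound";

function classifyFailure(state: EffectState): FailureLayer {
  // Once something irreversible has happened, the failure belongs to history.
  if (state.irreversibleEffectCommitted) return "history-bound";
  // Something has reached a neighbor, but it can still be compensated.
  if (state.externalCallStarted) return "dialog";
  // Nothing has left the process yet.
  return "local";
}
```

The order of the checks matters: an irreversible effect outranks everything else, so a failure is never downgraded to "dialog" once history has been written.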


3. Labeling Exceptions: Error → (World, Severity, Action)

How should we handle this in actual code? A safe starting point is to "structure" the exceptions.

At the extreme, imagine treating exceptions with a type like this:

type World = "RML1" | "RML2" | "RML3";

type Severity = "info" | "warn" | "error" | "critical";

type ActionHint =
  | "retry-local"        // Safe to retry on the spot
  | "retry-with-backoff" // Retry with exponential backoff
  | "start-compensation" // Start saga compensation
  | "escalate-history"   // Escalate to RML-3
  | "abort";             // Abort immediately

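type StructuredErrorFields = {
  world: World;
  severity: Severity;
  action: ActionHint;
  code: string;
  message: string;
  cause?: unknown;
};

// A class (not only a type) so it can be thrown with `new StructuredError({...})`.
class StructuredError extends Error {
  readonly world!: World;
  readonly severity!: Severity;
  readonly action!: ActionHint;
  readonly code!: string;

  constructor(fields: StructuredErrorFields) {
    super(fields.message);
    this.name = "StructuredError";
    // Copies world, severity, action, code, and cause onto the instance.
    Object.assign(this, fields);
  }
}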

While it can be difficult to achieve such a clean form in production code, even just explicitly defining world (which world is this failure in?) and action (what should the caller do?) makes it possible to discuss exception handling based on the "worldview."
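Errors thrown by third-party code will not carry these labels, so some normalization step is usually needed at the boundary. Below is a minimal sketch of one. The names `StructuredErrorLike` and `isStructuredError`, the `UNCLASSIFIED` code, and the choice to default unknown errors to `RML2` with `abort` are all assumptions; a team might reasonably prefer to escalate unknowns instead.

```typescript
// Same fields as the chapter's StructuredError, repeated so this snippet stands alone.
type World = "RML1" | "RML2" | "RML3";
type Severity = "info" | "warn" | "error" | "critical";
type ActionHint =
  | "retry-local" | "retry-with-backoff" | "start-compensation" | "escalate-history" | "abort";

type StructuredErrorLike = {
  world: World; severity: Severity; action: ActionHint;
  code: string; message: string; cause?: unknown;
};

// Structural check: works for class instances and plain objects alike.
function isStructuredError(e: unknown): e is StructuredErrorLike {
  if (typeof e !== "object" || e === null) return false;
  const r = e as Record<string, unknown>;
  const world = r["world"];
  const worldOk = world === "RML1" || world === "RML2" || world === "RML3";
  return worldOk && typeof r["action"] === "string" && typeof r["code"] === "string";
}

function toStructuredError(e: unknown): StructuredErrorLike {
  if (isStructuredError(e)) return e;
  // Assumed default: an unlabeled error is never retried automatically.
  return {
    world: "RML2",
    severity: "error",
    action: "abort",
    code: "UNCLASSIFIED",
    message: e instanceof Error ? e.message : String(e),
    cause: e,
  };
}
```

Counting `UNCLASSIFIED` errors over time gives a rough measure of how much of the codebase has not yet been labeled.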


4. Typical Exception Handling Patterns in RML-2

Let's look at some common handling patterns in the RML-2 world.

Pattern A: Resolve Local Failures On-the-Spot

The simplest pattern:

  • Input validation errors
  • Assertions within your own process
  • Stages where no external effects have yet occurred

These are exceptions that can be treated as RML-1.

if (!input.isValid()) {
  throw new StructuredError({
    world: "RML1",
    severity: "warn",
    action: "abort",
    code: "INVALID_INPUT",
    message: "Invalid input",
  });
}

The caller can simply:

  • Return the validation error to the user
  • Treat it as "never started" for the saga as a whole

There is no need to escalate these to the dialog level.

Pattern B: Handle Dialog Failures in the "Saga Context"

This is typical of the RML-2 world.

async function reserveStock(orderId: string): Promise<void> {
  try {
    await stockService.reserve(orderId);
  } catch (e) {
    throw new StructuredError({
      world: "RML2",
      severity: "error",
      action: "start-compensation",
      code: "STOCK_RESERVE_FAILED",
      message: "Failed to reserve stock",
      cause: e,
    });
  }
}

The caller (the saga executor) triggers the compensation flows for previous steps based on world === "RML2" && action === "start-compensation".

What is crucial here is to "decide how to handle this failure during the saga design phase" rather than just saying, "It failed, so return 500 and end it."
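To make the caller's side concrete, here is a minimal sketch of a saga executor that runs compensations in reverse order when a step asks for it. `SagaStep`, `runSaga`, and the `run`/`undo` field names are hypothetical, and the sketch assumes compensations themselves do not fail.

```typescript
// Hypothetical step shape: `run` performs the step, `undo` compensates it.
type SagaStep = {
  name: string;
  run: () => Promise<void>;
  undo: () => Promise<void>;
};

type SagaResult = { ok: boolean; compensated: string[] };

async function runSaga(steps: SagaStep[]): Promise<SagaResult> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      completed.push(step);
    } catch (e) {
      const action = (e as { action?: unknown } | null)?.action;
      // Only "start-compensation" is handled here; other hints go to the caller.
      if (action !== "start-compensation") throw e;
      const compensated: string[] = [];
      // Undo completed steps, most recent first.
      for (const prev of [...completed].reverse()) {
        await prev.undo();
        compensated.push(prev.name);
      }
      return { ok: false, compensated };
    }
  }
  return { ok: true, compensated: [] };
}
```

The failing step itself is not undone, because it never completed. If a step can fail after a partial side effect, it needs its own cleanup or an idempotent `undo`.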

Pattern C: Return History-bound Failures on the Assumption They Will Be Escalated

Some errors visible in RML-2 code are clearly "history-bound":

  • "This transaction is already settled" from a payment gateway
  • "Past the cancellation deadline" from an external system
  • "Operation prohibited" from a compliance API

It is dangerous to try to force a reconciliation using the saga's undo() for these.

async function cancelPayment(paymentId: string): Promise<void> {
  try {
    await paymentGateway.cancel(paymentId);
  } catch (e) {
    if (isAlreadySettledError(e)) {
      throw new StructuredError({
        world: "RML3",
        severity: "critical",
        action: "escalate-history",
        code: "PAYMENT_ALREADY_SETTLED",
        message: "Cannot cancel because payment is already settled",
        cause: e,
      });
    }
    // Otherwise, treat as a temporary RML-2 failure
    throw new StructuredError({
      world: "RML2",
      severity: "error",
      action: "retry-with-backoff",
      code: "PAYMENT_CANCEL_FAILED",
      message: "Failed to cancel payment (retryable)",
      cause: e,
    });
  }
}

The point is to explicitly branch:

  • world === "RML3" exceptions move to the world of incident management, the Effect Ledger, and legal/compliance responses.
  • world === "RML2" exceptions move to the world of retries or saga compensation.
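The branch can be sketched as a single routing function. This is an illustration under stated assumptions: `routeFailure`, `LedgerEntry`, and the route names are hypothetical, and the in-memory array only stands in for whatever durable record your system uses for irreversible effects.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Failure = { world: World; code: string; message: string };

// Stand-in for an append-only record of things that cannot be undone.
type LedgerEntry = { code: string; message: string; recordedAt: string };

type Route = "return-to-caller" | "saga-retry-or-compensate" | "incident";

function routeFailure(f: Failure, ledger: LedgerEntry[]): Route {
  switch (f.world) {
    case "RML1":
      return "return-to-caller";
    case "RML2":
      return "saga-retry-or-compensate";
    case "RML3":
      // Record first, so the fact survives even if incident tooling is unavailable.
      ledger.push({ code: f.code, message: f.message, recordedAt: new Date().toISOString() });
      return "incident";
  }
}
```

Because the `switch` is exhaustive over `World`, adding a fourth world later would produce a compile error here rather than a silent fall-through.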

5. Typical Anti-patterns

Conversely, here are some patterns you should avoid.

5.1 Using catch-all to return 500 for everything

try {
  // various operations
} catch (e) {
  logger.error(e);
  return res.status(500).json({ message: "Internal Server Error" });
}

Doing this hides several issues under a single "500" status:

  • Simple RML-1 mistakes
  • Saga mid-process failures in RML-2
  • Accidents that should be escalated to RML-3

What happens in production:

  • It looks like a "successful rollback" at first glance.
  • But in reality, RML-3 accidents silently accumulate.
  • One day, they are discovered all at once, leading to a nightmare.
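One alternative to the blanket 500 is to derive the response from the error's labels. The sketch below is only an illustration: `toHttpResponse` is a hypothetical helper, and the status mapping (409 for RML-3, 503 for RML-2, 400 for `INVALID_INPUT`) is an assumption, not a rule from this chapter.

```typescript
type World = "RML1" | "RML2" | "RML3";
type ErrorInfo = { world: World; action: string; code: string; message: string };

type HttpResponse = {
  status: number;
  body: { code: string; message: string; world: World; action: string };
};

function toHttpResponse(err: ErrorInfo): HttpResponse {
  let status: number;
  if (err.world === "RML3") {
    status = 409; // the request conflicts with a fact that is already settled
  } else if (err.world === "RML2") {
    status = 503; // a dialog partner failed; the condition may be temporary
  } else if (err.code === "INVALID_INPUT") {
    status = 400; // local validation failure caused by the request itself
  } else {
    status = 500; // only unexplained local failures remain generic
  }
  // World and action travel in the body so the three cases stay distinguishable.
  return {
    status,
    body: { code: err.code, message: err.message, world: err.world, action: err.action },
  };
}
```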

5.2 The "Just Retry" Loop

  • Automatically retrying even for RML-3 level failures.
  • Failures continue to pile up, causing DoS-like load on external systems.
  • In the worst case, repeating the same erroneous operation multiple times.

Whether or not a "retry will fix it" depends on the worldview:

  • RML-2 temporary failure → Retries are effective.
  • RML-3 settled error → Retries will not help (they might even make it worse).
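A retry policy can encode this distinction directly, so that only retryable hints ever reach the retry loop. The following is a minimal sketch; `nextDelayMs`, `backoffMs`, and the default values (3 attempts, 100 ms base, 5 s cap) are assumptions chosen for illustration.

```typescript
type ActionHint =
  | "retry-local" | "retry-with-backoff" | "start-compensation" | "escalate-history" | "abort";

// Exponential backoff: base * 2^n, capped so the delay cannot grow without bound.
function backoffMs(n: number, baseMs = 100, capMs = 5000): number {
  return Math.min(capMs, baseMs * 2 ** n);
}

// `failedAttempts` is how many attempts have already failed (1 after the first failure).
// Returns the delay before the next attempt, or null when no retry should happen.
function nextDelayMs(action: ActionHint, failedAttempts: number, maxAttempts = 3): number | null {
  const retryable = action === "retry-local" || action === "retry-with-backoff";
  if (!retryable || failedAttempts >= maxAttempts) return null;
  return action === "retry-local" ? 0 : backoffMs(failedAttempts - 1);
}
```

With this shape, `escalate-history` and `start-compensation` return `null` on the first failure, so a settled RML-3 error cannot enter the loop by accident.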

5.3 Not defining the worldview in exceptions

  • Even reading the error message, you cannot tell "which world's failure this is."
  • The caller interprets it as "probably something like this" on their own.
  • As a result, RML-2 code hides RML-3 level accidents.

6. Connecting with Observability — Linking world tags to SLOs/Alerts

The world and severity in StructuredError are also highly compatible with prioritizing operations.

For example, if you are using Datadog or OpenTelemetry, you can simply attach them as:

  • Log fields
  • Metrics or trace attributes

logger.error({
  err: structuredError,
  world: structuredError.world,
  severity: structuredError.severity,
  action: structuredError.action,
  code: structuredError.code,
});

In the case of OTel, you would attach them as attributes to a span:

span.setAttribute("rml.world", structuredError.world);
span.setAttribute("rml.severity", structuredError.severity);
span.setAttribute("rml.action", structuredError.action);
span.setAttribute("rml.code", structuredError.code);

By doing this, the SRE team can establish operational policies like:

  • RML-3 critical errors wake you up at night

    • Using PagerDuty / Opsgenie rules:

      • Require on-call if rml.world = "RML3" AND severity = "critical"
  • RML-2 error items are handled during business hours

    • At the level of "not spreading like wildfire, but will cause trouble if ignored."
  • RML-1 warn items can wait until the next morning

    • Accumulated as metrics to monitor trends in local failures.

By having tags based on the worldview,

You are not stuck with the binary choice of "alert on everything or ignore everything,"
but can perform finer adjustments like "which world's alerts should be sent to whom and during which hours."

When talking to the SRE team, it makes the conversation much easier if you decide on the thresholds—such as which combinations of world and severity map to SLOs/SLAs, or which worlds should "not wake anyone up at night"—alongside the RML table.
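Such an agreement can be written down as a small routing table in code, which makes it reviewable alongside the RML table. This is a sketch only: `routeAlert` and the route names are hypothetical, and the rows not mentioned in the text above (marked in comments) are assumptions to be replaced by whatever the team decides.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Severity = "info" | "warn" | "error" | "critical";

type AlertRoute =
  | "page-on-call"
  | "business-hours-ticket"
  | "next-morning-review"
  | "metrics-only";

function routeAlert(world: World, severity: Severity): AlertRoute {
  // From the text: RML-3 critical errors wake someone up at night.
  if (world === "RML3" && severity === "critical") return "page-on-call";
  // Assumption: less severe RML-3 items still get a ticket.
  if (world === "RML3") return "business-hours-ticket";
  // From the text: RML-2 error items are handled during business hours.
  // Assumption: RML-2 critical is treated the same way.
  if (world === "RML2" && (severity === "error" || severity === "critical")) {
    return "business-hours-ticket";
  }
  // From the text: RML-1 warn items can wait until the next morning.
  if (world === "RML1" && severity === "warn") return "next-morning-review";
  // Assumption: everything else is only counted as a metric.
  return "metrics-only";
}
```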


7. Enforcing ActionHint — Building mechanisms to ensure compliance

ActionHint is a hint indicating "what the caller should do in response to this error," but

Hints are often ignored.

As pointed out in anti-pattern 5.2,

  • Automatically retrying even though the ActionHint says escalate-history
  • Not implementing compensation logic even when it says start-compensation

Nightmares like these cannot be prevented by design alone. Here are some mechanisms for "enforcing compliance."

7.1 Offload logic to client libraries

One approach is to encapsulate the logic that follows the ActionHint within the client library.

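// Assumes toStructuredError, sleep, backoff, startCompensationFlow, and
// notifyIncident are provided elsewhere in the client library.
async function callWithRmlHandling<T>(f: () => Promise<T>): Promise<T> {
  try {
    return await f();
  } catch (e) {
    const err = toStructuredError(e);

    switch (err.action) {
      case "retry-local":
        // Retry once, immediately. A second failure propagates to the caller as-is.
        return await f();
      case "retry-with-backoff":
        // Wait, then retry once. Production code would cap the number of attempts.
        await sleep(backoff());
        return await f();
      case "start-compensation":
        // Trigger compensation, then rethrow so the caller knows the call failed.
        await startCompensationFlow(err);
        throw err;
      case "escalate-history":
        // Never retry: hand off to incident handling and rethrow.
        await notifyIncident(err);
        throw err;
      case "abort":
      default:
        throw err;
    }
  }
}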

On the application code side, establish a rule to pass through a "common entry point" as much as possible:

await callWithRmlHandling(() => someRmlAwareApiCall());

This makes it harder to:

  • Perform ad-hoc retries while ignoring the ActionHint
  • Swallow escalate-history signals
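A common entry point is also the natural place to bound retries, so that a failure during a retry goes back through the same handling instead of escaping it. The sketch below is one possible shape, not the library's actual design: `callWithBoundedRetries`, the `Hooks` type, and the default limits are hypothetical.

```typescript
type ActionHint =
  | "retry-local" | "retry-with-backoff" | "start-compensation" | "escalate-history" | "abort";

// Hypothetical hooks; real code would call the saga engine, incident tooling, and a timer.
type Hooks = {
  compensate: (e: unknown) => Promise<void>;
  escalate: (e: unknown) => Promise<void>;
  sleep: (ms: number) => Promise<void>;
};

function actionOf(e: unknown): ActionHint {
  const action = (e as { action?: ActionHint } | null)?.action;
  return action ?? "abort"; // errors without a hint are not retried
}

async function callWithBoundedRetries<T>(
  f: () => Promise<T>,
  hooks: Hooks,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await f();
    } catch (e) {
      const action = actionOf(e);
      const retryable = action === "retry-local" || action === "retry-with-backoff";
      if (retryable && attempt < maxAttempts) {
        // Backoff doubles each time: 100 ms, 200 ms, ...
        if (action === "retry-with-backoff") await hooks.sleep(100 * 2 ** (attempt - 1));
        continue; // every retry passes through this same handler again
      }
      if (action === "start-compensation") await hooks.compensate(e);
      if (action === "escalate-history") await hooks.escalate(e);
      throw e;
    }
  }
}
```

Injecting `sleep` as a hook keeps the function testable without real timers. The hint is re-read on every failure, so a call that first fails with `retry-with-backoff` and then with `escalate-history` is escalated rather than retried again.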

7.2 Embed world information in API responses

Another pattern is to include RML information in HTTP responses.

HTTP/1.1 409 Conflict
Content-Type: application/json
X-RML-World: RML3
X-RML-Action: escalate-history

{
  "code": "PAYMENT_ALREADY_SETTLED",
  "message": "Cannot cancel because payment is already settled"
}

By putting RML1/2/3 in X-RML-World and the ActionHint value (retry-local, retry-with-backoff, start-compensation, escalate-history, or abort) in X-RML-Action,

common layers such as:

  • API Gateway
  • BFF (Backend for Frontend)

can perform monitoring like:

  • "Why is the client performing infinite retries when it's RML3 + escalate-history?"
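As an illustration of that kind of monitoring, the sketch below flags a client that repeats a request after it was already answered with RML3 + escalate-history. Everything here is an assumption: the `RetryAfterEscalationMonitor` class, the idea of a per-request key (for example an idempotency key), and the lower-cased header names. State is kept in memory only.

```typescript
type ResponseHeaders = Record<string, string>;

class RetryAfterEscalationMonitor {
  // Number of terminal (RML3 + escalate-history) responses seen per request key.
  private terminalResponses = new Map<string, number>();

  // Call once per response. Returns true when the same request key has now
  // received a terminal response more than once, i.e. the client retried anyway.
  record(requestKey: string, headers: ResponseHeaders): boolean {
    const terminal =
      headers["x-rml-world"] === "RML3" &&
      headers["x-rml-action"] === "escalate-history";
    if (!terminal) return false;
    const count = (this.terminalResponses.get(requestKey) ?? 0) + 1;
    this.terminalResponses.set(requestKey, count);
    return count > 1;
  }
}
```

A gateway or BFF could emit a metric or log line whenever `record` returns true, which points at clients that ignore the hint.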

7.3 Testing and Linting as Governance

Finally, this is a matter of culture:

  • Introduce Linting to detect code that does not interpret ActionHint.
  • Have tests fail if a handler returns an RML3 exception to the user as a generic 500 error.
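The second check can be sketched as a small helper that a test suite calls. The names `ObservedResponse` and `findMaskedRml3` are hypothetical, and the sketch assumes the suite can collect, for each error code it exercises, the world label and the HTTP status that was actually returned.

```typescript
type World = "RML1" | "RML2" | "RML3";

// One row per error case exercised by the test suite (assumed to be collected elsewhere).
type ObservedResponse = { code: string; world: World; status: number };

// Returns every case where an RML-3 failure was flattened into a generic 500.
function findMaskedRml3(observed: ObservedResponse[]): ObservedResponse[] {
  return observed.filter((r) => r.world === "RML3" && r.status === 500);
}
```

A test would then fail when the returned list is not empty and print the offending codes, which turns the rule into something CI enforces rather than something reviewers have to remember.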

By including automated checks, you can make it easier to prevent the issue of:

"I knew about it, but I was too busy to address it."


8. Practical Checklist

Lastly, here is a quick checklist for designing exceptions in RML-2.

8.1 Designing Exception Classes / Error Codes

  • Which World (RML-1/2/3) should this failure be treated as?

  • What action is expected from the caller?

    • Abort on the spot
    • Retry
    • Start saga compensation
    • Escalate to RML-3
  • Do the logs and traces capture "world," "action," and "code"?

  • Does the monitoring and alerting configuration reflect the priority of "world" / "severity"?

8.2 Designing Sagas / Workflows

  • For each do/undo step:

    • Are the RML-1, RML-2, and RML-3 level failures explicitly written down?
  • Has the handling (manual operation or incident process) been decided for the case where an RML-3 failure occurs at this step?

  • Is there a common handling layer (e.g., client library) that follows ActionHints?

8.3 Designing Endpoints / APIs

  • Which world does this API primarily assume?

    • Pure RML-1 assumption (internal only)
    • RML-2 dialog assumption
    • Essentially RML-3 (e.g., external payment)
  • When an RML-3 level failure occurs:

    • Can it be identified via status codes or response headers?
    • Does it leave sufficient information in logs/audit trails?

9. Conclusion — "Separating exceptions by world" makes design conversations easier

When handling exceptions in the RML-2 world, the point is simple:

View exceptions
not as "technical failures"
but as "events occurring in a specific world."

  • RML-1 failures can be resolved locally.
  • RML-2 failures should be handled as sagas or dialogues.
  • RML-3 failures should be acknowledged as history, with proactive corrections accumulated.

Once this "worldview-based labeling" is in place,

  • Conversations between engineers
  • Conversations between engineers and business/legal/SRE teams

will become significantly easier.

In future chapters (tentative), I plan to delve deeper into the specific needs when stepping into the RML-3 world:

"Organizational theory in the RML-3 era (collaboration between legal and engineering)."
