From Guesswork to Search: Transforming Troubleshooting with "Rough Bayesian Search"

"We're getting a 503 error, can you take a look?"

With just these words, everything stops.
You don't know where to start looking.
You try checking things as they come to mind, but you're way off the mark.

It's a common scene.
Why do we lose our way so much during troubleshooting?

Introduction

This article examines why troubleshooting goes astray, from the perspective of "guesswork thinking."
It then introduces a concrete thinking framework for reframing investigation from a "game of hitting the right answer" into an "exploration problem of narrowing down candidates."

TL;DR

  • Troubleshooting is an exploration problem: the goal is to narrow down, not to hit the mark.
  • Assigning probabilities integrates fragmented information into a single picture.
  • The humility of including "Others" saves you from blind spots.

Target Audience

  • Anyone who doesn't know where to start an investigation.
  • Anyone who feels they have hit the limits of relying on intuition or experience.

"Guesswork" Thinking Causes Straying

What is happening in your head when you start an investigation?

"Where is the cause?" "That component looks suspicious." "I feel like something similar happened before."

This is "Guesswork."
It is a gambling mindset, trying to hit the right answer in one shot.

The problem with guesswork appears when you miss.
"That wasn't it. So, what's next?" You start thinking all over again from scratch.
If you're unlucky, you'll end up investigating the same place multiple times.

"Guesswork" thinking is what makes us wander.

Staying on Course through "Exploration"

The mindset at the opposite end from "Guesswork" is "Exploration."

In exploration, you don't try to hit the right answer.
Instead, you narrow down the possibilities of "where the right answer might be."

In guesswork, a miss is a "failure." You go back to square one.
In exploration, a miss is "information." You are left with certain knowledge that "it wasn't here."

Every time you miss, the candidates decrease.
You are always moving forward. That is why you don't go astray.

Assigning Probabilities

The starting point of exploration is listing candidates and assigning probabilities.
By integrating your knowledge into numerical values, the haziness in your mind takes shape.

First, write down the path the request takes.
In the case of a 503 error, the request must be failing somewhere.
Visualize that "somewhere."

Client → Load Balancer → Envoy → Application

Next, assign probabilities to each candidate cause.

  • Client: Since 503 is a server-side error, the probability of it being the cause is low.
  • Load Balancer: Low, as it is a highly reliable managed service.
  • Envoy: Medium, as there is more room for configuration errors than with managed services.
  • Application: High, as our own code is more prone to breaking.
  • Others: Leave some margin for something that might have been overlooked.

Client               5%

Load Balancer       10%

Envoy               25%

Application         45%
-------------------------------
Others              15%

There is no correct answer for these numbers.
Assign them based on your intuition and knowledge.

The important thing is to put everything you know into the numbers, leaving nothing out.
Even rough numbers turn a vague hunch into something you can compare and update.
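As a minimal sketch, the table above can be written down as data and sanity-checked. The function name `validate_priors` and the specific checks are illustrative assumptions, not part of the original procedure.

```python
# Hand-assigned prior probabilities (percent) for each candidate cause,
# copied from the table above.
priors = {
    "Client": 5,
    "Load Balancer": 10,
    "Envoy": 25,
    "Application": 45,
    "Others": 15,
}


def validate_priors(priors: dict[str, float]) -> None:
    """Raise ValueError unless the priors form a usable distribution.

    Hypothetical helper: checks that an "Others" bucket exists, that no
    value is negative, and that the values sum to 100.
    """
    if "Others" not in priors:
        raise ValueError("keep an 'Others' bucket for overlooked causes")
    if any(p < 0 for p in priors.values()):
        raise ValueError("probabilities must be non-negative")
    total = sum(priors.values())
    if abs(total - 100) > 1e-9:
        raise ValueError(f"priors sum to {total}, expected 100")


validate_priors(priors)

# Sort candidates from most to least probable for a quick overview.
ranked = sorted(priors.items(), key=lambda item: item[1], reverse=True)
for name, percent in ranked:
    print(f"{name:<15}{percent:>4}%")
```

Forcing the numbers to sum to 100 is what makes the trade-off visible: raising one candidate means lowering another.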

Practice: Hunting 503 Errors

Let's actually try it out.

Where to Start

Simply checking the most probable candidate first is a poor strategy. There are several factors to weigh when deciding what to check first:

  • Narrows the cause down significantly: Checking a midpoint in the path tells you immediately whether the issue is in the first or second half. This is the same idea as binary search.
  • Yields definite information: Where access logs are kept, checking them gives you a definite answer.
  • Low investigation cost: If there's a dashboard you can look at in 10 seconds, check it first even if the probability is low.
  • High probability: When several candidates are comparable on the points above, prioritize the more probable one.

Which one to emphasize depends on the situation. There is no single correct answer.
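One way to weigh these factors together, offered only as a sketch, is to score each possible check by its expected information gain per minute of effort. The checks, the costs in minutes, and the `implicates` sets below are all made-up assumptions for illustration, and the sketch assumes each check gives a definite yes/no answer.

```python
import math

priors = {"Client": 5, "Load Balancer": 10, "Envoy": 25,
          "Application": 45, "Others": 15}


def binary_entropy(p: float) -> float:
    """Expected bits learned from a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Hypothetical checks: "implicates" lists the candidates that remain
# suspicious if the check comes back positive; "cost_min" is a guess.
checks = [
    {"name": "Load Balancer logs", "implicates": {"Envoy", "Application"}, "cost_min": 2},
    {"name": "Envoy logs", "implicates": {"Application"}, "cost_min": 5},
    {"name": "Application logs", "implicates": {"Application"}, "cost_min": 10},
]


def score(check: dict, priors: dict[str, float]) -> float:
    """Expected information gain (bits) per minute of investigation."""
    total = sum(priors.values())
    p_positive = sum(priors[c] for c in check["implicates"]) / total
    return binary_entropy(p_positive) / check["cost_min"]


best = max(checks, key=lambda check: score(check, priors))
print(best["name"])
```

A check that splits the probability mass close to 50/50 and is cheap to run scores highest, which is the binary-search intuition expressed as a number. With these assumed costs, the Load Balancer logs come out first even though the Load Balancer itself is an unlikely culprit.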

This time, let's start with the Load Balancer, which is the boundary between the client and the server. As a transit point for requests, it balances the above perspectives well.

Round 1: Load Balancer Logs

   Client               5%

🔍 Load Balancer       10%

   Envoy               25%

   Application         45%
-------------------------------
   Others              15%

First, check the first half of the path. By looking at the Load Balancer logs, you can tell whether "the problem is in the first half or the second half."

Extracting the requests where the 503 occurred yielded the following information:

target_processing_time: 0.002
target_status_code: 503
elb_status_code: 503

target_status_code is 503, so the target behind the Load Balancer (Envoy) returned the 503, and the Load Balancer simply passed it on.
Also, target_processing_time is only 0.002 seconds, which suggests the error occurred with almost no processing taking place.
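The reading of these fields can be sketched as a small function. The function name `locate_503` and the returned labels are hypothetical, and the sketch assumes that a missing target response is recorded as `-` in `target_status_code`.

```python
def locate_503(entry: dict[str, str]) -> str:
    """Classify where a 503 seen in a load balancer log entry came from.

    Hypothetical helper. Assumes "-" in target_status_code means the
    target never sent a response.
    """
    if entry.get("elb_status_code") != "503":
        return "not_503"
    if entry.get("target_status_code", "-") == "503":
        # The target itself answered 503; the load balancer relayed it.
        return "behind_load_balancer"
    # No 503 from the target: the load balancer generated it, for example
    # because it could not reach a target.
    return "load_balancer_or_connection"


entry = {
    "target_processing_time": "0.002",
    "target_status_code": "503",
    "elb_status_code": "503",
}
result = locate_503(entry)
print(result)
```

For the entry above this yields `behind_load_balancer`, which matches the conclusion that the investigation should move on to Envoy.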

The Load Balancer itself was operating normally. The problem lies further downstream.

Based on the information obtained, we update the probabilities.

✅ Client               5% →  0%
                     Request had arrived

✅ Load Balancer       10% →  0%
       ↓             Was forwarding normally
------ Suspicious from here on ------

   Envoy               25% → 30%

   Application         45% → 55%
-------------------------------
   Others              15% → 15%

With one check, the number of candidates was reduced from five to three.
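For reference, here is what a mechanical Bayesian update would give for this round. It is a sketch under one stated assumption: the observed log entry is impossible if the Client or the Load Balancer were the cause, and equally likely under every remaining candidate. The function name `bayes_update` is illustrative.

```python
def bayes_update(priors: dict[str, float],
                 likelihoods: dict[str, float]) -> dict[str, float]:
    """Return posteriors in percent: prior times likelihood, renormalized.

    likelihoods maps a candidate to P(evidence | candidate is the cause);
    candidates not listed default to 1.0 (evidence equally compatible).
    """
    unnormalized = {c: p * likelihoods.get(c, 1.0) for c, p in priors.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence rules out every candidate; revisit the list")
    return {c: 100 * v / total for c, v in unnormalized.items()}


priors = {"Client": 5, "Load Balancer": 10, "Envoy": 25,
          "Application": 45, "Others": 15}

# Assumed likelihoods for "the target returned the 503".
likelihoods = {"Client": 0.0, "Load Balancer": 0.0}

posterior = bayes_update(priors, likelihoods)
rounded = {c: round(p, 1) for c, p in posterior.items()}
print(rounded)
```

Pure renormalization gives roughly 29% for Envoy, 53% for Application, and 18% for Others. The hand-assigned 30 / 55 / 15 above is close to that, which is all a rough update needs to be.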

Round 2: Envoy Logs

✅ Client               0%

✅ Load Balancer       0%

🔍 Envoy               30%

   Application         55%
-------------------------------
   Others              15%

Next, we look at Envoy, which is behind the Load Balancer. Envoy also has access logs, so we should be able to get some kind of information.

Checking the access logs yielded the following information:

{
  "response_code": 503,
  "response_flags": "UC",
  "duration": 1,
  "upstream_service_time": null
}

response_flags is UC.
This stands for "Upstream Connection termination": the connection between Envoy and its upstream (here, the application) was terminated. In other words, when Envoy tried to send the request, the connection had already been closed from the application side.
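When reading many such entries, a small lookup table helps. The sketch below covers only a handful of Envoy response flags, and the helper name `explain_flags` is hypothetical. It assumes multiple flags are comma-separated and that `-` means no flag was set.

```python
import json

# A small subset of Envoy response flags; not an exhaustive list.
RESPONSE_FLAGS = {
    "UH": "No healthy upstream hosts",
    "UF": "Upstream connection failure",
    "UO": "Upstream overflow (circuit breaking)",
    "NR": "No route configured",
    "UC": "Upstream connection termination",
    "UT": "Upstream request timeout",
}


def explain_flags(entry: dict) -> list[str]:
    """Translate the response_flags field of a log entry into descriptions."""
    raw = entry.get("response_flags") or "-"
    return [
        RESPONSE_FLAGS.get(flag, f"unrecognized flag: {flag}")
        for flag in raw.split(",")
        if flag and flag != "-"
    ]


log_line = ('{"response_code": 503, "response_flags": "UC", '
            '"duration": 1, "upstream_service_time": null}')
explanations = explain_flags(json.loads(log_line))
print(explanations)
```

Note that `upstream_service_time` is null in this entry, which is consistent with the request never having been served by the upstream.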

Updating the probabilities:

✅ Client               0%

✅ Load Balancer       0%

✅ Envoy               30% →  5%
       ↓ Was trying to forward normally
         (Leaving 5% considering the possibility of an Envoy bug)
------ Suspicious from here on ------

   Application         55% → 75%
-------------------------------
   Others              15% → 20%

The application is terminating the connection. We've narrowed it down significantly.

Round 3: Application Logs

✅ Client               0%

✅ Load Balancer       0%

✅ Envoy               5%

🔍 Application         75%
-------------------------------
   Others              20%

Is there really a problem with the application itself?

Let's check the application logs.

However, there are no logs for the corresponding request.

What does this mean? There are two possibilities:

  • The application did not cause an error.
  • The application did not receive the request in the first place.

But Envoy says, "The application terminated the connection."

It seems contradictory.

Round 4: Comparing Configurations

Since the connection area seems suspicious, let's compare the settings of Envoy and the application.

Envoy Settings

idle_timeout is set to 300 seconds. Looking into it, I found that idle_timeout controls how long Envoy keeps an idle HTTP Keep-Alive connection open.

clusters:
  - name: app_cluster
    common_http_protocol_options:
      idle_timeout: 300s

Application Settings

There were no explicit settings related to HTTP Keep-Alive on the application side. Looking it up, the default keepAliveTimeout for Node.js is 5 seconds, far shorter than Envoy's 300 seconds. This mismatch looks like the cause.

Resolution: What Was Happening

Let's organize what was happening in chronological order.

  1. Envoy sends a request to the application.
  2. The application returns a response.
  3. The connection is maintained via HTTP Keep-Alive.
  4. 5 seconds later, the application terminates the connection, thinking "it probably won't be used anymore."
  5. Envoy keeps the connection in the pool, thinking "it can still be used for another 295 seconds."
  6. A new request arrives.
  7. Envoy tries to reuse the connection from the pool.
  8. The connection is already closed → 503.
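The sequence above can be modeled in a few lines. This is a deliberately simplified sketch: it ignores retries, in-flight races, and TCP details, and the function name `reuse_outcome` and its return labels are hypothetical.

```python
def reuse_outcome(idle_s: float,
                  proxy_idle_timeout_s: float,
                  app_keepalive_timeout_s: float) -> str:
    """Model what happens when a request arrives for a pooled connection
    that has been idle for idle_s seconds (simplified)."""
    if idle_s >= proxy_idle_timeout_s:
        # The proxy has already discarded the connection, so it opens a new one.
        return "new_connection"
    if idle_s >= app_keepalive_timeout_s:
        # The proxy still trusts a connection the application has closed.
        return "stale_reuse"
    return "ok"


ENVOY_IDLE_TIMEOUT_S = 300  # from the Envoy settings above
NODE_KEEPALIVE_S = 5        # Node.js default noted above

for idle in (2, 60, 400):
    outcome = reuse_outcome(idle, ENVOY_IDLE_TIMEOUT_S, NODE_KEEPALIVE_S)
    print(f"idle {idle:>3}s -> {outcome}")
```

In this model, any request that arrives between 5 and 300 seconds after the connection was last used hits the stale connection. The window exists only because the proxy's timeout is longer than the application's.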

Neither was wrong. They just weren't in sync.

✅ Client               0%

✅ Load Balancer       0%

✅ Envoy               0%

✅ Application         0%
-------------------------------
⚠️ Others             100%

Reflection: Why This Approach Worked

Now that the investigation is over, let's look back at this approach.

Why Thinking in Probabilities Worked

1. Fragmented knowledge is integrated

"Since 503 is a server-side error..." "Since it's a managed service..."

These were fragmented pieces of knowledge. They existed separately in the head and didn't answer the question, "So, where exactly should I start looking?"

Quantifying them as probabilities integrates them into a single criterion for judgment. Values like "Client 5%, Load Balancer 10%, Application 45%" represent an overall evaluation of those fragmented pieces of knowledge.

What was previously "I have a feeling the app is suspicious" turns into a calm judgment: "I am 45% confident the app is suspicious. However, there is a 55% chance it is something else."

2. Biases are visualized

"The log I just saw looked suspicious," "There was a similar failure before."

Without quantification, you get pulled along by such recent information. With quantification, you can see what you are overweighting and correct for it.

3. The next action is determined

You can't act on "everything is suspicious." Having probabilities made it clear where to investigate and what information would shift the probabilities.

The Margin Called "Others"

The answer was "Others."

An inconsistency in HTTP Keep-Alive settings between Envoy and the application. Both components were working correctly on their own. The problem lay in the "relationship."

This is why it couldn't be listed as a candidate initially. When listing candidates, we look at "components." However, problems can sometimes lurk in the invisible gaps "between components."

"Others" represents the humility to accept that there may be things you don't know or can't see.

It was precisely because of this margin of humility that we were able to reach the correct answer this time.

Remembering as Structural Tendencies

Once the investigation is over, you reflect. "How can I apply this learning to the next investigation?"

There is one note of caution here.

"Since there was an HTTP Keep-Alive issue recently, maybe it's the same next time." This is a bias. The sample size is too small. It might just be a coincidence this time.

Instead, remember it as a structural tendency. Add the perspective of "consistency of settings between components" to your future candidate listings. Also, improve the accuracy of initial probabilities based on the information learned.

This is what it means for "experience to improve the starting point of the next investigation." Rather than memorizing specific causes, you increase the perspectives and information you should look at.

What we have done so far is what this article calls "Rough Bayesian Search."

Bayesian search is a method of "searching while updating the probability of each candidate every time new information is obtained." You start with a probability distribution of "where it is likely to be" and update that distribution according to the investigation results. It is famous for its use in maritime search and rescue.
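In symbols, the update is the standard Bayes' rule. The notation here is introduced for illustration and does not come from the investigation above: H_i are the candidate causes, E is the new evidence, and q_i is the chance that checking candidate i would reveal the cause if it were really there.

```latex
% Bayes' rule over candidate causes H_1, ..., H_n given evidence E
P(H_i \mid E) = \frac{P(E \mid H_i)\, P(H_i)}{\sum_{j=1}^{n} P(E \mid H_j)\, P(H_j)}

% Special case used in search problems: candidate i was checked and nothing was found
P(H_i \mid \text{miss}) = \frac{P(H_i)\,(1 - q_i)}{1 - q_i\, P(H_i)}, \qquad
P(H_j \mid \text{miss}) = \frac{P(H_j)}{1 - q_i\, P(H_i)} \quad (j \neq i)
```

With q_i = 1, a miss eliminates candidate i outright and its probability is redistributed to the others. Keeping a few percent on a candidate after checking it, as with Envoy in Round 2, is in the spirit of treating the check as slightly less than conclusive.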

The numbers can be rough. The important thing is that by assigning numbers, you gain a bird's-eye view of the whole.

Fragmented knowledge, experience, and intuition become a single map. With that map, you will no longer get lost in "where to start investigating."

Summary

Troubleshooting is not "Guesswork."

Don't try to hit the answer; narrow it down. Troubleshooting is an exploration problem of efficiently shrinking the candidate space.

The steps are simple:

  1. List the candidates.
  2. Assign probabilities.
  3. Gather information that can move the probabilities the most.
  4. Update and repeat.

If you're wondering "where to look," try writing down the candidates and assigning probabilities. Don't forget "Others." Humility guards against unfounded assumptions.

In that moment, investigation transforms from "Guesswork" into "Exploration."
