What It's Like to Work in Software Support

Introduction

There are many different types of jobs in the world of software. To people outside the industry, the one that gets the most limelight is the developer, especially the programmer. There are many manga, anime, and movies featuring "super programmers." Other roles are less well known because they are hard to picture without actually doing them. In this article, I focus on one of those roles, the Support Engineer, and describe what the job entails. The target audience is developers, especially novice developers who have only built things used by themselves or by people close to them.

While there are various types of support engineers, here I will use a very simple definition: an engineer at an SI (System Integration) company who receives requests from SEs (System Engineers) to solve problems when trouble occurs in a customer's system. In some organizations a developer also serves as a support engineer and in others they do not; this article applies to either case.

I once spent several years at an SI company developing and supporting the Linux kernel for large systems that underpin social infrastructure. This article is based on what I experienced and heard during that time. Since my experience covers a very narrow scope, there are likely many environments where the content of this article does not apply, but I hope you will read it as how things look to at least one person with support experience.

The motivations for writing this article are as follows:

  • I want people to know what support work is and what makes it attractive.
  • I want people to understand that development and support complement each other, and that each requires a different set of skills.
  • Based on the above, I want to change the perception[1] that support is easier than, or a step below, development.

The introduction has become quite long, so I will get into the specific details in the next section.

Understanding the Values of Customers and SEs

Conversations between a support engineer and others often fail to connect at all because the values of the customer and the SE differ from those of the support engineer. (Below I call these "customer values," assuming the customer and SE share the same values.) As a somewhat extreme example, consider Person A, an SE, and Person B, a former developer who recently became a support engineer. Their values are as follows:

  • Person A's values: It is preferable to continue the service without trouble.
  • Person B's values: It is preferable to write beautiful code using a favorite programming language.

You can already sense the trouble brewing, and this mismatch leads to the following terrifying conversation:

  • Person A: "Can you tell me the current status of the problem I reported the other day?"
  • Person B: "(Showing the source code) The problem is that this bar method in the foo class is O(n), so it doesn't scale when n increases."
  • Person A: "??? When do you expect the problem to be resolved?"
  • Person B: "The code here is 'rotten,' so it's better to spend about a week fundamentally rewriting it."
  • Person A: "Could we somehow minimize the impact and perform a temporary fix in a short period?"
  • Person B: "I don't want to do something so 'uncool'."
  • Person A: "???"

The issue here is that although the user of the system is ultimately the customer, Person B is ignoring the customer's values and only saying what they like. It is common for programmers who are considered excellent to be completely unable to perform support work, and a significant number of them fall into this pattern.

To avoid this, Person B needs a shift in mindset: putting themselves in the other person's shoes and placing the customer's values first and their own values second. If someone absolutely cannot or does not want to do this, it is better to leave support to someone else. Even then, rather than arrogantly thinking "Person A, who doesn't understand my values, is inferior," it would be more constructive to think, "I can write code, but I can't do support. The support engineer and I should complement each other."

Minimize the Impact on the System

Suppose a problem occurs after a software version update. For simplicity, let's assume there are 100 commits between the previous version and the current version, and the reproduction procedure is clear. In this case, one straightforward and highly reliable way to find the root cause is "bisect." Specifically, you first run the reproduction test at the 50th commit after the previous version. If the problem occurs, you next test the 25th commit; if it does not, you test the 75th. Repeating this halving pinpoints the commit that introduced the problem in a logarithmic number of tests, about seven for 100 commits[2].
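The halving procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: commits are numbered 1 to 100 after the last known-good version, the problem stays present in every commit after the one that introduced it, and `reproduces` is a hypothetical stand-in for running the reproduction procedure against a given commit. In practice `git bisect` drives the same search for you.

```python
from typing import Callable, Tuple


def find_first_bad(num_commits: int,
                   reproduces: Callable[[int], bool]) -> Tuple[int, int]:
    """Return (first bad commit, number of reproduction tests run).

    Commit 0 is the previous, known-good version; commit num_commits is
    the current, known-bad version. `reproduces(n)` is a hypothetical
    callback that runs the reproduction test at commit n.
    """
    good, bad = 0, num_commits
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2      # test the midpoint of the suspect range
        tests += 1
        if reproduces(mid):
            bad = mid                # problem already present: look earlier
        else:
            good = mid               # problem not yet present: look later
    return bad, tests


if __name__ == "__main__":
    # Pretend commit 37 introduced the problem (an invented value).
    culprit, tests = find_first_bad(100, lambda n: n >= 37)
    print(culprit, tests)
```

Each call to `reproduces` corresponds to one version swap, which is why the number of tests, not the amount of computation, is the cost that matters on a real system.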

Can you try this binary search on a customer's production system? In most cases, the answer is no. This is because swapping software versions on a customer's system means the system stops each time, which in turn means the customer's service stops. Therefore, it is unlikely they will allow you to try such things multiple times. Similarly, you won't often be allowed to enable logs or traces that cause significant performance degradation, which are usually only used in development environments. For this reason, support engineers must always have many means of isolating problems and understand the conditions under which they can be used.

Requests Should Be Specific, Minimum in Number, and Clear in Intent

You have probably received support for some product you own at least a few times. Think about that experience. How would you feel if, on top of being unhappy that a problem has occurred, you were given one request after another? You would probably start to get frustrated. The same applies to support for computer systems.

The more requests you make to an SE, the more you amplify their negative emotions. Furthermore, the less clear the intent of a request, the more the SE comes to distrust the support engineer. To avoid this, keep the number of requests to the SE to a minimum and state clearly what you want them to do and for what purpose. Do not expect the SE to guess the intent behind a request and respond accordingly.

Let me elaborate a bit more on the number of requests. To proceed with an investigation while the number of requests is limited, you can use methods such as the following:

  • Take logs at key points during system operation
  • Similarly, collect metrics
  • Obtain a core dump (a file that holds the contents of memory at the time of an abnormal termination) when the system crashes

A support engineer needs detective-like abilities to deduce the root cause from such circumstantial evidence, while also working under the constraint of minimizing the number of requests and the scope of impact. This kind of ability is different from programming ability, and it is common for excellent programmers to be poor at troubleshooting (and vice versa).

Since it is difficult to set up logs, metrics, and core dumps after the fact and then try to reproduce the issue, it is best to have them collected at all times wherever possible.

Although developers are the ones who implement logs and metrics, a support perspective is needed to decide which logs and metrics would make an investigation easier. Therefore, development and support should work closely together to improve service levels. In addition, I believe developers should experience support work at least once to acquire a support perspective. In fact, I constantly feel that experiencing both helped me grow significantly as a software engineer.

Investigate Based on Records, Not Memory

When making inquiries to an SE, base the conversation on software output as much as possible. This is because people say things that are not true without realizing it, even with no malicious intent. The most famous example is "It broke even though I didn't do anything." A typical conversation looks like this:

  1. SE: "Problems started occurring yesterday."
  2. Support Engineer: "Did you do anything around the time the problem occurred?"
  3. SE: "I didn't do anything."
  4. Support Engineer, exhausted from the investigation days later: "(The software version has changed...)"

To avoid this, verification should rely on records such as software logs rather than the memory of the SE or other involved parties. For example, in the exchange above, it would have been better if the support engineer had said something like, "Please show me the software update logs." To make such requests, you need to know where records are left, and what kind, whenever the state of the system changes.
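As a rough sketch of checking records instead of memory, the snippet below lists every recorded change in the period leading up to an incident. The record format (an ISO 8601 timestamp plus a description) and the sample entries are assumptions made up for illustration; real update logs differ by system.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Record = Tuple[str, str]  # (ISO 8601 timestamp, description): assumed format


def changes_before(records: List[Record], incident: datetime,
                   window_hours: int = 24) -> List[Record]:
    """Return the records that fall within window_hours before the incident."""
    start = incident - timedelta(hours=window_hours)
    return [
        (ts, what)
        for ts, what in records
        if start <= datetime.fromisoformat(ts) <= incident
    ]


if __name__ == "__main__":
    # Invented sample records, not taken from any real system.
    update_log = [
        ("2024-05-01T09:00:00", "config reload"),
        ("2024-05-07T22:15:00", "package foo updated"),
        ("2024-05-09T03:00:00", "log rotation"),
    ]
    incident_time = datetime(2024, 5, 8, 10, 0, 0)
    for ts, what in changes_before(update_log, incident_time):
        print(ts, what)
```

With these sample records, the only entry in the 24 hours before the incident is the package update, which is the kind of fact an SE may honestly not remember.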

When Support is Divided into Multiple Tiers

In the previous sections, I assumed that an SE sends a request for investigation directly to the support engineer of a specific component when a problem occurs. However, once a system reaches a certain scale, support is almost always organized into multiple tiers. Specifically, the SE requests an investigation from a primary support engineer, who performs initial isolation and then hands the investigation on to the support engineers of the components that seem to be involved. The reason is efficiency: if a system consists of two components A and B, for example, it would be wasteful to send every problem report to the support engineers of both A and B.

A primary support engineer needs a completely different set of skills than a developer or a support engineer for a single component. First, rather than deep knowledge of a specific component, they need broad and shallow knowledge of multiple software products within their range of responsibility[3].

Primary support engineers deal with a large number of stakeholders. They need to handle many tasks in parallel: making requests to the support engineers for each component based on reports from the SE, obtaining answers, sending reminders in some cases, and then reporting back to the SE. When an investigation stalls and there seems to be nothing left to try, they may need to organize and coordinate a meeting where all stakeholders discuss the issue.

I don't have experience as a primary support engineer, so I cannot say anything for certain, but I think primary support might suit those who want to oversee a wider range of the system and be at the hub where things happen.

Conclusion

I believe support engineering is attractive for people who want to act like detectives: forming hypotheses based on evidence about what happened when a problem occurred and then verifying them. Also, by experiencing support engineering, you can learn firsthand what to look out for when developing software in order to speed up problem resolution and increase the probability of a successful fix, which is useful for maturing as a developer.

I hope that reading this article helps more people understand what support engineers do and encourages more people to try the work themselves. I also hope that there will be more cases where SEs, support engineers, and developers work together constructively to resolve issues when they occur.

Footnotes
  1. Especially from developers who don't know support. ↩︎

  2. git has a subcommand called bisect specifically for doing this. ↩︎

  3. Occasionally, there are super support engineers who have both deep and broad knowledge. ↩︎
