Exploring a Case Study on Technical Debt

https://twitter.com/MinoDriven/status/1668108150858461184

This case study looked interesting, so I'll think through how I would approach it.

Aligning Direction with Management

First, I would use development metrics such as the Four Keys (DORA metrics), MTTx, and Daichi Hiroki's d/d/d metric to show management that the development team's performance is relatively low, and build a shared understanding of the problem.
From there, we would discuss how to respond given the current situation, and I would settle on a direction based on what we agreed, the state of the business, and the authority granted to me.

Especially at a startup, the CEO's view is crucial, and it is hard to move forward without building consensus there.
Working backward from the situation, it is highly likely that the whole company, CEO included, believes that "refactoring will delay feature development."
The company may also genuinely be in a critical position where shipping features right now is urgent.
Since I cannot claim that refactoring will produce dramatic effects immediately, I would focus on reaching a firm agreement on how much room there is to invest with a medium- to long-term view.

Conversely, if the business can absorb the cost of a replacement and there is a desire to improve over the medium to long term, replacing the system is the quickest route.
In that case, I would seek agreement only after stressing the premise that "we cannot guarantee exactly the same functionality as today."

Even short of a complete replacement, there are various options, such as carving out and replacing individual parts, or building new features on a new foundation.

However, in this case study, since the members' awareness of design is low, even if we rebuilt it, the same cycle would likely repeat unless the fundamental culture is improved. A half-baked replacement might actually lead to more chaos.

Reducing Change Failure Rate

From here, I'll assume that a full replacement is not possible for business reasons, but that I have been given a certain degree of authority to fix things.

It is hard to do everything at once, and trying to change the culture drastically overnight tends to cause friction, so I would first focus on improving the Change Failure Rate.

Failed changes have a large impact on users, so this is what I want to fix first.
The improvement is also easy for everyone to feel, non-engineers included, so achieving results here builds trust and makes later actions easier to take.
In addition, once people pay attention to this metric, I expect viewpoints such as "this style of code is bad because it invites errors" to take hold, raising the members' awareness of code quality.

Add Change Failure Rate to KPIs

I could track the Change Failure Rate single-handedly or set up a DevOps team to do it, but from a medium- to long-term perspective, awareness has to grow in every member of the development organization.
In particular, if someone else fixes a bug and no feedback ever reaches the original developer or reviewer, the situation will never improve.
We must not create an atmosphere of finger-pointing or blame, but the structure should ensure that the people involved take a certain level of responsibility.
Therefore, I would add the Change Failure Rate to each team's KPIs and have each team work autonomously on improving its own rate.
If possible, I would also include the team's Change Failure Rate in the members' performance evaluation criteria.
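
As a rough illustration of what tracking this per team could look like, here is a minimal sketch. It assumes the usual DORA-style definition (deployments that caused a failure divided by all deployments). The `Deployment` record, its fields, and the team names are hypothetical and not taken from the case study; in practice the data would come from a deploy log or incident tracker.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production deployment."""
    team: str
    caused_failure: bool  # assumption: True if it needed a rollback, hotfix, or patch


def change_failure_rate_by_team(deployments):
    """Return {team: failed deployments / total deployments} for each team."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for d in deployments:
        totals[d.team] += 1
        if d.caused_failure:
            failures[d.team] += 1
    return {team: failures[team] / totals[team] for team in totals}


# Made-up sample data, for illustration only.
sample = [
    Deployment("checkout", False),
    Deployment("checkout", True),
    Deployment("checkout", False),
    Deployment("checkout", False),
    Deployment("search", False),
    Deployment("search", False),
    Deployment("search", False),
]

# Print teams from lowest to highest rate.
for team, rate in sorted(change_failure_rate_by_team(sample).items(), key=lambda kv: kv[1]):
    print(f"{team}: {rate:.0%}")
# search: 0%
# checkout: 25%
```

What counts as a "failure" is the part each organization has to agree on; the calculation itself is trivial once that definition is fixed.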

When a failure occurs, I would have the team write an incident report and make it mandatory to propose measures that prevent recurrence.
If the members can act autonomously, I would leave the specific measures to each team without interfering much, but a good first direction would be to set up E2E tests or controller-level tests.
This inverts the Testing Pyramid, but a single test that covers a wide range is more cost-effective than building out unit tests, and its benefit is easier to feel.
Moreover, unit tests written to fit the current poor design would probably have to be rewritten in large numbers once refactoring begins.
So first I want to cover the whole codebase with tests that are quick to write and broad in reach, until we can feel safe whenever the tests pass.
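
To make "controller-level test" concrete, here is a small sketch using only Python's standard library. The `TodoStore` and `handle_request` names are hypothetical stand-ins for a real application's storage layer and controller entry point; the case study does not specify a language or framework. The point is that each test goes through the request entry point and exercises routing, validation, and storage together, without depending on the internal class structure.

```python
import json
import unittest


class TodoStore:
    """Hypothetical in-memory storage layer."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def add(self, title):
        item = {"id": self._next_id, "title": title, "done": False}
        self._items[item["id"]] = item
        self._next_id += 1
        return item

    def get(self, item_id):
        return self._items.get(item_id)


def handle_request(store, method, path, body=None):
    """Hypothetical controller entry point: returns (status, JSON string)."""
    parts = [p for p in path.split("/") if p]
    if method == "POST" and parts == ["todos"]:
        title = json.loads(body or "{}").get("title", "").strip()
        if not title:
            return 400, json.dumps({"error": "title is required"})
        return 201, json.dumps(store.add(title))
    if method == "GET" and len(parts) == 2 and parts[0] == "todos" and parts[1].isdigit():
        item = store.get(int(parts[1]))
        if item is None:
            return 404, json.dumps({"error": "not found"})
        return 200, json.dumps(item)
    return 404, json.dumps({"error": "unknown route"})


class TodoControllerTest(unittest.TestCase):
    def setUp(self):
        self.store = TodoStore()

    def test_create_then_fetch(self):
        # One test covers routing, validation, and storage in a single pass.
        status, body = handle_request(self.store, "POST", "/todos", '{"title": "write tests"}')
        self.assertEqual(status, 201)
        created = json.loads(body)
        status, body = handle_request(self.store, "GET", f"/todos/{created['id']}")
        self.assertEqual(status, 200)
        self.assertEqual(json.loads(body)["title"], "write tests")

    def test_rejects_blank_title(self):
        status, _ = handle_request(self.store, "POST", "/todos", '{"title": "  "}')
        self.assertEqual(status, 400)
```

Saved as a module, this runs with `python -m unittest`. Because the tests only touch `handle_request`, the classes behind it can be refactored freely without rewriting them.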

Also, since there are many development members, I assume they are split into multiple teams.
Therefore, it might be good to visualize the Change Failure Rate for each team, recognize outstanding teams, and provide a forum for teams to regularly share their initiatives.

Meanwhile, I would probably focus on cross-cutting groundwork that affects the whole system, such as building the E2E test foundation and improving the CI/CD flow.

Raising Awareness of Design

Since the root problem is the members' low awareness of technical quality, I would hold study sessions and reading groups to raise understanding at the conceptual level.
If awareness of the Change Failure Rate has grown by then, it should lead people to think about how to write code that is less prone to defects, and I hope more of them will start noticing problems such as God classes and mutability on their own.

Furthermore, if test code is already comprehensive, they will be able to make changes without fearing bugs, which should lower the barrier to refactoring.

Once this culture has taken root across the organization to some extent, it might be good to establish a DevOps team dedicated to improving the four DevOps metrics (the Four Keys).

As for my own role, I would probably move to wherever the gap is largest.
Whether that means pushing refactoring forward, concentrating on code reviews, or joining feature development, I should be able to act more freely once the members start driving improvements autonomously.

Conclusion

In reality, the "correct" answer is "it depends" or "I won't know until I talk to them," but since that wouldn't work as a thought experiment, I've written this with certain assumptions about the situation.

I believe the biggest issue is not that the situation is bad in absolute terms, but that there is a culture in which problems are left unaddressed and improvements do not progress. Therefore, identifying the root cause of that culture is the most important step.

For instance, in a culture of blame, people cannot show vulnerability and will repeatedly justify the current state. Furthermore, if engineers haven't built a relationship of trust, any attempt at improvement might be viewed as "just looking for an excuse to do what they want to do!"

Regarding this case study, I saw several opinions suggesting "replacing the people." That is indeed a fast way to change a culture drastically, and I have seen cases where it succeeded, so it may be worth keeping as an option.
