Translated by AI
Why Troubleshooting Integrated Systems Is Difficult and How to Decide Between Outsourcing and In-house Development
Introduction
This article outlines why incident investigation becomes harder, and the coordination costs of recovery rise, as you integrate with more external services and third-party systems. It also summarizes practical criteria for deciding which areas should rely on external services and which should be developed in-house to retain control. The focus is on common bottlenecks: multi-vendor coordination, time zone differences, language barriers, unclear responsibility boundaries, and the gap between regulatory reporting and user communication.
More Integration Means Ceding Control Over Key Decisions
Adding integrated systems is not simply adding features. It ties the fate of your service more closely to the operations and specification changes of other companies.
Specifically, the following constraints arise:
- You cannot decide the recovery speed during an outage on your own.
- You must follow the partner's schedule for API specification changes.
- You are directly affected by price revisions or changes in contract terms.
- Investigation speed is influenced by the quality of the partner's support desk.
While integration is convenient, it always carries the trade-off of reduced controllability.
Reduced Visibility During Outages
As the number of integration points increases, incident investigation becomes more difficult. This is because causes do not remain confined within a single system but span across boundaries.
Common bottlenecks include:
- Time is spent identifying where your responsibility ends and the partner's begins.
- Log granularity and formats differ among companies, making it difficult to align timelines.
- Retry and timeout specifications differ, making incidents hard to reproduce.
- There are multiple support desks, so investigation requests go back and forth repeatedly.
- Partners do not always share the necessary information, so decisions are delayed for lack of critical data.
In this state, communication costs become more of a bottleneck than technical difficulty. In particular, information that you consider important may be judged low priority by the system provider and left out of reports. As a result, the investigation proceeds on mismatched assumptions, and recovery is delayed when the missing information is discovered later. This is a classic example of coordination costs.
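The log-alignment bottleneck listed above is one place where a small amount of in-house tooling can help. The following is a minimal Python sketch, not a definitive implementation. It assumes two hypothetical partner log formats (an ISO 8601 timestamp with a local offset, and epoch seconds in UTC) and normalizes both to UTC so that events can be read on a single timeline. The formats, the `Event` type, and the function names are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    at: datetime      # always timezone-aware, normalized to UTC
    source: str       # which system produced the log line
    message: str


def parse_partner_a(line: str) -> Event:
    # Hypothetical format: "2024-05-01T09:00:03+09:00 payment timeout"
    ts, message = line.split(" ", 1)
    at = datetime.fromisoformat(ts).astimezone(timezone.utc)
    return Event(at, "partner_a", message)


def parse_partner_b(line: str) -> Event:
    # Hypothetical format: "<epoch seconds>|<message>", epoch is in UTC
    epoch, message = line.split("|", 1)
    at = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return Event(at, "partner_b", message)


def merge_timeline(*streams: Iterable[Event]) -> List[Event]:
    # Flatten every stream and sort by the normalized UTC timestamp.
    events = [event for stream in streams for event in stream]
    return sorted(events, key=lambda event: event.at)
```

Each partner gets its own parser, and everything downstream only sees `Event` values in UTC, so adding a third partner means adding one parser rather than changing the investigation workflow.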
The "triage hell" often seen in the field looks something like this:
- Partner A: We cannot provide logs.
- Partner B: The phenomenon cannot be reproduced.
- Internal SRE: Which timeout value is correct?
- Product Team: When will it be fixed?
- CS: We cannot explain this to the customer.
Once this back-and-forth begins, coordination work takes up a sharply growing share of the effort relative to technical investigation, and the time to recovery often grows with it.
Furthermore, in cross-company investigations, decision-making slows down as more companies become involved. If even one overseas vendor is involved, joint meetings are hard to schedule across time zones, and first responses can be delayed by half a day to a full day. The language barrier also makes it difficult to convey the nuances of the incident, and if the parties interpret logs or reproduction conditions differently, triage is prolonged further.
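The question "which timeout value is correct?" in the exchange above can at least be checked arithmetically before it reaches a multi-vendor call. The sketch below is a hedged illustration, assuming a simple policy of a fixed per-attempt timeout with exponential backoff between attempts. The `RetryPolicy` type and its fields are hypothetical names, not any partner's actual specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    timeout_s: float             # timeout applied to each attempt
    max_attempts: int            # total attempts, including the first
    backoff_base_s: float        # wait before the first retry
    backoff_factor: float = 2.0  # multiplier applied to each later wait

    def worst_case_seconds(self) -> float:
        # Every attempt times out, and every wait between attempts is taken.
        waits = sum(
            self.backoff_base_s * self.backoff_factor ** i
            for i in range(self.max_attempts - 1)
        )
        return self.timeout_s * self.max_attempts + waits


def fits_within(caller_timeout_s: float, policy: RetryPolicy) -> bool:
    # True if the callee's retries can finish before the caller gives up.
    return policy.worst_case_seconds() <= caller_timeout_s
```

If the caller's timeout is shorter than the callee's worst case, the caller may give up while the callee is still retrying, which is one way the two sides end up observing different outcomes for the same request.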
Areas Where Dependence Is Difficult to Avoid
In reality, you cannot bring everything in-house. From the perspectives of expertise and institutional requirements, there are areas where reliance is rational.
- Areas requiring expert knowledge, such as advanced fraud detection or identity verification.
- Areas where connection reliability and proven track records affect approvals, such as payments or financial connections.
- Areas where using certified services is practical for legal or audit compliance.
In other words, the point is not whether to rely on others, but how you design that reliance.
In-House Development Reduces Coordination Costs More Than Implementation Effort
The value of in-house development is not just about development speed. It significantly shortens the decision-making and investigation paths during incident response.
- You can standardize log design according to your own criteria.
- You can increase or decrease monitoring items to match business requirements.
- You can determine the priority of specification changes yourself.
- It is easier for the parties involved to resolve issues directly when they arise.
As a result, recovery times and operational stress are often reduced. Lower coordination costs also tend to reduce the frequency of incidents and outages themselves.
Practical Criteria for Separating Reliance and In-House Development
Rather than making black-and-white decisions for everything, it is more practical to judge each area based on the following perspectives:
- Substitutability: Can the partner be switched in the future?
- Observability: Can you obtain the necessary logs and metrics during an incident?
- Change Resistance: Is the impact limited when a specification change occurs?
- Contract Risk: Do the SLA and support conditions match actual operations?
- Business Impact: To what extent will sales or operations stop if the connection goes down?
Viewed through these criteria, the closer an area is to your core business, the higher its priority for in-house development.
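The five criteria above can be turned into a rough ranking aid. The following Python sketch is only one possible way to do this. The 1-to-5 scale, the weights, and the names `AreaAssessment` and `in_house_priority` are assumptions made for illustration, and the result is a conversation starter rather than a decision rule.

```python
from dataclasses import dataclass
from typing import Dict, List

CRITERIA = (
    "substitutability",
    "observability",
    "change_resistance",
    "contract_risk",
    "business_impact",
)


@dataclass
class AreaAssessment:
    name: str
    # Score per criterion, from 1 (little concern if outsourced)
    # to 5 (serious concern if outsourced). For example, a high
    # "substitutability" score means the partner is hard to replace.
    scores: Dict[str, int]


def in_house_priority(area: AreaAssessment, weights: Dict[str, float]) -> float:
    # Weighted average of the concern scores; higher means a stronger
    # case for bringing the area in-house.
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(area.scores[c] * weights[c] for c in CRITERIA) / total_weight


def rank_areas(areas: List[AreaAssessment], weights: Dict[str, float]) -> List[str]:
    # Area names ordered from highest to lowest in-house priority.
    ordered = sorted(areas, key=lambda a: in_house_priority(a, weights), reverse=True)
    return [a.name for a in ordered]
```

Giving `business_impact` a larger weight than the other criteria is one way to express the point that areas closer to the core business deserve more attention.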
You Do Not Need to Own Everything, Just the Application Level
That said, maintaining infrastructure platforms in-house is often impractical considering the operational personnel and 24/7 support requirements. The key is not full-stack in-house development, but having control at the application level.
- Business logic
- Data models
- Design of monitoring and logs
- Triage workflows during incidents
If you can design this layer internally, it becomes easier to ensure operational quality and improvement speed, even if you entrust infrastructure or certain platforms to third parties.
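As an illustration of owning the monitoring and log design at the application level, the sketch below wraps every call to an external partner so that your own side always records a correlation ID, the outcome, and the elapsed time, regardless of what the partner later shares. It uses only the Python standard library. The function name `call_partner` and the log field names are hypothetical choices, not an established convention.

```python
import json
import logging
import time
import uuid
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("integration")


def call_partner(
    partner: str,
    operation: str,
    fn: Callable[[], T],
    correlation_id: Optional[str] = None,
) -> T:
    # Reuse the caller's correlation ID if given, otherwise create one.
    correlation_id = correlation_id or str(uuid.uuid4())
    started = time.monotonic()
    outcome = "ok"
    try:
        return fn()
    except Exception as exc:
        # Record the failure type, then let the caller handle the error.
        outcome = type(exc).__name__
        raise
    finally:
        # One structured line per external call, in a format you control.
        logger.info(json.dumps({
            "correlation_id": correlation_id,
            "partner": partner,
            "operation": operation,
            "outcome": outcome,
            "elapsed_ms": round((time.monotonic() - started) * 1000, 1),
        }))
```

Because the record is written on your side of the boundary, it remains available during an incident even when a partner cannot provide its own logs.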
External communication during outages also differs. Wide-area outages at major cloud providers are easier to explain as external factors because many companies are affected at once. Outages in application-level services outside the infrastructure layer, on the other hand, are easily perceived as problems with your own company's design or operations, which makes them harder to explain to users.
In sectors like finance, there are regulatory frameworks that allow an incident to be reported to the authorities as an "external factor not caused by the company." However, being able to classify the cause in a regulatory report is a different matter from maintaining user trust. Even if you can categorize it as an external factor in a report to the Financial Services Agency, the fact that the service stopped is unchanged from the user's perspective. That is precisely why it is worth retaining control over the application layer's observability and recovery workflows.
Operational Costs Can Justify In-House Development
From the perspective of operational costs, in-house development can become more rational than external reliance as scale increases. A representative example is Netflix, which built its own CDN, "Open Connect," to optimize delivery quality and cost in its core area of video streaming as it grew. At the same time, it did not move everything to self-built infrastructure and continued to use AWS. This case shows that separating core from non-core areas, and keeping control of the core, is more effective than bringing everything in-house.
However, this judgment is not directly applicable to every company. The important thing is to periodically compare external usage fees with internal operational costs to determine which is sustainable for your company's scale.
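The periodic comparison described above can start from a simple break-even calculation. The sketch below assumes a deliberately simplified cost model: the external service charges a base fee plus a per-request fee, and the in-house option has a fixed monthly cost (staff, on-call, hosting) plus a per-request variable cost. All parameter names are hypothetical, and real pricing is usually tiered and less linear than this.

```python
from typing import Optional


def monthly_external_cost(requests: int, fee_per_request: float, base_fee: float = 0.0) -> float:
    # What the external provider charges for a month at this volume.
    return base_fee + requests * fee_per_request


def monthly_in_house_cost(requests: int, fixed_cost: float, variable_cost_per_request: float) -> float:
    # Fixed operating cost plus per-request cost of running it yourself.
    return fixed_cost + requests * variable_cost_per_request


def break_even_requests(
    fee_per_request: float,
    fixed_cost: float,
    variable_cost_per_request: float,
    base_fee: float = 0.0,
) -> Optional[float]:
    # Monthly volume above which in-house becomes cheaper, or None if
    # the per-request saving is zero or negative under these assumptions.
    saving_per_request = fee_per_request - variable_cost_per_request
    if saving_per_request <= 0:
        return None
    return max(0.0, (fixed_cost - base_fee) / saving_per_request)
```

Re-running this with current volumes and prices at a regular interval gives a concrete trigger for revisiting the decision, instead of relying on a one-time judgment.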
Relying on Outsourcing for Core Areas Can Be Disadvantageous During Growth
Outsourcing core parts also means handing over the right to decide the fate of your service to others. This may work during the startup phase, but in stages where the business scales, there are many cases where in-house development should be considered.
Through in-house development, the speed of feature improvement and the cost structure can change dramatically, which can lead to significant business growth. The example of Netflix, where investment in core areas became its competitive advantage, is a clear instance of this.
Conversely, if you neglect the core and try to appeal to users only through non-core areas, the contest eventually comes down to advertising spend. Unless your company has significant capital, this is difficult to sustain in practice.
Conclusion
Having many integrated systems leads to a decline in controllability alongside functional expansion. During incidents, the complexity of responsibility boundaries and coordination becomes a bigger problem than technical issues.
On the other hand, there are definitely areas where reliance is difficult to avoid, such as specialized services or financial connections. That is why it is important to ensure investigation capability and decision-making speed by developing core parts in-house, while still operating under the premise of some reliance.