Reflections on "I Don't Understand Platform Engineering"
Yesterday, I read an article titled "I don't understand Platform Engineering".
Recently, I have been listening to talks at the online Platform Engineering Meetup, reading the CNCF "Platforms White Paper", writing up my thoughts in a Zenn entry, and discussing Platform Engineering with colleagues at internal study sessions. As a result, my picture of what Platform Engineering might be is clearer than when I started.
So, having read that article, I am writing this entry to share my opinions based on my current understanding. Platform Engineering is not my primary role, so my understanding may well be incomplete or mistaken in places. On the other hand, having almost no vested interest in the topic may help keep bias out.
Why is Platform Engineering necessary?
Platform Engineering is a supplementary concept that emerged to address the "side effects (challenges)" that arise as an organization's IT environment becomes cloud-native and as modern development and operations practices and cultures such as DevOps and SRE take hold. Whether Platform Engineering is necessary therefore depends on how much those side effects (challenges) apply to you. Put the other way around, an organization where the side effects are severe enough to require Platform Engineering can be said to be highly mature as a cloud-native organization.
In organizations where DevOps and SRE cultures have taken root, management's central idea is to prioritize the speed and frequency of feature releases in order to respond to market changes and beat competitors, while still emphasizing stable system operation. For such organizations, cost remains an important factor, but differentiation through development speed and technology is the major concern.
Before DevOps and SRE, application developers and infrastructure staff were completely separate, and their respective performance metrics, feature releases versus stable operation, were in conflict. Organizations that prioritize responding quickly to market changes addressed this by forming DevOps teams as "One Teams" built on information sharing and mutual respect, and by introducing SRE concepts such as error budgets, so that they could increase development speed and ship feature enhancements while still valuing system uptime.
Around this time, the technology itself also changed significantly: virtualization, containers, and IaC advanced, and managed services came into wide use as public clouds spread. It used to be common to start by procuring physical hardware and having it delivered to a data center, but IT systems can now be built and managed in a "Software Defined" manner across a vast number of areas. Because no physical work is involved, the range of things software engineers can handle keeps growing.
For software engineers in DevOps and SRE teams, being able to handle more and more areas in a Software Defined environment, and to freely select and use the best tools within their own team, is a major advantage for tailoring the development environment to the characteristics of each project. On the other hand, where they once only needed to focus on application code and CI/CD pipelines, their responsibilities have expanded to include provisioning container orchestration environments and integrating observability and security tools. Choosing the best tools for the team from many options and configuring them on their own is a burden for software engineers, and it is approaching the limit of what an individual can handle. This burden is called "Cognitive Load".
Platform Engineering was proposed as an approach to two questions: how to reduce the "Cognitive Load" of software engineers, who are expected to write application code and contribute to feature releases, and how to achieve the "improvement of development speed" that is the organization's central concern. These challenges are felt most acutely in organizations where DevOps and SRE methods and cultures are thoroughly applied and where increasing development speed is considered more important than anything else for business growth. Without this awareness of the problem, I don't think one can fully appreciate why Platform Engineering is necessary, and attempting to apply it in that state would likely end in failure.
What is Platform Engineering?
The CNCF "Platforms White Paper" provides a concise summary, so I will omit the details in this entry. I have written up my impressions of the white paper in a separate entry.
Answers to Questions from "I don't understand Platform Engineering"
Is a developer portal necessary?
"In platform engineering, a developer portal seems like a panacea, but is this correct? Or couldn't tools like Jira, GitHub, or existing CI/CD tools fill the role of such a developer portal?"
A developer portal is a tool that gives users (such as DevOps teams) a consistent self-service interface within the platform provided by the Platform Engineering team. Among CNCF projects, Backstage is a representative example.
According to the Internal Developer Platform community, introducing a developer portal product is worth considering when you need flexible access control or cost management capabilities; otherwise, using existing tools such as Heroku is more reasonable, and a dedicated developer portal product may be overkill. Furthermore, for small organizations whose DevOps teams of 15 or fewer people are functioning well with existing members, or when the IT system environment is monolithic or uses only a single public cloud, there is little need to go out of the way to introduce one.
So a developer portal is not necessary in every case. The "Platforms White Paper" mentioned above also notes that the simplest platform is a Wiki page with links to standard deployment procedures, so introducing a portal is not mandatory. However, GitHub and Jira are essentially a source code repository and an issue tracker; they are not products optimized for controlling which functions of the platform are provided to whom. As the number of platform users grows, having the Platform Engineering team respond to each request individually becomes a scalability constraint, so I believe a stage will inevitably come where it is rational to introduce a dedicated developer portal product that users can operate in a self-service manner.
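To make "controlling what is provided to whom" a little more concrete, here is a minimal sketch in Python of a self-service catalog. It is only my own illustration, not how Backstage or any real portal product works: the `Capability` and `Portal` classes, the team names, and the capability names are all hypothetical. The point is simply that the rules live in the catalog, so nobody on the Platform Engineering team has to answer each request by hand.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Capability:
    """One function the platform offers (hypothetical example data)."""
    name: str
    allowed_teams: frozenset[str]


@dataclass
class Portal:
    """Toy self-service catalog: requests are decided by rules, not by a person."""
    catalog: dict[str, Capability] = field(default_factory=dict)

    def register(self, capability: Capability) -> None:
        # The platform team curates what is offered and to whom.
        self.catalog[capability.name] = capability

    def available_to(self, team: str) -> list[str]:
        """List the capabilities a team may request, sorted by name."""
        return sorted(
            c.name for c in self.catalog.values() if team in c.allowed_teams
        )

    def request(self, team: str, name: str) -> bool:
        """Return True if the catalog rules grant this request."""
        capability = self.catalog.get(name)
        return capability is not None and team in capability.allowed_teams


portal = Portal()
portal.register(Capability("container-namespace", frozenset({"team-a", "team-b"})))
portal.register(Capability("managed-database", frozenset({"team-a"})))

print(portal.available_to("team-b"))                  # ['container-namespace']
print(portal.request("team-b", "managed-database"))   # False
```

An issue tracker can record the same request, but someone still has to read it and decide; in this sketch the decision is made by the catalog itself, which is the part that scales as the number of users grows.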
How is it different from traditional common infrastructure?
"I can't think of any difference between platform engineering and existing methods other than the developer portal."
The "existing methods" here refer to "traditional common infrastructure," and the main differences lie in "what to prioritize" and the "philosophy of standardization," as described in APC's slides. Traditional common infrastructure focuses on "quality, governance, cost, and efficiency" and aims for "top-down efficiency." In contrast, a "Platform" aims for "improving developer experience and release frequency while ensuring reliability" and is characterized by being "selectively configurable by developers."
The clearest expression of this difference in character is that "developers are not forced to use the functions provided by the Platform." In organizations with common infrastructure, use of that infrastructure is generally imposed as a requirement in order to recover the investment. A team that does not use it must explain a "rational reason for not using it" and obtain approval for an exception. A Platform, by contrast, does not have to be used, so developers will not use it unless it benefits them. For a Platform Engineering team, a Platform with no users means the team has no reason to exist, so it works to improve the Platform and offer functions that will actually be used. This is a Product Management (PdM) mindset, and in Platform Engineering this characteristic is called "Platform as a Product."
This is not a matter of whether "traditional common infrastructure" or a "Platform" is better. Because "traditional common infrastructure" is semi-mandatory, developers may have to use it even when it does not quite match an individual development team's requirements, which can impair the developer experience. I don't think its operators ignore user opinions just because use is semi-mandatory, but a cost-oriented approach that prioritizes recovering the investment at low operating cost may leave too little budget for continuous improvement. On the other hand, being able to count on a certain number of users makes it easier to draw up investment plans and make the business decision to introduce it. Its semi-mandatory nature also brings a major advantage: tools can be selected for global optimization, and the consistency and governance of IT systems can be ensured. In that sense, the "central kitchen" example mentioned in the article is, as I understand it, a typical example of the top-down thinking of common infrastructure rather than of Platform Engineering.
A Platform, on the other hand, has no guaranteed usage, so it must keep offering something easy to use that is based on the needs of its users (developers). If it succeeds, the developer experience may improve, but it is difficult to plan in advance whether the cost of the Platform Engineering team can be sufficiently recovered. Also, because it is unclear whether the features will be used, the standardization of development environments that common infrastructure is expected to promote is not guaranteed. Furthermore, the "improvement of developer experience and release frequency" that the Platform aims for is an indirect value, realized only through the success of its users (the development teams). Gartner points out this difficulty: a Platform Engineering team must prove its value to internal executives in order to be recognized.
Normally, executives try to strictly manage the ROI of investments, so I think the hurdle for making a business decision to introduce Platform Engineering is higher than for common infrastructure. This is why I mentioned earlier that introducing Platform Engineering is difficult in organizations that do not have a visceral sense of the problem. Securing an executive sponsor is a key point for introducing Platform Engineering, and I believe this is the reason why the "Platforms White Paper" is written to be used for persuading corporate leaders.
What does the Platform Engineering team do?
"The role of a platform team that boasts 'one person can cover up to 100 developers' is also a mystery. In the first place, is the platform team in a position to operate a developer portal? Can tasks that were traditionally handled manually by infrastructure teams be replaced so easily by tools? Creating an environment to improve the developer experience would likely require a significant amount of skill and popularity, wouldn't it?"
The role of the Platform Engineering team is to act to "reduce cognitive load" for the development teams that use the Platform. Its job is to interview users to identify the problems they face and to offer the minimum measures needed to solve or mitigate them.
As the "Platforms White Paper" states explicitly, a Platform does not aim to operate or provide systems the way traditional common infrastructure does. Rather, the recommendation is to use third-party managed services as much as possible and for the Platform Engineering team to avoid owning systems itself wherever it can. If a developer portal is introduced, it may need to be operated, but beyond that, one possible form is a model that focuses on curation without owning other systems. Since the Platform Engineering team does not own physical systems, it should be able to coexist with common infrastructure. It might even be possible to set up a Platform Engineering team as a bridge to using common infrastructure.
Will Platform Engineering solve the talent shortage?
"I understand that platform engineering was proposed on the premise that general companies cannot hire 'ironman'-like full-stack engineers... (omitted) I didn't understand how platform engineering acts on the issue of engineer shortage: whether it improves quality or allows for fewer people."
Platform Engineering promotes a division of labor that lets development teams focus on development, so full-stack proficiency is no longer required (that is, being able to develop is enough). As the required skill area narrows, it becomes easier than before to hire and train IT engineers, and quality can also be expected to improve.
As mentioned above, as the Software Defined world expands, IT engineers have come to need a wide range of skills: not only code development and CI/CD pipeline configuration but also container environment provisioning and monitoring and security configuration. In effect, they are required to be full-stack. However, if engineers select from the tools the Platform has chosen and follow its standard procedures, such infrastructure-related skills are not essential, which creates room for people who can only write code to play an active role.
In this case, the Platform Engineering team will need IT engineers who specialize in the infrastructure domain, but those engineers do not necessarily need to be able to develop applications. Also, because the functions provided by the Platform are shared across multiple DevOps teams, consolidating those personnel is more efficient than placing IT engineers who can handle those functions in each team.
Seen this way, Platform Engineering looks like a structure in which application development teams and system operation teams, once consolidated into DevOps teams, are split again into application development teams and Platform Engineering teams. Of course, this is not simply a return to the previous state; it is a division of labor in a different form, one that takes into account the challenges that came with consolidation and responds to the evolution of the IT system environment. In general, when the unit of work becomes smaller, more people become able to do it, so I think Platform Engineering helps with the talent shortage to some extent, in that it does not demand a wide range of advanced skills.
When I talked with my colleagues about Platform Engineering, we also discussed how, for many companies and organizations, simply reaching the stage where it becomes necessary (that is, advancing cloud-native adoption far enough for the problems Platform Engineering addresses to emerge) is quite a challenge in itself, and that we need to keep working at it.
While keeping in mind that challenges requiring Platform Engineering lie beyond the promotion of cloud-native, DevOps, and SRE, I believe it is important first to steadily modernize one's own IT environment and cultivate the culture. After that, it is important to check whether you have truly reached the stage where Platform Engineering is necessary and, if so, to consider introducing it.
Update History
2025/7/21 Changed the article title to "Thoughts on..."