Translated by AI

The content below is an AI-generated translation. This is an experimental feature and may contain errors.

re:Invent 2025: Three Practical Use Cases for AWS Operations Automation Using Autonomous AI Agents by Mantel Group


Introduction

By transcribing various overseas sessions into Japanese articles, we aim to make valuable but hard-to-find information more accessible. The presentation featured in this installment, based on that concept, is this one!

For transcribed articles from re:Invent 2025, information is compiled in this Spreadsheet. Please check it as well.

📖 re:Invent 2025: AWS re:Invent 2025 - Reimagining AWS operations with autonomous AI agents (DEV207)

In this video, Geethika Guruge from Mantel Group introduces three AI agent use cases built in production. For compliance automation in a migration of over 500 servers, an agent reads criteria from Confluence and creates pull requests against Terraform code. In the modernization of an Amazon EKS environment, an agent monitoring CloudWatch logs improved incident resolution time by over 90%. For automating low-value EKS requests, a Slack-triggered agent handles namespace creation and quota increases, reducing support costs by 20%. Specific implementation challenges and lessons learned, such as fine-tuning system prompts, federated permission management, and the importance of observability, are shared, along with the recommendation to start with high-volume, low-value workflows.

https://www.youtube.com/watch?v=U7Nkuyt6X0M
※ This article is automatically generated while maintaining the original lecture content as much as possible. Please note that there may be typos or incorrect information.

Main Story

Thumbnail 0

The Value Proposition of Autonomous Cloud Operations: From Manual Work to AI Agents

Hello, everyone. And thank you for coming to Mandalay Bay. I know it's a Wednesday afternoon. Is it Wednesday? It is, right? Well, this is re:Invent. So, before we get started, how many of you have built AI agents in production? Anything running in production? Good, good. You guys can probably leave now then. But anyway, let's get started.

My name is Geethika Guruge. I work for a company called Mantel Group, an AWS consulting partner based out of Australia and New Zealand. I'm based out of New Zealand. Today, I'm going to tell you a couple of stories about agents we've built for a couple of our customers, the efficiencies we've gained from them, and, most importantly, the lessons learned.

Thumbnail 60

This is today's agenda. First, I'll spend a few minutes explaining why autonomous cloud operations are important and their value proposition. Then, I'll talk about the use cases I just mentioned. Of course, we'll also look at them from an architectural perspective. Because a presentation isn't complete without architecture diagrams, right? And as I said, most importantly, I'll share the challenges and lessons learned, and what you can do when you go home.

Thumbnail 90

Looking back at cloud operations today, and the last seven, ten, twelve years, we've constantly been migrating to the cloud. And every time we do that, and every time we modernize on the cloud, it means more services, more environments, and as a result, more logs, more support tickets, and so on. Also, compliance teams and security teams are becoming increasingly nervous, and there's considerable pressure on security and compliance. As you know, manual remediation takes time. And more than remediation, triage takes time. Once you triage and understand the root cause, the remediation itself isn't that difficult. But identifying the root cause takes time.

And tomorrow, in the future we envision, autonomous agents will exist. Operations will become autonomous, and AI will come to understand the cloud, code, and policies. In this case, policies mean compliance, security, and so on. Also, and most importantly, there's the aspect of human in the loop, as needed. At this point, I don't think there's any organization that's comfortable letting agents run wild, so this human in the loop becomes very important. Humans approve, oversee, and most importantly, innovate. Because being freed from the tedious tasks of manual remediation and triage allows them to tackle more innovative and enjoyable tasks.

Thumbnail 180

Three Practical Use Cases: Compliance Automation, Modernization Support, and EKS Support Efficiency

So, I'm going to talk about the three challenges we faced, which are the three use cases. The first one was about compliance. We were doing a large-scale migration of over 500 servers for an insurance client. They had a very strict compliance regime, among other constraints, and there were numerous compliance requirements that had to be met in the new environment. Each server being migrated had to be checked off against those requirements, and they were documented quite well on a Confluence wiki page. So, the compliance criteria were in Confluence, along with some other criteria. Anyway, what we follow when we do migrations is to treat all infrastructure as code, which in this case was Terraform.

Thumbnail 230

So our solution was that after we migrated a server or spun up an environment, we would trigger an agent. That agent is exposed via an API, so our internal migration tool calls this agent at the end of its workflow. The agent goes and reads the compliance document from that wiki, from Confluence, and then assesses the server or the environment against that document. If it finds gaps, first, for traceability, it creates a Jira ticket, and most importantly, it goes into the code repository, creates a branch, makes the changes it deems appropriate, and raises a pull request. Then an engineer comes in and reviews the pull request; in most cases some fixes or changes will still be made, but that initial phase is done.
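As a rough sketch, the assessment step described above can be thought of as diffing a migrated server's effective configuration against the wiki-sourced criteria; any unmet criterion becomes a gap that would drive the Jira ticket and the Terraform pull request. All names and fields below are illustrative assumptions, not Mantel Group's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class ComplianceResult:
    server: str
    gaps: list = field(default_factory=list)

    @property
    def compliant(self) -> bool:
        return not self.gaps

def assess_compliance(server: str, config: dict, criteria: dict) -> ComplianceResult:
    """Compare a server's configuration against criteria read from the wiki."""
    result = ComplianceResult(server=server)
    for key, required in criteria.items():
        if config.get(key) != required:
            result.gaps.append(f"{key}: expected {required!r}, got {config.get(key)!r}")
    return result

# One criterion unmet: in the real system, this is where the agent would
# raise a Jira ticket and open a pull request with the proposed Terraform fix.
criteria = {"encryption_at_rest": True, "detailed_monitoring": True}
result = assess_compliance("srv-042", {"encryption_at_rest": True}, criteria)
print(result.compliant)  # False: detailed_monitoring is missing
```

The point of the sketch is the shape of the output: a per-server list of gaps gives the downstream PR agent something concrete and traceable to act on.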

Thumbnail 290

And of course, the benefit of this is automated enforcement of compliance at scale. As I mentioned, this is over 500 servers, numerous environments, and imagine a human having to read that document and check off each server every time they migrate a server or spin up an environment. That would be a tedious task. So now, the agent is doing that tedious task for us. This, of course, made migrations faster.

More importantly, because it creates a Jira ticket and everything goes through a pull request process, everything is documented, and that traceability is ensured.

Thumbnail 330

Next is a modernization project. When migrating, it's not just a lift and shift; we do a lot of modernization. In this case, we were containerizing a monolithic Java application and moving it into an Amazon EKS environment. Most of the errors we encountered, and most of the errors we traditionally encounter in these kinds of projects, are configuration issues or security issues. The biggest problem is that when doing this kind of modernization, the SMEs are already bottlenecked with other tasks, yet we have to go to them to understand what's happening with the application.

But if you think about it, looking back, this is an application that's already running. So, fixing it shouldn't be such a difficult task. In many cases, the problem is some kind of misconfiguration or security permission, such as IAM permissions. However, it takes too long to triage this manually, and the SMEs are a bottleneck. That's another aspect.

Thumbnail 400

So what we did was deploy an agent here as well. This agent monitors CloudWatch logs; in this case, all logs were sent to CloudWatch, but it can monitor any logs. Every time a test case runs and an error occurs, the agent is triggered, identifies the root cause, and finally creates a Jira ticket and raises a pull request. That initial triage and the first pull request are done, and then a human can come in and do the rest of the work.
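The trigger side of this can be sketched as a small gate: scan the log lines from a failed run and only invoke the agent when an actionable error pattern (the configuration or permission issues mentioned above) is found. The patterns and function names here are purely illustrative:

```python
import re

# Hypothetical error patterns; a real deployment would tune these to its stack.
ACTIONABLE = {
    "permission": re.compile(r"AccessDenied|is not authorized to perform"),
    "config": re.compile(r"Connection refused|UnknownHostException|missing required (property|env)"),
}

def classify_error(log_lines):
    """Return (category, matching line) for the first actionable error, else None."""
    for line in log_lines:
        for category, pattern in ACTIONABLE.items():
            if pattern.search(line):
                return category, line
    return None

logs = [
    "INFO  starting integration test suite",
    "ERROR User: arn:aws:iam::123:role/app is not authorized to perform: s3:GetObject",
]
match = classify_error(logs)
print(match[0])  # permission -> trigger the agent for triage + pull request
```

Filtering like this keeps the agent from being invoked on every noisy log line, which matters for both cost and signal quality.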

Thumbnail 440

This actually improved incident resolution time by over 90%. Because now, most of these small issues are handled by agents, and most of these pull requests didn't have any further commits. They were just approved. The dependency on SMEs was reduced, and that was the biggest aspect. Because now, we don't have to depend on SMEs. They can do their actual jobs, and we only go to them when there's a real problem. Of course, this is a continuous improvement task because we integrated it with the CI/CD pipeline. So now, this agent continues to run not only for migration but also when applications are already modernized and running in production. Every time they make the next check-in, if there's any error, any test error, this agent will kick in and raise a pull request.

Thumbnail 490

And finally, I'm going to talk about automating low-value requests in EKS. After this customer went live, as I mentioned earlier, this was a brand new EKS environment for them, so they had to set up a platform team. They already had a DevOps team, but DevOps is not platform engineering. Well, that's another story. Anyway, they built this brand new platform team, but what they noticed was a very high volume of support tickets. Because now, application developers were requesting quite a bit of support from the platform team for some low-value tasks. When I say low-value, this could be creating a new namespace or increasing a quota within a namespace.

Of course, application teams need to rely on the platform team to complete these tasks, but this is not the best use of the platform team's time. And of course, there were long waiting times and reduced user satisfaction because it was taking too long for the platform team to complete these small things.

Thumbnail 550

So what we did here was have this agent triggered from Slack: a Slack message is sent to the agent and kicks it off. The agent has access to logs and also, through an MCP server, has access to Confluence, any necessary documentation, and the compliance items to follow. It then makes a simple fix and raises a pull request. One of the interesting things here is that we used the EKS MCP server to look at the EKS cluster and understand what was happening, but we used it in read-only mode, because we didn't want to apply changes directly there. We wanted changes to go through infrastructure as code, so the agent looks at the cluster, validates against the request, and then applies the change to the repository.
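The read-only pattern above can be sketched as follows: the handler inspects cluster state (a plain set here stands in for a read-only EKS MCP lookup) and emits a desired change for the IaC repository rather than applying it to the cluster. The request grammar and return shape are illustrative assumptions:

```python
import re

def handle_slack_request(text: str, existing_namespaces: set) -> dict:
    """Validate a low-value request against cluster state, then describe an IaC change."""
    match = re.match(r"create namespace (\S+)", text.strip().lower())
    if not match:
        return {"status": "unsupported", "reason": "request not recognized"}
    ns = match.group(1)
    if ns in existing_namespaces:
        return {"status": "rejected", "reason": f"namespace {ns} already exists"}
    # Never apply directly to the cluster: describe the change for a Terraform PR.
    return {"status": "pull_request", "change": {"kind": "namespace", "name": ns}}

print(handle_slack_request("create namespace team-billing", {"default", "kube-system"}))
```

Keeping the cluster access read-only and routing every change through a pull request is what lets the platform team stay in the approval loop.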

Thumbnail 610

Again, this reduced support costs by 20% because the support team no longer had to perform all these tasks. They just had to review the pull request, make a small change, approve it, and push it. Of course, this also speeds up resolution, because the work up to the pull request stage is completed within minutes, within five minutes. Initially, we were running this on Lambda, before Amazon Bedrock AgentCore became generally available, and we had set a Lambda timeout. Lambda's maximum timeout is 15 minutes, so we set ours within that, but in reality, most things were getting done within five minutes.

Thumbnail 670

Architectural Design and Implementation Lessons: From System Prompt Fine-Tuning to Gaining Business Trust

Again, this is scalable. Because now, more and more applications are coming into the EKS cluster, and more and more application teams are relying on the platform team. But the platform team doesn't need to scale even if the number of applications scales, because the agent scales, and the platform team just approves pull requests. So, if we look at the architecture here, there are two input paths. One is through Slack, but it doesn't have to be Slack. As you can see, it's an API Gateway, so you're basically calling an API. The other path is through CloudWatch, and again, it doesn't have to be CloudWatch. It can be any log or any observability platform. Basically, it's about triggering an alert that contains all the error information.

Once it goes through the CloudWatch or API Gateway path, it enters Lambda. What we do in this Lambda is create an agent runtime and invoke that runtime. Since we are using multiple agents here, we are using a swarm pattern with an orchestrator. What happens is the orchestrator receives a request from Lambda and decides which agent to invoke. We have multiple specialist agents: Confluence, Jira, PR, and an AWS agent. The PR agent is responsible for creating branches, executing pull requests, and all that stuff. The AWS agent has all the tools to go into the AWS account and inspect the AWS account. As I mentioned earlier, it's like an EKS MCP server. This is the high-level architecture.
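The swarm-with-orchestrator pattern above can be reduced to a routing decision: the orchestrator inspects the incoming request and picks which specialist agents to involve. The agent names and the keyword heuristic below are purely illustrative, not the production routing logic:

```python
def route(request: dict) -> list:
    """Decide which specialist agents the orchestrator should invoke."""
    agents = []
    if request.get("source") == "cloudwatch":
        agents.append("aws_agent")         # inspect the AWS account / EKS cluster
    if "compliance" in request.get("intent", ""):
        agents.append("confluence_agent")  # fetch criteria from the wiki
    agents.append("jira_agent")            # traceability ticket
    agents.append("pr_agent")              # branch + pull request
    return agents

print(route({"source": "cloudwatch", "intent": "compliance check"}))
```

In the real system this decision sits inside the orchestrator agent invoked from Lambda; the sketch only shows the shape of the fan-out.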

Then there's AgentCore memory. Every time a request comes in and a root cause is identified, that outcome is stored in memory. So the next time a similar request comes in, the agent knows what worked well before and can do a better job based on that. And of course, there's agent observability, covering everything that happens here: OpenTelemetry logs are sent to CloudWatch. There are third-party tools, but we found that plain CloudWatch was sufficient. It's not great for log display, but if you're willing to spend time in the logs, you can track everything.
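The memory behaviour described above can be sketched as a store keyed by a normalized error signature, so that a similar error later retrieves the fix that worked before. The class name and the normalization rule (strip digits, lowercase) are illustrative assumptions:

```python
import re

class RootCauseMemory:
    """Toy stand-in for agent memory: remember fixes by error signature."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def signature(error: str) -> str:
        # Strip volatile details (ids, counters) so similar errors collide.
        return re.sub(r"\d+", "N", error.lower())

    def remember(self, error: str, fix: str):
        self._store[self.signature(error)] = fix

    def recall(self, error: str):
        return self._store.get(self.signature(error))

memory = RootCauseMemory()
memory.remember("Pod app-1234 OOMKilled", "raise memory limit in deployment.tf")
print(memory.recall("Pod app-9876 OOMKilled"))  # prior fix is found
```

The normalization step is what turns "the agent has seen this exact error" into "the agent has seen something like this", which is the useful property here.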

Thumbnail 790

I think this is the most important slide in this entire presentation: challenges and lessons learned. The biggest challenge we faced was fine-tuning the system prompts. As I mentioned, we had multiple agents, and each agent was given its specialization, or persona, through a system prompt. This is a prompt that tells it, "You are an expert in this field, you can do this and this, and these tools are available to you." What we did was create an initial system prompt and run the system; naturally, we didn't get the desired results, so we iterated on the prompt. Fine-tuning system prompts is not really a science; it's a fine line between art and science.
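Assembling such a persona prompt can be sketched as a small builder that combines the specialization, the available tools, and explicit guardrail rules. The wording and rule list below are illustrative assumptions, not the actual production prompts:

```python
# Hypothetical guardrails of the kind a team might bake into every agent's prompt.
GUARDRAILS = [
    "Never merge or push directly to master; always open a pull request.",
    "Only modify files relevant to the reported issue.",
]

def build_system_prompt(persona: str, tools: list) -> str:
    """Compose a specialist agent's system prompt from persona, tools, and rules."""
    lines = [f"You are {persona}."]
    lines.append("Available tools: " + ", ".join(tools) + ".")
    lines.extend(f"RULE: {rule}" for rule in GUARDRAILS)
    return "\n".join(lines)

prompt = build_system_prompt(
    "an expert Terraform engineer fixing compliance gaps",
    ["create_branch", "open_pull_request", "read_confluence"],
)
print(prompt)
```

Keeping guardrails in one shared list makes the iteration loop cheaper: a lesson learned on one agent is applied to all of them at once.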

Some of the challenges or some interesting aspects we discovered were that at one point we had a pull request come in regarding a security permission issue. It was a test case. Every time we ran this, we would manually remove permissions to generate an error, which would then trigger the agent to fix the error. But what happened after the third or fourth time was that the agent directly applied the change to master. Looking at the logs, it said, "I see five branches created for the same issue, and they're not getting merged to master. I've decided that there's an issue with merging pull requests, so I'm applying it to master."

You have to guard against things like that. Of course, you can guard against it in the repository itself, but in the prompt, you also have to say, "Never, ever merge to master," because otherwise, the agent can become too smart, or try to be too smart. Things like that. And then another angle was when we implemented these bug fix and error fix agents.

We only wanted it to fix that specific error. Because, obviously, these are existing code repositories with huge technical debt. But the pull request agent was trying to fix everything across the entire codebase. When we tried to tell the agent to focus only on this specific issue, the agent got confused. Because the orchestrator agent was asking the pull request agent to fix something. So, we had to fine-tune all of this.

What we ended up doing was creating two paths for the orchestrator agent based on where the request was coming from. Based on the request pattern, it identifies whether this is a surgical pull request path or an improvement task, and that flag is passed to the pull request agent. At that point, the pull request agent knows whether it should make a surgical fix to resolve this specific bug, or whether it has the keys to paradise and can improve the entire repository. System prompt fine-tuning is an art, and it will probably become a science as people gain more knowledge, but it was interesting.
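The two-path decision above can be sketched as a mode flag derived from the trigger, which the PR agent then uses to scope its changes. The trigger names are illustrative assumptions:

```python
def pr_mode(trigger: str) -> str:
    """Error alerts get a surgical, single-issue fix; other requests may improve more broadly."""
    return "surgical" if trigger in {"cloudwatch_alert", "test_failure"} else "improvement"

def pr_scope(mode: str, error_files: list, all_files: list) -> list:
    """Limit the files the PR agent may touch according to the mode flag."""
    return error_files if mode == "surgical" else all_files

mode = pr_mode("test_failure")
print(mode, pr_scope(mode, ["auth.tf"], ["auth.tf", "vpc.tf", "eks.tf"]))
```

Passing an explicit flag, rather than asking the agent to infer intent from context, is what stopped the PR agent from getting confused between the orchestrator's instructions.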

Of course, federated permissions were important, because we're using multiple MCP servers, and most of these had to pass through security. For example, GitLab and GitHub have their own security, and Confluence and Jira require credentials. We also needed to understand whether the end user who triggered the request actually had enough permissions to perform the action. In our case, for these three use cases, since everything ended at the pull request stage, it wasn't too big of an issue. We created system users for these MCP servers, so every time a pull request or a Jira ticket is created, it runs as that agent's system user.

Gaining trust from the business, interestingly, wasn't as hard as we thought. But I think that's because we chose high-volume, low-risk tasks. The business was happy to trust us with those, and after building them, that trust gradually grew. For monitoring and observability, as I mentioned, agent observability tooling is very important, because you need it to understand what's happening, especially while fine-tuning the prompts.

Even afterward, you need it to understand how much it costs and which responses were correct or incorrect. I believe evals within agents were also announced yesterday, which lets you understand how many requests actually evaluated successfully and led to a successful outcome. Monitoring and observability, as in any other IT system, are extremely important and must be first-class citizens.

Thumbnail 1090

Final point: if you want to build agents, these are the lessons learned, and this is what we did. Identify high-volume, low-value workflows. High volume means high impact: if you automate that, the business will be on your side. And low value is useful too, because the business then considers it low risk; even if it fails, it's not a big deal.

Then, focus on compliance, troubleshooting, and support. Because if you think about it, these are already well-documented processes. In any organization, these things are already well-documented and are very streamlined workflows. You already know what to do, so it's very easy to start there, and since you already know the workflow, you can start with a very good system. Then, gather feedback and scale to more complex workflows. Because by iterating, you understand what's happening, and you can leverage that feedback loop to improve. That's all.

Thumbnail 1160

Finally, I want to say that AI agents are not going to replace you; they are not going to replace humans. They amplify your impact. That's how I see these autonomous agents. Thank you.


  • This article was automatically generated using Amazon Bedrock, maintaining the original video's information as much as possible. Please note that there may be typos or incorrect information.
