Translated by AI
Automating Incident Response with Azure MCP Server
Recently, automating incident response through AI has garnered significant attention. For instance, Microsoft aims to improve the efficiency of incident response through automatic ticket resolution, automated triage, and automated report generation [1]. These efforts are referred to as AIOps (Artificial Intelligence for IT Operations) and are actively discussed and applied in both academia and industry.
Originally, utilizing AI required a vast amount of preparation and effort, such as training machine learning models and collecting/analyzing data. However, with the emergence of pre-trained language models (LLMs/Agents) with excellent reasoning capabilities, it has become much easier to experiment with AI. In particular, mechanisms like the Model Context Protocol (MCP), which allows agents to call external tools, have been established, making external integration even more straightforward.
This article introduces how to improve the efficiency of incident response on Azure by utilizing the MCP tool (Azure MCP Server) for operating Azure resources.
What is Azure MCP Server?
Azure MCP Server is an MCP server for executing operations on Azure resources (REST API calls to Azure Resource Manager).
- Azure/azure-mcp: The Azure MCP Server, bringing the power of Azure to your agents.
- What is Azure MCP Server (Preview)? - Azure MCP Server | Microsoft Learn
For example, tools are available to perform operations such as (List):
- Retrieving data from Cosmos DB
- Operating on Azure Storage Blobs
- Executing queries on Azure SQL Database
Many operations are not yet available as dedicated tools; in those cases you can use the azmcp-extension-az tool, which executes Azure CLI commands via the MCP Server. However, because the LLM/agent has to compose the command arguments itself, execution errors may occur. To reduce errors, it is recommended to prepare command samples in advance.
Excerpts of tools that could be used in the context of incident response include the following:
| Tool | Description | Use Case |
|---|---|---|
| azmcp-extension-az | Executes Azure CLI commands | Executing operations not supported by the MCP Server (e.g., retrieving platform metrics) |
| azmcp-kusto-query | Executes Azure Data Explorer queries | Retrieving logs stored in Data Explorer |
| azmcp-log-query | Executes Log Analytics workspace queries | Retrieving logs stored in Log Analytics |
In this automated incident response example, we will use the azmcp-log-query tool to retrieve logs stored in a Log Analytics workspace and investigate the cause of the problem according to a troubleshooting guide.
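For reference, a tool invocation reaches the server as an MCP `tools/call` request (JSON-RPC 2.0), which the agent's client normally builds for you. The following Python sketch constructs such a request for azmcp-log-query. The argument names (`workspace`, `query`, `hours`) and the workspace value are assumptions for illustration; the actual parameters are defined by the tool schema that the server advertises.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` JSON-RPC 2.0 request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical arguments: check the tool's advertised schema for the real names.
request = build_tool_call(
    1,
    "azmcp-log-query",
    {
        "workspace": "my-workspace",  # placeholder workspace name
        "query": "AzureDiagnostics | take 10",
        "hours": 24,
    },
)

print(json.dumps(request, indent=2))
```

Seeing the request shape makes it clearer why static values such as the workspace are worth writing down in advance: the agent has to fill every argument before the call can succeed.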
Authentication Methods
To call REST APIs, Azure MCP Server requires an Entra ID token. Azure MCP Server can obtain tokens through various methods, but the following authentication methods are likely the most representative:
- Environment Variables: A method that references environment variables (`AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, `AZURE_CLIENT_SECRET`). Used when running the MCP server on a server, such as in a CI/CD pipeline. (Reference)
- Managed Identity: A method using the Managed Identity assigned to an Azure resource. Used when running the MCP server on Azure resources like Azure VMs or App Service. (Reference)
- Azure CLI: A method using credentials from the Azure CLI (`az` command). Used when testing the MCP server in a local environment. (Reference)
- Azure PowerShell: A method using credentials from Azure PowerShell. Used when testing the MCP server in a local environment. (Reference)
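Before starting the server with the environment variable method, it can help to confirm that all three variables are present. The Python sketch below only inspects the environment; it does not reproduce how Azure MCP Server actually selects a credential, and the function names are hypothetical.

```python
import os

# The three variables named above for the environment variable method.
SERVICE_PRINCIPAL_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")


def missing_service_principal_vars(env=None):
    """Return the names of the variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in SERVICE_PRINCIPAL_VARS if not env.get(name)]


def describe_auth_setup(env=None):
    """Summarize whether the environment variable method looks usable."""
    missing = missing_service_principal_vars(env)
    if not missing:
        return "environment variables are complete"
    if len(missing) == len(SERVICE_PRINCIPAL_VARS):
        return "no service principal variables set; another method is needed"
    return "incomplete settings, missing: " + ", ".join(missing)


print(describe_auth_setup())
```

A partially filled set of variables is reported separately from an empty one, because it usually points to a configuration mistake rather than a deliberate choice of another method.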
For more details, please refer to the following:
Setup Methods
As of June 2026, Azure MCP Server supports stdio transport and SSE. However, since SSE is deprecated, it is better to adopt stdio.
There are two main setup methods:
- Node.js: A method of starting the Azure MCP Server using `npx` or `npm`. A Node.js runtime environment is required.
- Container: A method of starting the Azure MCP Server as a container. A container runtime such as Docker or containerd must be installed.
For example, when running Azure MCP Server with Node.js in VS Code (GitHub Copilot), configure it as follows:
{
  "mcp": {
    "servers": {
      "azure-mcp-server": {
        "command": "npx",
        "args": [
          "-y",
          "@azure/mcp@latest",
          "server",
          "start"
          // "--service",
          // "monitor"
        ]
      }
    }
  }
}
The --service option is used to specify a specific service when starting the MCP Server, but it can usually be omitted. If only the azmcp-log-query tool is needed, you can limit the tools by specifying --service monitor.
When running Azure MCP Server in a Docker container, configure it as follows. The reason for specifying the .env file is that it uses the environment variable authentication method. For details, please refer to the Setup Guide in the official documentation.
{
  "mcp": {
    "servers": {
      "azure-mcp-server": {
        "command": "docker",
        "args": [
          "run",
          "-i",
          "--rm",
          "--env-file",
          "/full/path/to/.env",
          "azure/azuremcp"
        ]
      }
    }
  }
}
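If the same settings are needed in several environments, they can be generated instead of edited by hand. The following Python sketch emits both variants using only the commands and arguments shown above; the helper names (`npx_server`, `docker_server`, `mcp_settings`) are hypothetical.

```python
import json


def npx_server(service=None):
    """Server entry for the Node.js method; `service` adds --service if given."""
    args = ["-y", "@azure/mcp@latest", "server", "start"]
    if service:
        args += ["--service", service]
    return {"command": "npx", "args": args}


def docker_server(env_file):
    """Server entry for the container method, reading credentials from env_file."""
    return {
        "command": "docker",
        "args": ["run", "-i", "--rm", "--env-file", env_file, "azure/azuremcp"],
    }


def mcp_settings(server, name="azure-mcp-server"):
    """Wrap a server entry in the settings structure used above."""
    return {"mcp": {"servers": {name: server}}}


# Limit the tools to the monitor service, as described for azmcp-log-query.
print(json.dumps(mcp_settings(npx_server(service="monitor")), indent=2))
```

Generating the file also avoids the trailing-comma mistake that is easy to make when uncommenting the `--service` lines manually.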
Flow of Automated Incident Response
In this article, I propose an incident response flow using Azure MCP Server as follows:

In this flow, there are two things the incident responder must perform:
1. Write a troubleshooting guide: Describe in natural language the steps the agent should follow to investigate logs. For example, organize and describe the KQL query templates to issue and how to interpret the logs. Including static information (e.g., the target Log Analytics workspace) is also recommended, since it keeps the runtime prompt short. Keep the guide as concise as possible, with only the necessary and sufficient information; otherwise the agent may not interpret it correctly.
2. Request troubleshooting execution: Ask an LLM application (e.g., Claude Desktop, VS Code / GitHub Copilot) to execute troubleshooting. For example, "Please investigate the cause of the failure that occurred around YYYY-MM-DD hh:mm." At this point, the application needs to refer to the troubleshooting guide. You can provide external links or content as context, or include the entire troubleshooting guide in the prompt.
When Step 2 is executed, in many cases, the application will attempt to run the Azure MCP Server log search tool after confirming the information required for execution. If information is missing, it will prompt for confirmation, such as "What is the ID for...?" (behavior depends on the model and agent implementation). Once the necessary information is available, it will retrieve logs according to the troubleshooting guide and analyze the results.
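When the whole guide is included in the prompt, the request can be assembled mechanically. The following is a minimal Python sketch; the function name `build_request` and the wording of the prompt are my own assumptions, not a format required by any particular application.

```python
def build_request(guide_markdown, reported_time_utc, symptom):
    """Combine the incident details and the full guide into a single prompt."""
    return (
        "Investigate the incident below by following the troubleshooting guide.\n\n"
        f"Incident: {symptom}\n"
        f"Reported time (UTC, may be inaccurate): {reported_time_utc}\n\n"
        "--- Troubleshooting guide ---\n"
        f"{guide_markdown.strip()}\n"
    )


# Placeholder guide text; in practice this would be read from the guide file.
guide = "# TSG: S2S VPN Connection Disconnected\n...guide body...\n"
prompt = build_request(guide, "2025-06-18 07:20", "S2S VPN tunnel disconnected")
print(prompt)
```

Stating that the reported time may be inaccurate is a deliberate choice here, since it invites the agent to search a wider window than the report suggests.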
How to Write a Troubleshooting Guide
As mentioned earlier, a troubleshooting guide is a document describing the procedures for an agent to perform incident response. Specifically, you describe content such as the following:
Example of a troubleshooting guide
# TSG: S2S VPN Connection Disconnected
This troubleshooting guide describes how to diagnose and analyze the cause when an S2S VPN connection is disconnected.
## Environmental Information
### VPN Gateway
- ResourceId: /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-azure-mcp/providers/Microsoft.Network/virtualNetworkGateways/vpngw-azure
### Connection Destination
1. conn-to-onprem
- Remote IP: x.x.x.x
- IP Addresses: 192.168.0.0/16
### Log Analytics Workspace
- SubscriptionId: 00000000-0000-0000-0000-000000000000
- ResourceGroupName: defaultresourcegroup-sea
- WorkspaceName: DefaultWorkspace-00000000-0000-0000-0000-000000000000-SEA
- TableType: Microsoft
## Notes
- Use the Azure MCP Server's `monitor-log-query` tool to query logs from the Log Analytics workspace. For the hours parameter, specify a window large enough to include the logs, such as 24 or 48, and do the finer time filtering in a where clause on `TimeGenerated`.
- When executing the query, replace template variables such as `{VpnGatewayResourceId}` with appropriate values.
## Troubleshooting Steps
### 1. Check Tunnel Status
Check the VPN tunnel status to confirm whether the connection is successfully established. Run the query below to check the final tunnel status.
```kql
AzureDiagnostics
| where TimeGenerated between (datetime({StartTime}) .. datetime({EndTime}))
| where Category == "TunnelDiagnosticLog"
| where ResourceId =~ "{VpnGatewayResourceId}"
| where remoteIP_s == "{RemoteIP}"
| project TimeGenerated, OperationName, status_s, stateChangeReason_s, instance_s
```
If the final state is "Connected", it means the connection is successful.
### 2. Check IKE Logs
Run the query below to check the IKE debug logs for the latest session.
```kql
AzureDiagnostics
| where TimeGenerated between (datetime({StartTime}) .. datetime({EndTime}))
| where Category == "IKEDiagnosticLog"
| where ResourceId =~ "{VpnGatewayResourceId}"
| where Message has "{RemoteIP}"
| project TimeGenerated, Message
| sort by TimeGenerated asc
```
IKE logs describe the exchange of IKE messages between the VPN gateway and the peer device. Below is an example sequence (for the Sender) of a normal IKE session.
```plaintext
[SEND][SA_INIT] Sending IKE SA_INIT packet with tunnelId 0x1, iCookie 0xA9261CDE9DABECD and rCookie 0x0
[RECEIVED]Received Ike payload: Policy1:Cipher=AES-CBC-256 Integrity=SHA1 DhGroup=DhGroup2 Policy2:Cipher=AES-CBC-256 Integrity=SHA256 DhGroup=DhGroup2 Policy3:Cipher=AES-CBC-128 Integrity=SHA1 DhGroup=DhGroup2 Policy4:Cipher=AES-CBC-128 Integrity=SHA256 DhGroup=DhGroup2 Policy5:Cipher=3DES Integrity=SHA1 DhGroup=DhGroup2 Policy6:Cipher=3DES Integrity=SHA256 DhGroup=DhGroup2
[RECEIVED] Receiving MM Packet for tunnel Id 0x0 iCookie 0x3E90974C33F7F48B and rCookie 0xAFADDC42504E1BD: Receiving MM Nonce
[SEND][SA_INIT] Sending IKE SA_INIT response packet with tunnelId 0x0, iCookie 0x3E90974C33F7F48B and rCookie 0xAFADDC42504E1BD
[SEND]Sending Ike payload for tunnel Id 0x0: Auth:PreShared Key- Responder Policy:Cipher=AES-CBC-256 Integrity=SHA1 DhGroup=DhGroup2
[RECEIVED][SA_AUTH] Received IKE AUTH message
[RECEIVED]Received Traffic Selector payload request- [Tsid 0x1 ]Number of TSIs 1: StartAddress 0.0.0.0 EndAddress 255.255.255.255 PortStart 0 PortEnd 65535 Protocol 0 Number of TSRs 1:StartAddress 0.0.0.0 EndAddress 255.255.255.255 PortStart 0 PortEnd 65535 Protocol 0
[SEND] Proposed Traffic Selector payload will be (Final Negotiated) - [Tsid 0x1 ]Number of TSIs 1: StartAddress 0.0.0.0 EndAddress 255.255.255.255 PortStart 0 PortEnd 65535 Protocol 0 Number of TSRs 1:StartAddress 0.0.0.0 EndAddress 255.255.255.255 PortStart 0 PortEnd 65535 Protocol 0
[RECEIVED]Received IPSec payload: Policy1:Cipher=AES-GCM-256 Integrity=AES-GCM-256 Policy2:Cipher=AES-CBC-256 Integrity=SHA1 Policy3:Cipher=3DES Integrity=SHA1 Policy4:Cipher=AES-CBC-256 Integrity=SHA256 Policy5:Cipher=AES-CBC-128 Integrity=SHA1 Policy6:Cipher=3DES Integrity=SHA256
[SEND][SA_AUTH] Sending SA AUTH response message for tunnelId 0x2 and tsId 0x1
[SEND]Sending IPSec policy Payload for tunnel Id 0x2, tsId 0x1: Policy1:Integrity=AES-GCM-256 Cipher=AES-GCM-256
[LOCAL_MSG] IKE Tunnel created for tunnelId 0x2
```
Naturally, what to include in the guide depends on the scenario, but it is generally good to include the following:
- Overview of the Troubleshooting Guide: Concisely describe what cases are in scope and for what purpose the guide is used.
- Environmental Information: Clearly state the information necessary to perform troubleshooting. For example, the resource IDs of the target resources, the IP addresses of connection destinations, and Log Analytics workspace information. This information is required for the operations performed by the agent.
- Diagnostic Procedures: Specifically describe what operations to execute. For example, templates for queries to be issued in the Log Analytics workspace or examples of Azure CLI commands.
- How to Interpret Results: Describe how to interpret the obtained information and what results are expected. For example, how to handle specific error codes or how to analyze log patterns. It is also helpful to provide samples of log patterns for both normal and abnormal states.
- Reference Information: Provide links to additional reference information and documentation. For example, Azure official documentation or links to similar past incidents.
- Notes: Describe points of caution when performing troubleshooting. For example, how to specify the time range when searching logs or how to use Azure MCP Server tools.
Writing in a structured format such as Markdown makes it easier for the agent to understand the content (e.g., the difference between a query and normal text becomes clear). The guide stays easy for humans to read and is interpreted appropriately by the agent during execution.
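One benefit of a structured format is that a guide can be checked mechanically before it is handed to an agent. The Python sketch below lists a guide's level-2 headings and reports any expected section that is missing. The expected names are taken from the example guide above and are an assumption; adjust them to your own template.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker

# Section names from the example guide; treat this list as an assumption.
EXPECTED_SECTIONS = ("Environmental Information", "Notes", "Troubleshooting Steps")


def guide_headings(markdown):
    """Return level-2 heading titles, skipping lines inside fenced code blocks."""
    headings, in_fence = [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match and not in_fence:
            headings.append(match.group(1))
    return headings


def missing_sections(markdown, expected=EXPECTED_SECTIONS):
    """Return the expected section names that the guide does not contain."""
    present = set(guide_headings(markdown))
    return [name for name in expected if name not in present]
```

For example, `missing_sections(open("tsg.md").read())` (with a hypothetical file name) returns an empty list when all expected sections are present.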
In actual operation, preparing such troubleshooting guides is expected to be a major challenge. Experienced engineers should take the lead in organizing the procedures, drawing on records of past responses.
Operation Demo
Using the S2S VPN troubleshooting guide shown earlier, I will introduce how it works in a demo video.
The scenario in this video assumes a tunnel disconnection event between an on-premises VPN device and a VPN Gateway. After intentionally triggering a disconnection by resetting the on-premises device, I asked GitHub Copilot to investigate the cause around that time.
Below is the final response from Copilot. The time taken to generate the response was about 30 seconds.
VPN Disconnection Incident Investigation Report (Around 2025/6/18 7:20)
1. Summary of Tunnel Status Logs
- 7:23:02 Tunnel Connected
- 7:50:57 Tunnel Disconnected (Remotely triggered)
- 7:50:58 Tunnel Reconnected
2. Excerpt of IKE Logs
- 7:22:14 The tunnel was closed once due to "Negotiation timed out," but reconnected normally afterward.
- 7:50:57 Tunnel disconnected due to "Received IPSec SA delete" and "Main mode SA lifetime expired or peer sent a main mode delete."
- 7:50:58 Reconnection process was performed immediately after, and IKE/SA/Traffic Selector/IPSec negotiations were completed successfully.
3. Summary
- The VPN tunnel was disconnected at 7:50:57 (UTC) due to being "Remotely triggered."
- Logs suggest the disconnection was caused by an SA deletion from the peer device or SA lifetime expiration.
- Reconnection occurred automatically immediately after the disconnection, and IKE/IPSec negotiation was completed successfully.
- This was a temporary disconnection, and no ongoing failure is occurring.
Reference Logs (Excerpt)
2025/06/18 7:50:57 [TunnelDisconnected] Disconnected (Remotely triggered)
2025/06/18 7:50:57 [RECEIVED][SA_DELETE] Received IPSec SA delete message for tunnelid 0x2 and tsid 0x1
2025/06/18 7:50:57 [LOCAL_MSG] IKE Tunnel closed for tunnelId 0x2 with status Main mode SA lifetime expired or peer sent a main mode delete.
2025/06/18 7:50:58 [TunnelConnected] Connected (RemotelyTriggered)
Action
- This disconnection was due to an operation on the peer device side or SA lifetime expiration, and no abnormalities were seen on the Azure side.
- Continued monitoring is recommended, but no additional action is required at this time.
The content of the response is accurate. The request gave the time as "around 7:20 UTC," which was deliberately wrong; the tunnel reset was actually performed on the peer router around 7:50. User-reported times are often inaccurate in incident response, so I tested this on purpose, and the agent handled it well in this case.
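Because a reported time may be off, the search window should be wider than the report suggests. The following Python sketch computes such a window, assuming a symmetric margin of one hour. The margin is an arbitrary choice for illustration and is not what Copilot did in the demo.

```python
from datetime import datetime, timedelta, timezone


def search_window(reported, margin=timedelta(hours=1)):
    """Return (start, end) surrounding the reported time by `margin` on each side."""
    return reported - margin, reported + margin


def to_kql_datetime(value):
    """Format a timezone-aware datetime as a UTC literal usable in datetime(...)."""
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Reported time from the demo request (07:20 UTC); the reset happened around 07:50.
reported = datetime(2025, 6, 18, 7, 20, tzinfo=timezone.utc)
start, end = search_window(reported)
print(to_kql_datetime(start), to_kql_datetime(end))
```

With a one-hour margin, the actual disconnection at 7:50:57 falls inside the window even though the reported time was 30 minutes early.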
In generating this response, GitHub Copilot executed the following steps:
- Retrieval of tunnel status logs: Retrieve tunnel status logs at the time of the event from the `AzureDiagnostics` table to check the final state of the tunnel.
- Retrieval of IKE logs: Retrieve IKE debug logs around the time of the event from the `AzureDiagnostics` table to confirm the exchange of IKE messages that caused the tunnel disconnection.
- Analysis of results: Analyze the retrieved logs and summarize the cause of the tunnel disconnection and the status of reconnection.
It executed the steps almost exactly as specified in the troubleshooting guide, but also made fine adjustments according to the context. For example, it performed the following detailed corrections:
- Specified the time range for `TimeGenerated` as `between (datetime(2025-06-18T07:00:00Z) .. datetime(2025-06-18T08:00:00Z))`. While the snippet used template variables like `{StartTime}` and `{EndTime}`, it replaced them with actual times.
- Specified the maximum number of logs to retrieve as 100. It appropriately limited the number of logs to avoid retrieving an excessive amount.
- Added `sort by TimeGenerated asc`. It supplemented the sorting by time, which was missing in the template query for retrieving tunnel status logs.
This kind of supplemental capability depends on the ability of the language model being used. If a powerful model is not available or if the behavior is not as expected, you will need to review the prompt instructions or the description of the troubleshooting guide.
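If a model does not fill in the template variables reliably, the substitution can be done deterministically before the query is handed over. This Python sketch replaces `{Name}` placeholders and fails loudly when a value is missing. The function name `fill_template` is hypothetical, and the approach is a fallback I am suggesting, not part of Azure MCP Server.

```python
import re

# Matches placeholders such as {StartTime}; other braces in KQL are left alone.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_template(template, values):
    """Replace every {Name} placeholder; raise KeyError if a value is missing."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(values))
    if missing:
        raise KeyError("missing template values: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


template = (
    "AzureDiagnostics\n"
    "| where TimeGenerated between (datetime({StartTime}) .. datetime({EndTime}))\n"
    '| where Category == "TunnelDiagnosticLog"\n'
    "| sort by TimeGenerated asc"
)
query = fill_template(
    template,
    {"StartTime": "2025-06-18T07:00:00Z", "EndTime": "2025-06-18T08:00:00Z"},
)
print(query)
```

A regular expression is used instead of `str.format` so that braces appearing elsewhere in a KQL query (for example in a dynamic literal) do not cause formatting errors.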
Potential Applications
The usage methods introduced here are just examples. By leveraging Azure MCP Server, various incident response automations become possible. For instance, the following applications can be considered:
- Regular Monitoring and Alerting: By regularly issuing instructions for log investigation, you can build a mechanism that detects anomalies and issues alerts. This leads to the identification of nuanced issues (e.g., Gray Failure[2]) that cannot be determined by metrics or log queries alone, making proactive response easier.
- Healing Proposals and Automation: You can also propose actions following a diagnosis. For example, if a VPN connection state is found to be abnormal, a mechanism could allow the LLM to automatically prompt for a VPN Gateway reset. Furthermore, since Azure MCP Server can execute Azure CLI commands, it is also possible to automatically execute a reset with user permission. This significantly shortens the time to resolution.
- Interlinking Multiple Logs: By linking multiple Azure resources, more complex troubleshooting can be automated. For example, when a network issue occurs, the status of related virtual machines or storage accounts can be checked cross-sectionally, and the cause of the problem can be identified through comprehensive judgment.
- User Interface Improvements: If this functionality is turned into a bot, it can be easily operated from chat applications like Teams or Slack. Additionally, improvements such as multi-functionalization or multi-agent systems could be considered, where the system also handles the selection of the appropriate troubleshooting guide[3].
- Automatic Generation of Troubleshooting Guides: Troubleshooting guides can be created based on human response history. If past response histories are stored in communication tools or ticket management systems, it is possible to use AI to automatically generate troubleshooting guides. This reduces the burden of creating guides and enables faster responses.
Conclusion
By leveraging the Azure MCP Server, incident response on Azure can be streamlined. In particular, a flow where you write a troubleshooting guide and have an agent execute it can be easily realized with the Azure MCP Server.
In this article, using S2S VPN connection troubleshooting as a sample, I introduced how to write a troubleshooting guide and how it works in practice. In the demo video, the diagnosis of the incident was completed in approximately 30 seconds. Done manually, even an experienced engineer would need anywhere from several minutes to over ten minutes, and around 30 minutes including report generation would not be surprising. Once the system is in place, the agent can diagnose issues tens of times faster.
Such automation efforts are extremely effective in reducing the burden of incident response and achieving faster problem resolution. As the utilization of AI and MCP continues to progress, it is expected that the efficiency of incident response will be further enhanced in the future.
[1] Enhancing Microsoft network reliability with AIOps and Network Infrastructure Copilot
[2] Gray Failure: The Achilles’ Heel of Cloud-Scale Systems | USENIX
[3] As various functions are added to the agent, it will become something closer to an Azure SRE Agent: Azure SRE Agent Overview (Preview) - Azure App Service | Microsoft Learn