Practical Incident Response Checklist for Incident Commanders
Overview
This document is my personal template for incident commanders, written to be usable during real incidents. Unlike checklists for technical responders, it is structured to help the commander decide quickly and stay composed while leading the response.
0. Preparation
- Is an incident response channel (Slack/Teams) available?
- Can an Incident Commander (IC) be assigned immediately?
- Is there access to logs and monitoring dashboards?
1. Detection and Notification
- Did the person who detected the incident announce it immediately?
- Did monitoring alerts trigger?
- Have reports from end-users been triaged?
2. Initial Response (Target: within 10 minutes)
- Has an IC been assigned? (Centralize command authority)
- Has the scope of impact been confirmed? (Number of users/service functions)
- Have you declared the incident level (S0–S3) based on the severity of the impact?
- Have you reported to stakeholders according to the notification rules?
3. Interim Mitigation (Target: within 30 minutes)
- Have interim measures been taken to stop the service impact? (e.g., blocking traffic / rolling back / disabling specific features)
- Have you notified users? (Status page / social media / internal announcement)
- Have relevant teams been assembled?
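The 10-minute and 30-minute targets in sections 2 and 3 can be tracked with a small timer check. The following is a minimal sketch, not part of the original template: the phase names, the `overdue_phases` helper, and the choice to measure both targets from the moment of detection are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Targets taken from the checklist headings. Measuring both from the
# detection time is an assumption; the checklist does not say when the
# clock starts.
TARGETS = {
    "initial_response": timedelta(minutes=10),
    "interim_mitigation": timedelta(minutes=30),
}


def overdue_phases(detected_at: datetime, now: datetime, completed: set[str]) -> list[str]:
    """Return the phases whose target time has passed without completion."""
    elapsed = now - detected_at
    return [
        phase
        for phase, target in TARGETS.items()
        if phase not in completed and elapsed > target
    ]


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 12, 0)
    # 15 minutes in, nothing completed yet: only the 10-minute target is missed.
    print(overdue_phases(detected, detected + timedelta(minutes=15), set()))
```

A scribe could run a check like this at each status update so the IC hears about a missed target without having to watch the clock.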
4. Root Cause Analysis
- Have you identified the trigger of the incident? (Code / Configuration / External factors)
- Have you organized the timeline based on logs and metrics?
- Have you evaluated the possibility of recurrence?
5. Recovery
- Has the production environment returned to normal?
- Has the impact on users been completely resolved?
- Have the interim measures been removed?
6. Post-Incident Response (Post-mortem)
- Have you compiled the timeline from occurrence to recovery?
- Have you defined permanent solutions (recurrence prevention measures)?
- Have you held a review meeting? (Covering both technical and organizational aspects)
- Have you summarized the learnings in one sentence and registered them in the knowledge base?
Roles (RACI Mini)
- IC: Incident Commander
- DRV: Technical Driver
- COMMS: Communications Lead
- SCRIBE: Scribe
Status Results
- ✅ CLOSE (Completed) → All items are Yes
- ⚠️ PENDING (Issues remaining) → Root cause analysis or permanent solutions incomplete
- ⛔ OPEN (Action required) → Recovery has not even been achieved
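The three status results above can be read as a simple decision rule. The sketch below is one possible encoding, assuming each checklist section is summarized as a single Yes/No value. The section keys and the `incident_status` function are hypothetical names, and treating any incomplete item after recovery as PENDING is an assumption that goes slightly beyond the two cases the list names.

```python
def incident_status(sections: dict[str, bool]) -> str:
    """Map per-section Yes/No results to CLOSE, PENDING, or OPEN."""
    # OPEN: recovery has not been achieved, whatever else is done.
    if not sections.get("recovery", False):
        return "OPEN"
    # CLOSE: every item is Yes.
    if all(sections.values()):
        return "CLOSE"
    # PENDING: recovered, but something remains, typically root cause
    # analysis or permanent solutions.
    return "PENDING"


if __name__ == "__main__":
    result = incident_status({
        "recovery": True,
        "root_cause_analysis": True,
        "permanent_solutions": False,
    })
    print(result)  # PENDING
```

Checking recovery first reflects the ordering in the list: an incident that has not recovered stays OPEN even if the analysis work is already finished.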