Practical Incident Response Checklist for Incident Commanders

Overview

This document is my personal checklist template for incident commanders, written to be usable during a real incident. Unlike checklists aimed at technical responders, it is structured so that the commander can make decisions quickly and stay composed while leading the response.

0. Preparation

  • Is an incident response channel (Slack/Teams) available?
  • Can an Incident Commander (IC) be assigned immediately?
  • Is there access to logs and monitoring dashboards?

1. Detection and Notification

  • Did the person who detected the incident announce it immediately?
  • Did monitoring alerts trigger?
  • Have reports from end-users been triaged?

2. Initial Response (Target: within 10 minutes)

  • Has an IC been assigned? (Centralize command authority)
  • Has the scope of impact been confirmed? (Number of users/service functions)
  • Have you declared the incident level (S0–S3) based on the severity of the impact?
  • Have you reported to stakeholders according to the notification rules?

3. Interim Mitigation (Target: within 30 minutes)

  • Have interim measures been taken to stop the service impact?
    (e.g., blocking traffic / rolling back / disabling specific features)
  • Have you notified users? (Status page / social media / internal announcement)
  • Have relevant teams been assembled?
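
The target times in sections 2 and 3 can be tracked mechanically so the IC does not have to watch the clock. The following is a minimal sketch, not part of the original template: the names `TARGETS` and `overdue_phases` are hypothetical, and measuring both targets from the detection time is an assumption, since the checklist does not state the reference point.

```python
from datetime import datetime, timedelta

# Targets taken from the section headings (10 and 30 minutes).
# Assumption: both are measured from the time of detection.
TARGETS = {
    "initial_response": timedelta(minutes=10),
    "interim_mitigation": timedelta(minutes=30),
}


def overdue_phases(detected_at, completed_at, now):
    """Return the phases that missed their target.

    completed_at maps a phase name to its completion time; phases that
    are still in progress are simply absent from the mapping.
    """
    overdue = []
    for phase, target in TARGETS.items():
        deadline = detected_at + target
        finished = completed_at.get(phase)
        finished_late = finished is not None and finished > deadline
        unfinished_past_deadline = finished is None and now > deadline
        if finished_late or unfinished_past_deadline:
            overdue.append(phase)
    return overdue
```

For example, with detection at 12:00, initial response completed at 12:08, and the current time at 12:35, only `interim_mitigation` is reported as overdue.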

4. Root Cause Analysis

  • Have you identified the trigger of the incident? (Code / Configuration / External factors)
  • Have you organized the timeline based on logs and metrics?
  • Have you evaluated the possibility of recurrence?
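
Organizing the timeline usually means merging events from several sources into one chronological list. The sketch below shows one possible way to do that. The `Event` type, its fields, and `build_timeline` are hypothetical names introduced for illustration, and the source labels ("alert", "log") are only examples.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    at: datetime   # when the event happened
    source: str    # where it was observed, e.g. "alert" or "log"
    note: str      # one-line description


def build_timeline(*sources: list[Event]) -> list[str]:
    """Merge events from all sources and format them in time order."""
    merged = sorted(
        (event for source in sources for event in source),
        key=lambda event: event.at,
    )
    return [f"{e.at:%H:%M} [{e.source}] {e.note}" for e in merged]
```

Keeping the source label on each line makes it easier to see later which facts came from monitoring and which came from human reports.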

5. Recovery

  • Has the production environment returned to normal?
  • Has the impact on users been completely resolved?
  • Have the interim measures been removed?

6. Post-Incident Response (Post-mortem)

  • Have you compiled the timeline from occurrence to recovery?
  • Have you defined permanent solutions (recurrence prevention measures)?
  • Have you held a review meeting? (Covering both technical and organizational aspects)
  • Have you summarized the lessons learned in one sentence and recorded them in the knowledge base?

Roles (RACI Mini)

  • IC: Incident Commander
  • DRV: Technical Driver
  • COMMS: Communications Lead
  • SCRIBE: Scribe

Status Results

  • ✅ CLOSE (Completed) → All items are Yes
  • ⚠️ PENDING (Issues remaining) → Root cause analysis or permanent solutions incomplete
  • ⛔ OPEN (Action required) → Recovery has not yet been achieved
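
The three results above can be derived from the checklist answers. The following is a minimal sketch under stated assumptions: the `Status` enum, the `evaluate_status` function, and the section keys such as `"recovery"` are hypothetical names, and treating any remaining "No" after recovery as PENDING is an interpretation of the rules rather than something the template spells out.

```python
from enum import Enum


class Status(Enum):
    CLOSE = "CLOSE"      # all items are Yes
    PENDING = "PENDING"  # recovered, but follow-up work remains
    OPEN = "OPEN"        # recovery has not been achieved


def evaluate_status(checklist: dict[str, list[bool]]) -> Status:
    """Map Yes/No answers per section to one of the three results.

    Assumption: the "recovery" key holds the answers for section 5.
    A missing "recovery" entry is treated as not recovered.
    """
    if not all(checklist.get("recovery", [False])):
        return Status.OPEN
    if all(all(answers) for answers in checklist.values()):
        return Status.CLOSE
    # Recovered, but at least one other item (for example root cause
    # analysis or permanent solutions) is still No.
    return Status.PENDING
```

Checking recovery first mirrors the priority in the list: an incident that has not recovered is OPEN regardless of how the other sections look.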
