incident-management
Use this skill when managing production incidents, designing on-call rotations, writing runbooks, conducting post-mortems, setting up status pages, or running war rooms. Triggers on incident response, incident commander, on-call schedule, pager escalation, runbook authoring, post-incident review, blameless retro, status page updates, war room coordination, severity classification, and any task requiring structured incident lifecycle management.
incident-management is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers managing production incidents, designing on-call rotations, writing runbooks, conducting post-mortems, setting up status pages, and running war rooms.
Quick Facts
| Field | Value |
|---|---|
| Category | operations |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill incident-management
- The incident-management skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
Incident management is the structured practice of detecting, responding to, resolving, and learning from production failures. It spans the full incident lifecycle - from the moment an alert fires through war room coordination, customer communication via status pages, and the post-mortem that prevents recurrence. This skill provides actionable frameworks for each phase: on-call rotation design, runbook authoring, severity classification, war room protocols, status page communication, and blameless post-mortems. Built for engineering teams that want to move from chaotic firefighting to repeatable, calm incident response.
Tags
incidents on-call runbooks post-mortems status-pages war-rooms
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is incident-management?
Use this skill when managing production incidents, designing on-call rotations, writing runbooks, conducting post-mortems, setting up status pages, or running war rooms. Triggers on incident response, incident commander, on-call schedule, pager escalation, runbook authoring, post-incident review, blameless retro, status page updates, war room coordination, severity classification, and any task requiring structured incident lifecycle management.
How do I install incident-management?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill incident-management in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support incident-management?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Incident Management
Incident management is the structured practice of detecting, responding to, resolving, and learning from production failures. It spans the full incident lifecycle - from the moment an alert fires through war room coordination, customer communication via status pages, and the post-mortem that prevents recurrence. This skill provides actionable frameworks for each phase: on-call rotation design, runbook authoring, severity classification, war room protocols, status page communication, and blameless post-mortems. Built for engineering teams that want to move from chaotic firefighting to repeatable, calm incident response.
When to use this skill
Trigger this skill when the user:
- Needs to design or improve an on-call rotation or escalation policy
- Wants to write, review, or templatize a runbook for an alert or service
- Is conducting, writing, or facilitating a post-mortem / post-incident review
- Needs to set up or improve a status page and customer communication strategy
- Is running or setting up a war room for an active incident
- Wants to define severity levels or incident classification criteria
- Needs an incident commander playbook or role definitions
- Is building incident response tooling or automation
Do NOT trigger this skill for:
- Defining SLOs, SLIs, or error budgets without an incident context (use site-reliability skill)
- Infrastructure provisioning or deployment pipeline design (use CI/CD or cloud skills)
Key principles
Incidents are system failures, not people failures - Every incident reflects a gap in the system: missing automation, insufficient monitoring, unclear runbooks, or architectural fragility. Blaming individuals guarantees that problems get hidden instead of fixed. Design every process around surfacing systemic issues.
Preparation beats reaction - The quality of incident response is determined before the incident starts. Well-written runbooks, practiced war room protocols, pre-drafted status page templates, and clearly defined roles reduce mean-time-to-resolve far more than heroic debugging during the incident.
Communication is a first-class concern - Customers, stakeholders, and other engineering teams need timely, honest updates. A status page update every 30 minutes during an outage builds trust. Silence destroys it. Assign a dedicated communications role in every major incident.
Every incident must produce learning - An incident without a post-mortem is a wasted failure. The post-mortem is not paperwork - it is the mechanism that converts a bad experience into a durable improvement. Action items without owners and deadlines are wishes, not commitments.
On-call must be sustainable - Unsustainable on-call leads to burnout, attrition, and slower incident response. Track on-call load metrics, enforce rest periods, and treat excessive paging as a reliability problem to fix, not a cost of doing business.
Core concepts
Incident lifecycle
Detection -> Triage -> Response -> Resolution -> Post-mortem -> Prevention

- Detection: alerts fire
- Triage: severity assigned
- Response: war room stands up
- Resolution: fix/rollback deployed
- Post-mortem: review and learn
- Prevention: action items tracked

Every phase has a defined owner, a set of artifacts, and a handoff to the next phase. Gaps between phases - especially between resolution and post-mortem - are where learning gets lost.
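The lifecycle above can be sketched as an ordered state machine. This is an illustrative Python sketch (not part of the skill itself); it rejects phase-skipping, which is exactly the resolution-to-post-mortem gap where learning gets lost.

```python
# Illustrative sketch only: the incident lifecycle as an ordered state machine.
PHASES = ["detection", "triage", "response", "resolution", "post-mortem", "prevention"]

class Incident:
    def __init__(self):
        self.phase = PHASES[0]

    def advance(self, next_phase):
        # Only the immediate next phase is legal; jumping from resolution
        # straight to prevention (closing without a post-mortem) is rejected.
        if self.phase == PHASES[-1]:
            raise ValueError("incident lifecycle already complete")
        expected = PHASES[PHASES.index(self.phase) + 1]
        if next_phase != expected:
            raise ValueError(f"cannot move {self.phase} -> {next_phase}; next is {expected}")
        self.phase = next_phase
```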
Incident roles
| Role | Responsibility | When assigned |
|---|---|---|
| Incident Commander (IC) | Owns the response, delegates work, makes decisions | SEV1/SEV2 immediately |
| Communications Lead | Updates status page, stakeholders, and support teams | SEV1/SEV2 immediately |
| Technical Lead | Drives root cause investigation and fix implementation | All severities |
| Scribe | Maintains the incident timeline in real-time | SEV1; optional for SEV2 |
Role assignment rule: For SEV1, all four roles must be filled within 15 minutes. For SEV2, IC and Technical Lead are mandatory. For SEV3+, the on-call engineer handles all roles.
Severity classification
| Severity | Customer impact | Response time | War room | Status page |
|---|---|---|---|---|
| SEV1 | Complete outage or data loss | Page immediately, 5-min ack | Required | Required |
| SEV2 | Degraded core functionality | Page on-call, 15-min ack | Recommended | Required |
| SEV3 | Minor degradation, workaround exists | Next business day | No | Optional |
| SEV4 | Cosmetic or internal-only | Backlog | No | No |
Escalation rule: If a SEV2 is not mitigated within 60 minutes, escalate to SEV1 procedures. If the on-call engineer cannot classify severity within 10 minutes, default to SEV2 until more information is available.
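The two classification rules above can be encoded directly. This is a hedged sketch; the function and field names are illustrative, not part of any tool.

```python
# Sketch of the classification defaults: unknown severity becomes SEV2 after
# 10 minutes, and an unmitigated SEV2 escalates to SEV1 after 60 minutes.
def effective_severity(declared, minutes_since_alert, minutes_at_sev2=0, mitigated=False):
    if declared is None:
        # Rule: if severity cannot be classified within 10 minutes, default to SEV2.
        return "SEV2" if minutes_since_alert >= 10 else "unclassified"
    if declared == "SEV2" and not mitigated and minutes_at_sev2 >= 60:
        # Rule: a SEV2 not mitigated within 60 minutes follows SEV1 procedures.
        return "SEV1"
    return declared
```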
Common tasks
Design an on-call rotation
Rotation structure:
- Primary on-call: First responder. Acks within 5 min (SEV1) or 15 min (SEV2).
- Secondary on-call: Backup if the primary misses the ack window. Auto-escalated by the pager.
- Manager escalation: If both primary and secondary miss the ack window; also joins SEV1 war rooms.

Scheduling guidelines:
- Rotate weekly. Never assign the same person two consecutive weeks without a gap.
- Minimum team size for sustainable on-call: 5 engineers (allows 1-in-5 rotation).
- Follow-the-sun for distributed teams: hand off to the next timezone instead of paging at 3am. Each region covers business hours + 2 hours buffer.
- Provide comp time or additional pay for after-hours pages. Track and review quarterly.
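The weekly-rotation guideline above can be sketched as a simple round-robin generator. This is illustrative only; engineer names are placeholders. With the recommended minimum of 5 engineers, round-robin assignment guarantees no one serves two consecutive weeks.

```python
from itertools import cycle

# Sketch: assign one primary on-call per week, round-robin over the team.
def weekly_rotation(engineers, weeks):
    if len(engineers) < 2:
        raise ValueError("a rotation needs at least 2 engineers")
    return [name for name, _ in zip(cycle(engineers), range(weeks))]
```

A secondary rotation can reuse the same function with the list rotated by one position, so the backup is never the same person as the primary.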
On-call health metrics:
| Metric | Healthy | Unhealthy |
|---|---|---|
| Pages per on-call week | < 5 | > 10 |
| After-hours pages per week | < 2 | > 5 |
| Mean time-to-ack (SEV1) | < 5 min | > 15 min |
| Mean time-to-ack (SEV2) | < 15 min | > 30 min |
| Percentage of pages with runbooks | > 80% | < 50% |
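The thresholds in the table can be applied mechanically when reviewing on-call health. This sketch covers the lower-is-better metrics; values between the healthy and unhealthy bounds report as "watch". (The runbook-coverage metric is higher-is-better and omitted here.) Metric names are illustrative.

```python
# Sketch: (healthy_below, unhealthy_above) bounds copied from the table above.
THRESHOLDS = {
    "pages_per_week": (5, 10),
    "after_hours_pages_per_week": (2, 5),
    "mtta_sev1_minutes": (5, 15),
    "mtta_sev2_minutes": (15, 30),
}

def rate(metric, value):
    healthy_below, unhealthy_above = THRESHOLDS[metric]
    if value < healthy_below:
        return "healthy"
    if value > unhealthy_above:
        return "unhealthy"
    return "watch"  # in between: trending toward a problem
```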
Write a runbook
Every runbook must contain these sections:
Title: [Alert name] - [Service name] Runbook
Last updated: [date]
Owner: [team or individual]
1. SYMPTOM
What the alert tells you. Quote the alert condition verbatim.
2. IMPACT
Who is affected. Severity level. Business impact in plain language.
3. INVESTIGATION STEPS
Numbered steps. Each step has:
- What to check (command, dashboard link, or query)
- What a normal result looks like
- What an abnormal result means and what to do next
4. MITIGATION STEPS
Numbered steps to stop the bleeding. Prioritize speed over elegance.
Include rollback commands, feature flag toggles, and traffic shift procedures.
5. ESCALATION
Who to contact if steps 3-4 do not resolve the issue within [N] minutes.
Include name, team, and pager handle.
6. CONTEXT
Links to: service architecture doc, relevant dashboards, past incidents,
and the service's on-call schedule.

Runbook quality test: A new team member who has never seen this service should be able to follow the runbook and either resolve the issue or escalate correctly within 30 minutes.
Conduct a post-mortem
When to hold one: Every SEV1. Every SEV2 with customer impact. Any incident consuming more than 4 hours of engineering time. Recurring SEV3s from the same cause.
Timeline:
Hour 0: Incident resolved. IC assigns post-mortem owner.
Day 1: Owner drafts timeline and initial analysis.
Day 2-3: Facilitated post-mortem meeting (60-90 minutes).
Day 3-4: Draft published for 24-hour review period.
Day 5: Final version published. Action items entered in tracker.
Day 30: Action item review - are they done?

The five post-mortem questions:
- What happened? (factual timeline with timestamps)
- Why did it happen? (root cause analysis - use the "five whys" technique)
- Why was it not detected sooner? (monitoring and alerting gap)
- What slowed down the response? (process and tooling gap)
- What prevents recurrence? (action items)
Action item rules: Every action item must have an owner, a due date, a priority (P0/P1/P2), and a measurable definition of done. "Improve monitoring" is not an action item. "Add latency p99 alert for checkout-api with a 500ms threshold, owned by @alice, due 2026-04-01" is.
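The action-item rules above lend themselves to a lint check before a post-mortem is finalized. This is a hedged sketch; the field names are illustrative, not a real tracker schema.

```python
import re

# Sketch: every action item needs an owner, a due date, a priority, and a
# measurable definition of done.
REQUIRED_FIELDS = ("owner", "due_date", "priority", "definition_of_done")

def action_item_problems(item):
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not item.get(field)]
    if item.get("priority") and item["priority"] not in ("P0", "P1", "P2"):
        problems.append("priority must be P0, P1, or P2")
    if item.get("due_date") and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", item["due_date"]):
        problems.append("due_date must be YYYY-MM-DD")
    return problems  # empty list means the item passes
```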
See references/postmortem-template.md for the full template.
Set up a status page
Page structure:
Components:
- Group by user-facing service (API, Dashboard, Mobile App, Webhooks)
- Each component has a status: Operational | Degraded | Partial Outage | Major Outage
- Show uptime percentage over 90 days per component
Incidents:
- Title: clear, customer-facing description (not internal jargon)
- Updates: timestamped entries showing investigation progress
- Resolution: what was fixed and what customers need to do (if anything)
Maintenance:
- Scheduled windows with start/end times in customer's timezone
- Description of impact during the window

Communication cadence during incidents:
| Phase | Update frequency | Content |
|---|---|---|
| Investigating | Every 30 min | "We are aware and investigating" + symptoms |
| Identified | Every 30 min | Root cause identified, ETA if known |
| Monitoring | Every 60 min | Fix deployed, monitoring for stability |
| Resolved | Once | Summary of what happened and what was fixed |
Writing rules for status updates:
- Use plain language. No internal service names, error codes, or jargon.
- State the customer impact first, then what you are doing about it.
- Never say "no impact" if customers reported problems.
- Include timezone in all timestamps.
Run a war room
War room activation criteria: Any SEV1. Any SEV2 not mitigated within 30 minutes. Any incident affecting multiple services or teams.
War room protocol:
Minute 0-5: IC opens the war room (video call + shared channel).
IC states: incident summary, current severity, affected services.
IC assigns roles: Communications Lead, Technical Lead, Scribe.
Minute 5-15: Technical Lead drives initial investigation.
Scribe starts the timeline document.
Communications Lead posts first status page update.
Every 15 min: IC runs a checkpoint:
- "What do we know now?"
- "What are we trying next?"
- "Do we need to escalate or bring in more people?"
- "Is the status page current?"
Resolution: IC confirms the fix is deployed and metrics are recovering.
Communications Lead posts resolution update.
IC schedules the post-mortem and assigns an owner.
War room closed.

War room rules:
- One conversation at a time. IC moderates.
- No side investigations without telling the IC.
- All commands run against production are announced before execution.
- The scribe logs every significant action with a timestamp.
- If the war room exceeds 2 hours, IC rotates or brings a fresh IC.
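The 15-minute checkpoint cadence and the 2-hour IC rotation rule above can be pre-computed when a war room opens. This is an illustrative sketch; times and labels are assumptions, not a real tool's output.

```python
from datetime import datetime, timedelta

# Sketch: emit IC checkpoint times every 15 minutes, flagging the 2-hour mark
# where the IC should rotate out.
def checkpoint_schedule(start, duration_minutes):
    events = []
    for minute in range(15, duration_minutes + 1, 15):
        when = start + timedelta(minutes=minute)
        note = " - rotate or bring in a fresh IC" if minute == 120 else ""
        events.append(when.strftime("%H:%M") + " IC checkpoint" + note)
    return events
```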
Build an escalation policy
Escalation ladder:
Level 0: Automated response (auto-restart, auto-scale, circuit breaker)
Level 1: On-call engineer (primary)
Level 2: On-call engineer (secondary) + team lead
Level 3: Engineering manager + dependent service on-calls
Level 4: Director/VP + incident commander (SEV1 only)

Escalation triggers:
| Trigger | Action |
|---|---|
| Primary on-call does not ack within 5 min (SEV1) | Auto-page secondary |
| No mitigation progress after 30 min | Escalate one level |
| Customer-reported incident (not alert-detected) | Escalate one level immediately |
| Incident spans multiple services | Page all affected service on-calls |
| Data loss suspected | Immediate SEV1, escalate to Level 4 |
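The trigger table maps cleanly onto a set of checks over observed incident state, which is useful when wiring these rules into pager automation. This is a hedged sketch; the state key names are illustrative.

```python
# Sketch: evaluate the escalation triggers from the table against the
# current incident state and return the prescribed actions.
def escalation_actions(state):
    actions = []
    if state.get("severity") == "SEV1" and state.get("minutes_unacked", 0) >= 5:
        actions.append("auto-page secondary")
    if state.get("minutes_without_mitigation_progress", 0) >= 30:
        actions.append("escalate one level")
    if state.get("customer_reported"):
        actions.append("escalate one level immediately")
    if state.get("services_affected", 1) > 1:
        actions.append("page all affected service on-calls")
    if state.get("data_loss_suspected"):
        actions.append("declare SEV1 and escalate to Level 4")
    return actions
```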
Anti-patterns / common mistakes
| Mistake | Why it is wrong | What to do instead |
|---|---|---|
| No runbooks for alerts | Every page becomes an investigation from scratch; MTTR skyrockets | Treat "alert without runbook" as a blocking issue; write the runbook during the incident |
| Blameful post-mortems | Engineers hide mistakes, avoid risk, and stop reporting near-misses | Use a blameless template; explicitly ban naming individuals as root causes |
| Status page updates only at resolution | Customers assume you do not know or do not care; support tickets flood in | Update every 30 minutes minimum; assign a dedicated Communications Lead |
| On-call without compensation or rotation limits | Burnout, attrition, and degraded response quality | Cap rotations, provide comp time, track health metrics quarterly |
| War rooms without an Incident Commander | Multiple people investigate the same thing, no one communicates, chaos | Always assign an IC first; the IC's job is coordination, not debugging |
| Post-mortem action items with no owner or deadline | Items rot in a document; the same incident repeats | Every action item needs: owner, due date, priority, and definition of done |
Gotchas
Severity escalation delays compound MTTR - The most common cause of a 2-hour incident that should have taken 30 minutes is a 45-minute delay in escalating from SEV3 to SEV2. The escalation rule "if no mitigation progress after 30 minutes, escalate one level" is not optional - build it into your pager escalation policy as an automatic trigger, not a judgment call.
Post-mortem action items decay without a 30-day review - Action items written in the heat of post-mortem often get deprioritized as new features take over the sprint. Without a mandatory 30-day follow-up meeting with the IC and action item owners, the same incident repeats within 6 months. Treat action item review as a blocking ceremony, not a nice-to-have.
Status page updates that use internal jargon erode customer trust - Saying "the Kafka consumer group is lagging due to a partition rebalance" confuses customers and implies you don't know how to communicate. Customers need to know the symptom they're experiencing, whether you're aware, and when you expect resolution. Translate everything to user impact before posting.
War rooms without a single Incident Commander devolve into chaos - When multiple senior engineers simultaneously investigate, propose fixes, and run commands against production without coordination, changes step on each other and the true root cause gets masked by noise. The IC role is not debugging - it is traffic control. Assign an IC before anyone runs a single query.
Runbooks that haven't been tested under stress are not runbooks - A runbook that works when you write it (calm, familiar with the system, full context) may be unusable at 3am by a tired on-call engineer seeing the service for the first time. Run fire drills where engineers who didn't write the runbook follow it end-to-end. Gaps in instructions surface immediately.
References
For detailed guidance on specific incident management domains, load the relevant
file from references/:
- references/postmortem-template.md - full blameless post-mortem template with example entries, facilitation guide, and action item tracker format
- references/runbook-template.md - detailed runbook template with example investigation steps and mitigation procedures
- references/status-page-guide.md - status page setup guide with communication templates and incident update examples
- references/war-room-checklist.md - war room activation checklist, role cards, and checkpoint script
Only load a references file when the current task requires it.
References
postmortem-template.md
Post-mortem Template
Document header
Title: [SEV level] [Brief description of the incident]
Date: [Date of incident]
Duration: [Start time - End time, including timezone]
Authors: [Post-mortem owner]
Status: Draft | In Review | Final
Severity: SEV1 | SEV2 | SEV3
Services affected: [List of affected services]
Customer impact: [Brief description of user-facing impact]

1. Summary
Write 3-5 sentences covering: what broke, who was affected, how long it lasted, and how it was resolved. This should be readable by a non-engineer.
Example:
On 2026-03-10 between 14:22 and 15:47 UTC, the checkout service returned 500 errors for approximately 30% of payment requests. An estimated 2,400 customers were unable to complete purchases during this window. The root cause was a connection pool exhaustion triggered by a configuration change deployed at 14:15 UTC. The incident was resolved by rolling back the configuration change and increasing the connection pool size.
2. Timeline
Use UTC timestamps. Include both automated events (alerts, deploys) and human actions (who did what).
14:15 UTC - Deploy #4521 pushed to production (config change to DB pool settings)
14:22 UTC - Checkout-api error rate alert fires (threshold: 1%, observed: 8%)
14:24 UTC - On-call engineer @alice acks the page
14:27 UTC - @alice opens war room, assigns IC role to @bob
14:30 UTC - Status page updated: "Investigating increased errors on checkout"
14:35 UTC - @alice identifies connection pool exhaustion in service metrics
14:40 UTC - @alice correlates with deploy #4521 timeline
14:45 UTC - Decision: rollback deploy #4521
14:48 UTC - Rollback initiated
14:55 UTC - Rollback complete. Error rate dropping.
15:00 UTC - Status page updated: "Fix deployed, monitoring"
15:30 UTC - Error rate back to baseline (0.05%)
15:47 UTC - IC @bob declares incident resolved
15:47 UTC - Status page updated: "Resolved"

3. Root cause analysis
What happened
Describe the technical chain of events. Be specific about the failure mode.
Five whys
Why 1: Why did checkout fail?
-> Connection pool was exhausted; new requests could not get a DB connection.
Why 2: Why was the connection pool exhausted?
-> Deploy #4521 reduced max_connections from 100 to 10.
Why 3: Why was that configuration change deployed?
-> An engineer was tuning connection settings for a staging environment
and accidentally included the production config file.
Why 4: Why did the production config get included?
-> The staging and production configs are in the same directory with
similar names (db-config-staging.yaml, db-config-prod.yaml).
Why 5: Why was there no safeguard?
-> No automated validation checks connection pool size against a minimum
   threshold before deploy.

Contributing factors
List factors that did not cause the incident but made it worse or slower to resolve:
- No deployment diff review required for config-only changes
- Connection pool metric was not on the checkout service dashboard
- Runbook for this alert did not mention checking recent deploys
4. Detection analysis
| Question | Answer |
|---|---|
| How was the incident detected? | Automated alert on error rate |
| Time from cause to detection | 7 minutes |
| Could we have detected it sooner? | Yes - a config validation check at deploy time would have caught it instantly |
| Were there earlier signals we missed? | Connection pool utilization was at 95% for 5 minutes before errors started, but no alert was configured for pool saturation |
5. Response analysis
| Question | Answer |
|---|---|
| Time from detection to ack | 2 minutes |
| Time from ack to mitigation start | 21 minutes |
| Time from mitigation start to resolution | 62 minutes |
| What went well in the response? | Fast ack, war room opened quickly, status page updated promptly |
| What could have been faster? | Correlating the deploy with the outage took 13 minutes; an automated deploy correlation tool would have flagged it immediately |
6. Impact assessment
| Dimension | Measurement |
|---|---|
| Duration | 85 minutes |
| Users affected | ~2,400 (30% of checkout traffic) |
| Revenue impact | Estimated $18,000 in delayed purchases (95% recovered within 2 hours) |
| SLO budget consumed | 12% of monthly error budget |
| Support tickets | 47 tickets opened |
| Data loss | None |
7. Action items
Every action item must have: owner, due date, priority, and definition of done.
| ID | Action item | Owner | Priority | Due date | Status |
|---|---|---|---|---|---|
| AI-1 | Add config validation to deploy pipeline: reject connection pool size < 20 | @charlie | P0 | 2026-03-24 | Open |
| AI-2 | Separate staging and production config directories | @alice | P1 | 2026-04-07 | Open |
| AI-3 | Add connection pool utilization alert (threshold: 80%) to checkout-api | @alice | P1 | 2026-03-28 | Open |
| AI-4 | Update checkout-api runbook to include "check recent deploys" as step 2 | @bob | P2 | 2026-03-21 | Open |
| AI-5 | Evaluate automated deploy-correlation tool for the incident dashboard | @dave | P2 | 2026-04-14 | Open |
8. Lessons learned
What went well
- Alert fired quickly (7 minutes from cause)
- War room was organized and focused
- Status page was updated within 8 minutes of the page
- Rollback was clean and effective
What did not go well
- Config change had no validation gate
- It took 13 minutes to identify the deploy as the cause
- The runbook did not mention checking recent deployments
Where we got lucky
- The config change was easily rollbackable. A schema migration with the same type of error would have been much harder to reverse.
Facilitation guide
Before the meeting
- Post-mortem owner drafts sections 1-3 before the meeting
- Share the draft with all participants 2 hours before the meeting
- Remind everyone: this is a blameless review. We discuss systems, not individuals.
During the meeting (60-90 minutes)
- (5 min) IC reads the summary and timeline aloud
- (20 min) Walk through the root cause analysis. Ask: "Is there anything missing?"
- (15 min) Review detection and response. Ask: "Where could we have been faster?"
- (20 min) Draft action items as a group. For each: agree on owner and priority.
- (10 min) Capture lessons learned. Ask: "What went well? What did not?"
- (5 min) Agree on review date for action items (usually 30 days)
After the meeting
- Owner finalizes the document within 24 hours
- Publish to the team's incident archive
- Enter all action items in the issue tracker with the agreed due dates
- Schedule a 30-day follow-up to verify action item completion
runbook-template.md
Runbook Template
Standard runbook structure
Every runbook follows the same six-section format. Consistency across runbooks means on-call engineers can find information in the same place every time, even for services they have never seen before.
Template
# [Alert Name] - [Service Name] Runbook
**Last updated:** [YYYY-MM-DD]
**Owner:** [Team or individual responsible for this runbook]
**Alert source:** [Monitoring tool and alert ID/link]
**Related services:** [Upstream and downstream dependencies]
---
## 1. SYMPTOM
[Quote the alert condition verbatim. Include the metric, threshold, and window.]
Example:
> Alert: checkout-api-error-rate
> Condition: HTTP 5xx rate > 1% for 5 minutes
> Dashboard: [link to dashboard]
---
## 2. IMPACT
**Who is affected:** [Customer segment or internal users]
**How they are affected:** [Cannot checkout, see errors, experience slowness]
**Severity:** [SEV1/SEV2/SEV3 - reference the severity classification table]
**Business impact:** [Revenue, data integrity, compliance, reputation]
---
## 3. INVESTIGATION STEPS
Follow these steps in order. At each step, the result tells you where to go next.
### Step 1: Check the dashboard
- Open: [dashboard link]
- Normal: Error rate < 0.1%, latency p99 < 300ms
- Abnormal: If error rate is elevated, proceed to Step 2
### Step 2: Check recent deployments
- Run: `kubectl rollout history deployment/checkout-api -n production`
- Or check: [deploy tool link]
- If a deploy happened in the last 30 minutes, this is likely the cause.
Go to Mitigation Step A (Rollback).
### Step 3: Check downstream dependencies
- Open: [dependency dashboard link]
- Check: database connection pool, payment gateway status, cache hit rate
- If a dependency is degraded, the issue is upstream. Escalate to that
service's on-call (see Escalation section).
### Step 4: Check resource utilization
- Run: `kubectl top pods -n production -l app=checkout-api`
- Normal: CPU < 70%, Memory < 80%
- If resources are exhausted, go to Mitigation Step B (Scale).
### Step 5: Check application logs
- Run: `kubectl logs -l app=checkout-api -n production --tail=200 | grep ERROR`
- Or query: [log aggregator link with pre-built query]
- Look for: stack traces, connection refused, timeout errors
- If you see a new error pattern, document it and escalate.
---
## 4. MITIGATION STEPS
### Step A: Rollback the last deployment
```bash
kubectl rollout undo deployment/checkout-api -n production
```
Monitor error rate for 10 minutes. If it returns to baseline, the deploy was the cause. Document and proceed to post-mortem.
### Step B: Scale the service
```bash
kubectl scale deployment/checkout-api -n production --replicas=10
```
Monitor for 5 minutes. If the error rate drops, the issue is capacity-related. Investigate the traffic spike source.
### Step C: Restart pods (last resort)
```bash
kubectl rollout restart deployment/checkout-api -n production
```
Use only if Steps A and B did not help and you suspect a memory leak or a stuck process. This causes brief service disruption during the rolling restart.
### Step D: Toggle feature flag (if applicable)
- Open: [feature flag tool link]
- Disable: [flag name] for production environment
- This removes the most recent feature change without a full rollback.
## 5. ESCALATION
If the above steps do not resolve the issue within 30 minutes, escalate:
| Priority | Contact | How to reach |
|---|---|---|
| First | [Service team on-call] | Page via [pager tool] |
| Second | [Team lead / engineering manager] | Page via [pager tool] |
| Third | [Dependent service on-call] | Page via [pager tool] - use for dependency issues |
| SEV1 | [Director / VP Engineering] | Phone: [number] |
## 6. CONTEXT
- Architecture doc: [link]
- Service dashboard: [link]
- Dependency map: [link]
- Past incidents:
- [INC-1234] - Similar error spike caused by config change (2026-01)
- [INC-1189] - Database failover caused checkout errors (2025-11)
- On-call schedule: [link]
- Deployment pipeline: [link]
---
## Runbook quality checklist
Before publishing a runbook, verify:
- [ ] Alert condition is quoted verbatim with metric name and threshold
- [ ] Impact section states who is affected in plain language
- [ ] Every investigation step has a "normal" and "abnormal" result
- [ ] Mitigation steps include actual commands or tool links, not just descriptions
- [ ] Escalation contacts are current (review quarterly)
- [ ] Context links are not broken
- [ ] A new team member can follow this without prior service knowledge
## Runbook maintenance
- **Review cadence:** Every runbook must be reviewed every 90 days
- **Update triggers:** Any incident where the runbook was incomplete or wrong
- **Ownership:** The team that owns the service owns the runbook
- **Testing:** During on-call onboarding, have new engineers walk through runbooks
for the top 5 most-paged alerts as a tabletop exercise
## Common investigation commands reference
```bash
# Kubernetes - check pod status
kubectl get pods -n production -l app=SERVICE_NAME
# Kubernetes - check recent events
kubectl get events -n production --sort-by='.lastTimestamp' | head -20
# Kubernetes - check resource usage
kubectl top pods -n production -l app=SERVICE_NAME
# Kubernetes - check rollout status
kubectl rollout status deployment/SERVICE_NAME -n production
# Kubernetes - view recent logs
kubectl logs -l app=SERVICE_NAME -n production --tail=100 --since=10m
# Database - check active connections (PostgreSQL)
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
# Database - check long-running queries (PostgreSQL)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND now() - pg_stat_activity.query_start > interval '30 seconds'
ORDER BY duration DESC;
```

status-page-guide.md
Status Page Guide
Status page structure
Components
Organize by user-facing service, not internal architecture. Customers do not care which microservice is down - they care what they cannot do.
Good component names:
- Checkout & Payments
- User Dashboard
- API (v2)
- Mobile App
- Webhooks & Notifications
- Data Exports
Bad component names (internal jargon):
- payment-gateway-service
- redis-cache-cluster
- kafka-consumer-group-3
Component statuses
| Status | Meaning | When to use |
|---|---|---|
| Operational | Everything working normally | Default state |
| Degraded Performance | Slower than normal but functional | Elevated latency, partial slowdown |
| Partial Outage | Some users or features affected | Errors for a subset of requests |
| Major Outage | Service is unavailable | Complete failure of a core function |
| Under Maintenance | Planned downtime | Scheduled maintenance windows |
Incident update templates
Investigating
Title: Elevated error rates on [Component]
Update (HH:MM UTC):
We are investigating reports of [symptom in plain language]. Some customers
may experience [specific impact - e.g., "errors when attempting to check out"
or "slower page load times on the dashboard"].
We will provide an update within 30 minutes.

Identified
Update (HH:MM UTC):
We have identified the cause of [symptom]. [One sentence about the cause in
plain language - e.g., "A configuration change is causing connection issues
with our payment processor."]
Our engineering team is working on a fix. We expect to have this resolved
within [estimated time if known, or "the next 1-2 hours"].
We will provide another update within 30 minutes.
Monitoring
Update (HH:MM UTC):
We have deployed a fix for [symptom]. Our systems are recovering and we are
monitoring to confirm the issue is fully resolved.
[If applicable: "Some customers may continue to see intermittent errors for
the next 10-15 minutes as the fix propagates."]
We will provide a final update once we confirm full recovery.
Resolved
Update (HH:MM UTC):
This incident has been resolved. [Component] is operating normally.
Summary: Between [start time] and [end time] UTC, [brief description of what
happened and who was affected]. [One sentence on what was done to fix it.]
[If applicable: "No customer action is required." or "If you experienced
[specific issue], please [specific action - e.g., retry your request, contact
support at support@example.com]."]
We will be conducting a thorough post-incident review to prevent recurrence.
We apologize for the disruption.Maintenance notification templates
Scheduled maintenance (advance notice)
Title: Scheduled maintenance for [Component]
We will be performing scheduled maintenance on [Component] on [date] from
[start time] to [end time] UTC ([convert to major customer timezones]).
During this window:
- [Specific impact - e.g., "The API will return 503 errors"]
- [What will still work - e.g., "The dashboard will remain accessible in
read-only mode"]
- [Estimated duration of actual downtime within the window]
[If applicable: "We recommend scheduling any critical operations before or
after this maintenance window."]
We will update this notice when maintenance begins and when it is complete.
Communication cadence
| Incident phase | Update frequency | Who writes |
|---|---|---|
| Investigating | Every 30 minutes | Communications Lead |
| Identified | Every 30 minutes | Communications Lead |
| Monitoring | Every 60 minutes | Communications Lead |
| Resolved | Once (final update) | Communications Lead + IC review |
Rules:
- Never go more than 30 minutes without an update during an active incident
- If there is no new information, say so: "We are continuing to investigate. No new information at this time. Next update in 30 minutes."
- All timestamps in UTC with local timezone equivalents for major customer regions
- The IC reviews the resolved update before publishing
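The cadence table lends itself to a simple "next update due" reminder so the Communications Lead never has to track it mentally. A sketch, using the frequencies from the table above:

```python
from datetime import datetime, timedelta, timezone

# Update cadence per incident phase, in minutes (from the cadence table)
CADENCE_MIN = {
    "investigating": 30,
    "identified": 30,
    "monitoring": 60,
}

def next_update_due(phase: str, last_update: datetime) -> datetime:
    """Return the latest time the next status page update is due."""
    return last_update + timedelta(minutes=CADENCE_MIN[phase])

last = datetime(2026, 3, 14, 9, 0, tzinfo=timezone.utc)
print(next_update_due("monitoring", last).strftime("%H:%M UTC"))  # 10:00 UTC
```

In practice this would feed a bot reminder in the incident channel, but the calculation is the same.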
Writing guidelines
Do
- State the customer impact first, then what you are doing
- Use plain language a non-technical person can understand
- Be honest about what you know and do not know
- Include specific times and durations
- Acknowledge the disruption: "We apologize for the inconvenience"
Do not
- Use internal service names, error codes, or technical jargon
- Say "no impact" if customers are reporting problems
- Blame third parties without confirmation ("our cloud provider caused...")
- Promise specific resolution times unless you are confident
- Use passive voice to hide accountability ("errors were experienced")
Tone
- Professional but human
- Direct and factual
- Empathetic without being overly apologetic
- Confident about what you know, transparent about what you do not
Status page tool setup checklist
When setting up a new status page:
- Define components based on user-facing services (not internal architecture)
- Set up subscriber notifications (email, SMS, webhook, RSS)
- Configure automated status updates from monitoring (operational/degraded)
- Pre-draft incident templates (copy from this guide)
- Assign ownership: who can publish updates (on-call + Communications Lead)
- Test the notification flow end-to-end before going live
- Add the status page link to your application's footer and support docs
- Configure maintenance window scheduling
- Set up uptime metrics display (90-day rolling window per component)
- Review and update component list quarterly as services evolve
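For the uptime metrics item above, the 90-day rolling figure is simply the window minus recorded downtime, expressed as a percentage. A minimal sketch, assuming downtime is tracked as minutes per component:

```python
def uptime_percent(downtime_minutes: float, window_days: int = 90) -> float:
    """Uptime over a rolling window, as a percentage."""
    total_minutes = window_days * 24 * 60   # 90 days = 129,600 minutes
    return round(100 * (1 - downtime_minutes / total_minutes), 3)

# ~43 minutes of downtime over 90 days is roughly "three nines"
print(uptime_percent(43))     # 99.967
print(uptime_percent(1296))   # 99.0
```

Displaying the calculation per component (rather than a single site-wide number) matches the user-facing component structure recommended earlier.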
war-room-checklist.md
War Room Checklist
Activation criteria
Open a war room when any of these conditions are met:
- SEV1 incident declared
- SEV2 not mitigated within 30 minutes
- Incident affects multiple services or teams
- Incident has customer-visible impact and no clear cause after 15 minutes
- Incident commander requests a war room for any reason
War room activation checklist (first 5 minutes)
The person who activates the war room (usually the on-call engineer or IC) runs through this checklist:
- Open a dedicated video call (use the team's standing war room link)
- Create or identify the incident channel (e.g., #inc-2026-03-14-checkout)
- Post the incident summary in the channel:
  INCIDENT ACTIVE
  Severity: [SEV1/SEV2]
  Summary: [One sentence - what is broken]
  Impact: [Who is affected and how]
  Started: [HH:MM UTC]
  IC: [@name]
  War room: [video call link]
- Assign roles (see Role Cards below)
- Confirm all role holders have joined the war room
- Scribe creates the incident timeline document
Role cards
Incident Commander (IC)
Primary job: Coordinate the response. You are NOT debugging.
Checklist:
- Confirm severity classification
- Assign all roles (Communications Lead, Technical Lead, Scribe)
- Run 15-minute checkpoints (see Checkpoint Script)
- Make escalation decisions
- Approve rollback or mitigation actions
- Decide when to declare resolution
- Assign post-mortem owner before closing the war room
Rules:
- Do not investigate the issue yourself. Delegate.
- If you are also the most qualified person to debug, hand off IC to someone else.
- If the war room exceeds 2 hours, rotate IC or bring in a fresh one.
Communications Lead
Primary job: Keep customers, stakeholders, and support informed.
Checklist:
- Post first status page update within 10 minutes of war room opening
- Update status page every 30 minutes (or sooner if there is new information)
- Notify internal stakeholders (support team, account managers for affected customers)
- Draft the resolution update for IC review before publishing
- Compile a list of customer reports or support tickets for the post-mortem
Rules:
- Use the templates from references/status-page-guide.md
- Every update must be reviewed by IC before publishing (exception: "still investigating" updates)
- Never share internal details, blame, or unconfirmed root causes externally
Technical Lead
Primary job: Drive the investigation and fix.
Checklist:
- Follow the relevant runbook (if one exists)
- Announce investigation steps before executing them
- Report findings to IC at each checkpoint
- Propose mitigation options with trade-offs
- Execute the approved fix
- Confirm metrics are recovering after the fix
Rules:
- Announce all production commands before running them
- If the runbook does not cover this scenario, say so immediately
- If you need help, tell the IC. Do not silently struggle.
Scribe
Primary job: Maintain a real-time timeline of the incident.
Checklist:
- Create the timeline document (use the team's incident template)
- Log every significant action with a UTC timestamp
- Log who did what (not just what happened)
- Log decisions and the reasoning behind them
- Log things that were tried but did not work
- At resolution, hand the timeline to the post-mortem owner
Rules:
- Capture facts, not interpretations
- If something is unclear, ask for clarification and log the answer
- The timeline is the primary input for the post-mortem - completeness matters
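A timeline entry only needs a timestamp, an actor, and a fact. Most teams use a shared doc or a chat bot for this, but the shape of the record is the same either way; a minimal in-memory sketch:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    when: datetime
    who: str
    what: str   # facts, not interpretations

@dataclass
class IncidentTimeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def log(self, who: str, what: str) -> None:
        """Append a UTC-timestamped entry recording who did what."""
        self.entries.append(
            TimelineEntry(datetime.now(timezone.utc), who, what))

    def render(self) -> str:
        return "\n".join(
            f"{e.when:%H:%M} UTC  {e.who}: {e.what}" for e in self.entries)

tl = IncidentTimeline()
tl.log("@sam", "Rolled back deploy; error rate unchanged")
tl.log("@ana", "Found connection pool exhaustion in the payment service")
print(tl.render())
```

Recording who as well as what (per the checklist above) is what lets the post-mortem reconstruct decision points, not just events.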
Checkpoint script (every 15 minutes)
The IC runs this script at each checkpoint. Read it aloud:
CHECKPOINT - [HH:MM UTC]
1. STATUS CHECK
"Technical Lead: What do we know now that we didn't know 15 minutes ago?"
2. NEXT STEPS
"What are we trying next? Who is doing it?"
3. ESCALATION CHECK
"Do we need to bring in anyone else? Any dependent teams?"
4. COMMUNICATIONS CHECK
"Communications Lead: Is the status page current? When is the next update due?"
5. TIMELINE CHECK
"Scribe: Are we capturing everything? Anything to add?"
6. SEVERITY CHECK
"Has the severity changed? Should we escalate or de-escalate?"
War room rules
Post these rules in the incident channel at the start of every war room:
WAR ROOM RULES
1. One conversation at a time. IC moderates.
2. Announce all production commands BEFORE running them.
3. No side investigations without telling the IC.
4. If you join late, read the timeline first. Do not ask "what happened?"
5. Mute when not speaking (video call).
6. Keep the channel for incident discussion only. Use threads for tangents.
7. If you do not have a role, observe silently unless asked.
War room closure checklist
When the IC declares the incident resolved:
- Confirm metrics have returned to baseline for at least 15 minutes
- Communications Lead posts the resolution update on the status page
- IC assigns a post-mortem owner and sets a deadline (within 48 hours)
- Scribe finalizes the timeline and shares it with the post-mortem owner
- IC posts a summary in the incident channel:
  INCIDENT RESOLVED
  Duration: [X hours Y minutes]
  Root cause: [One sentence]
  Fix applied: [One sentence]
  Post-mortem owner: [@name]
  Post-mortem deadline: [date]
- IC thanks everyone who participated
- Close the war room video call
- Archive the incident channel (do not delete - it is a historical record)
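"Metrics back to baseline for at least 15 minutes" can be a mechanical check rather than a judgment call. A sketch that treats recovery as every recent sample sitting inside a tolerance band around the baseline; the one-sample-per-minute assumption and the 10% tolerance are illustrative, not prescriptive:

```python
def is_recovered(samples: list[float], baseline: float,
                 tolerance: float = 0.10, required: int = 15) -> bool:
    """True if the last `required` samples (one per minute) are all
    within `tolerance` of the baseline value."""
    if len(samples) < required:
        return False
    recent = samples[-required:]
    return all(abs(s - baseline) <= tolerance * baseline for s in recent)

# Latency spiked, then settled near the 310ms baseline for 15 minutes
latency = [950, 700, 400] + [300 + i % 20 for i in range(15)]
print(is_recovered(latency, baseline=310))  # True
```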
War room anti-patterns
| Anti-pattern | Why it is harmful | What to do instead |
|---|---|---|
| IC also debugging | Coordination stops, chaos increases | IC delegates all investigation |
| No scribe | Post-mortem has no accurate timeline | Always assign a scribe, even for SEV2 |
| Side conversations | IC loses track of what is happening | Enforce "one conversation" rule |
| Heroic solo debugging | Others cannot help or learn; single point of failure | Announce all actions; pair on investigation |
| No checkpoints | Investigation drifts; people work on the wrong thing | IC runs checkpoint script every 15 minutes |
| War room stays open after resolution | Fatigue, wasted time | Close promptly once metrics are stable |
Frequently Asked Questions
What is incident-management?
Use this skill when managing production incidents, designing on-call rotations, writing runbooks, conducting post-mortems, setting up status pages, or running war rooms. Triggers on incident response, incident commander, on-call schedule, pager escalation, runbook authoring, post-incident review, blameless retro, status page updates, war room coordination, severity classification, and any task requiring structured incident lifecycle management.
How do I install incident-management?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill incident-management in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support incident-management?
incident-management works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.