incident-management
Use this skill when managing production incidents, designing on-call rotations, writing runbooks, conducting post-mortems, setting up status pages, or running war rooms. Triggers on incident response, incident commander, on-call schedule, pager escalation, runbook authoring, post-incident review, blameless retro, status page updates, war room coordination, severity classification, and any task requiring structured incident lifecycle management.
incident-management is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers managing production incidents, designing on-call rotations, writing runbooks, conducting post-mortems, setting up status pages, and running war rooms.
Quick Facts
| Field | Value |
|---|---|
| Category | operations |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill incident-management
- The incident-management skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
Incident management is the structured practice of detecting, responding to, resolving, and learning from production failures. It spans the full incident lifecycle - from the moment an alert fires through war room coordination, customer communication via status pages, and the post-mortem that prevents recurrence. This skill provides actionable frameworks for each phase: on-call rotation design, runbook authoring, severity classification, war room protocols, status page communication, and blameless post-mortems. Built for engineering teams that want to move from chaotic firefighting to repeatable, calm incident response.
Tags
incidents on-call runbooks post-mortems status-pages war-rooms
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is incident-management?
Use this skill when managing production incidents, designing on-call rotations, writing runbooks, conducting post-mortems, setting up status pages, or running war rooms. Triggers on incident response, incident commander, on-call schedule, pager escalation, runbook authoring, post-incident review, blameless retro, status page updates, war room coordination, severity classification, and any task requiring structured incident lifecycle management.
How do I install incident-management?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill incident-management in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support incident-management?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Incident Management
Incident management is the structured practice of detecting, responding to, resolving, and learning from production failures. It spans the full incident lifecycle - from the moment an alert fires through war room coordination, customer communication via status pages, and the post-mortem that prevents recurrence. This skill provides actionable frameworks for each phase: on-call rotation design, runbook authoring, severity classification, war room protocols, status page communication, and blameless post-mortems. Built for engineering teams that want to move from chaotic firefighting to repeatable, calm incident response.
When to use this skill
Trigger this skill when the user:
- Needs to design or improve an on-call rotation or escalation policy
- Wants to write, review, or templatize a runbook for an alert or service
- Is conducting, writing, or facilitating a post-mortem / post-incident review
- Needs to set up or improve a status page and customer communication strategy
- Is running or setting up a war room for an active incident
- Wants to define severity levels or incident classification criteria
- Needs an incident commander playbook or role definitions
- Is building incident response tooling or automation
Do NOT trigger this skill for:
- Defining SLOs, SLIs, or error budgets without an incident context (use site-reliability skill)
- Infrastructure provisioning or deployment pipeline design (use CI/CD or cloud skills)
Key principles
Incidents are system failures, not people failures - Every incident reflects a gap in the system: missing automation, insufficient monitoring, unclear runbooks, or architectural fragility. Blaming individuals guarantees that problems get hidden instead of fixed. Design every process around surfacing systemic issues.
Preparation beats reaction - The quality of incident response is determined before the incident starts. Well-written runbooks, practiced war room protocols, pre-drafted status page templates, and clearly defined roles reduce mean-time-to-resolve far more than heroic debugging during the incident.
Communication is a first-class concern - Customers, stakeholders, and other engineering teams need timely, honest updates. A status page update every 30 minutes during an outage builds trust. Silence destroys it. Assign a dedicated communications role in every major incident.
Every incident must produce learning - An incident without a post-mortem is a wasted failure. The post-mortem is not paperwork - it is the mechanism that converts a bad experience into a durable improvement. Action items without owners and deadlines are wishes, not commitments.
On-call must be sustainable - Unsustainable on-call leads to burnout, attrition, and slower incident response. Track on-call load metrics, enforce rest periods, and treat excessive paging as a reliability problem to fix, not a cost of doing business.
Core concepts
Incident lifecycle
Detection -> Triage -> Response -> Resolution -> Post-mortem -> Prevention

- Detection: alerts fire
- Triage: severity assigned
- Response: war room stands up
- Resolution: fix/rollback deployed
- Post-mortem: review and learn
- Prevention: action items tracked

Every phase has a defined owner, a set of artifacts, and a handoff to the next phase. Gaps between phases - especially between resolution and post-mortem - are where learning gets lost.
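The lifecycle above can be sketched as an ordered state machine. This is an illustrative Python sketch (not part of the skill itself); it rejects phase-skipping, which is exactly the resolution-to-post-mortem gap where learning gets lost.

```python
# Illustrative sketch only: the incident lifecycle as an ordered state machine.
PHASES = ["detection", "triage", "response", "resolution", "post-mortem", "prevention"]

class Incident:
    def __init__(self):
        self.phase = PHASES[0]

    def advance(self, next_phase):
        # Only the immediate next phase is legal; jumping from resolution
        # straight to prevention (closing without a post-mortem) is rejected.
        if self.phase == PHASES[-1]:
            raise ValueError("incident lifecycle already complete")
        expected = PHASES[PHASES.index(self.phase) + 1]
        if next_phase != expected:
            raise ValueError(f"cannot move {self.phase} -> {next_phase}; next is {expected}")
        self.phase = next_phase
```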
Incident roles
| Role | Responsibility | When assigned |
|---|---|---|
| Incident Commander (IC) | Owns the response, delegates work, makes decisions | SEV1/SEV2 immediately |
| Communications Lead | Updates status page, stakeholders, and support teams | SEV1/SEV2 immediately |
| Technical Lead | Drives root cause investigation and fix implementation | All severities |
| Scribe | Maintains the incident timeline in real-time | SEV1; optional for SEV2 |
Role assignment rule: For SEV1, all four roles must be filled within 15 minutes. For SEV2, IC and Technical Lead are mandatory. For SEV3+, the on-call engineer handles all roles.
Severity classification
| Severity | Customer impact | Response time | War room | Status page |
|---|---|---|---|---|
| SEV1 | Complete outage or data loss | Page immediately, 5-min ack | Required | Required |
| SEV2 | Degraded core functionality | Page on-call, 15-min ack | Recommended | Required |
| SEV3 | Minor degradation, workaround exists | Next business day | No | Optional |
| SEV4 | Cosmetic or internal-only | Backlog | No | No |
Escalation rule: If a SEV2 is not mitigated within 60 minutes, escalate to SEV1 procedures. If the on-call engineer cannot classify severity within 10 minutes, default to SEV2 until more information is available.
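The two classification rules above can be encoded directly. This is a hedged sketch; the function and field names are illustrative, not part of any tool.

```python
# Sketch of the classification defaults: unknown severity becomes SEV2 after
# 10 minutes, and an unmitigated SEV2 escalates to SEV1 after 60 minutes.
def effective_severity(declared, minutes_since_alert, minutes_at_sev2=0, mitigated=False):
    if declared is None:
        # Rule: if severity cannot be classified within 10 minutes, default to SEV2.
        return "SEV2" if minutes_since_alert >= 10 else "unclassified"
    if declared == "SEV2" and not mitigated and minutes_at_sev2 >= 60:
        # Rule: a SEV2 not mitigated within 60 minutes follows SEV1 procedures.
        return "SEV1"
    return declared
```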
Common tasks
Design an on-call rotation
Rotation structure:
- Primary on-call: First responder. Acks within 5 min (SEV1) or 15 min (SEV2).
- Secondary on-call: Backup if the primary misses the ack window. Auto-escalated by the pager.
- Manager escalation: If both primary and secondary miss the ack window; also joins SEV1 war rooms.

Scheduling guidelines:
- Rotate weekly. Never assign the same person two consecutive weeks without a gap.
- Minimum team size for sustainable on-call: 5 engineers (allows 1-in-5 rotation).
- Follow-the-sun for distributed teams: hand off to the next timezone instead of paging at 3am. Each region covers business hours + 2 hours buffer.
- Provide comp time or additional pay for after-hours pages. Track and review quarterly.
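The weekly-rotation guideline above can be sketched as a simple round-robin generator. This is illustrative only; engineer names are placeholders. With the recommended minimum of 5 engineers, round-robin assignment guarantees no one serves two consecutive weeks.

```python
from itertools import cycle

# Sketch: assign one primary on-call per week, round-robin over the team.
def weekly_rotation(engineers, weeks):
    if len(engineers) < 2:
        raise ValueError("a rotation needs at least 2 engineers")
    return [name for name, _ in zip(cycle(engineers), range(weeks))]
```

A secondary rotation can reuse the same function with the list rotated by one position, so the backup is never the same person as the primary.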
On-call health metrics:
| Metric | Healthy | Unhealthy |
|---|---|---|
| Pages per on-call week | < 5 | > 10 |
| After-hours pages per week | < 2 | > 5 |
| Mean time-to-ack (SEV1) | < 5 min | > 15 min |
| Mean time-to-ack (SEV2) | < 15 min | > 30 min |
| Percentage of pages with runbooks | > 80% | < 50% |
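The thresholds in the table can be applied mechanically when reviewing on-call health. This sketch covers the lower-is-better metrics; values between the healthy and unhealthy bounds report as "watch". (The runbook-coverage metric is higher-is-better and omitted here.) Metric names are illustrative.

```python
# Sketch: (healthy_below, unhealthy_above) bounds copied from the table above.
THRESHOLDS = {
    "pages_per_week": (5, 10),
    "after_hours_pages_per_week": (2, 5),
    "mtta_sev1_minutes": (5, 15),
    "mtta_sev2_minutes": (15, 30),
}

def rate(metric, value):
    healthy_below, unhealthy_above = THRESHOLDS[metric]
    if value < healthy_below:
        return "healthy"
    if value > unhealthy_above:
        return "unhealthy"
    return "watch"  # in between: trending toward a problem
```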
Write a runbook
Every runbook must contain these sections:
Title: [Alert name] - [Service name] Runbook
Last updated: [date]
Owner: [team or individual]
1. SYMPTOM
What the alert tells you. Quote the alert condition verbatim.
2. IMPACT
Who is affected. Severity level. Business impact in plain language.
3. INVESTIGATION STEPS
Numbered steps. Each step has:
- What to check (command, dashboard link, or query)
- What a normal result looks like
- What an abnormal result means and what to do next
4. MITIGATION STEPS
Numbered steps to stop the bleeding. Prioritize speed over elegance.
Include rollback commands, feature flag toggles, and traffic shift procedures.
5. ESCALATION
Who to contact if steps 3-4 do not resolve the issue within [N] minutes.
Include name, team, and pager handle.
6. CONTEXT
Links to: service architecture doc, relevant dashboards, past incidents,
and the service's on-call schedule.

Runbook quality test: A new team member who has never seen this service should be able to follow the runbook and either resolve the issue or escalate correctly within 30 minutes.
Conduct a post-mortem
When to hold one: Every SEV1. Every SEV2 with customer impact. Any incident consuming more than 4 hours of engineering time. Recurring SEV3s from the same cause.
Timeline:
Hour 0: Incident resolved. IC assigns post-mortem owner.
Day 1: Owner drafts timeline and initial analysis.
Day 2-3: Facilitated post-mortem meeting (60-90 minutes).
Day 3-4: Draft published for 24-hour review period.
Day 5: Final version published. Action items entered in tracker.
Day 30: Action item review - are they done?

The five post-mortem questions:
- What happened? (factual timeline with timestamps)
- Why did it happen? (root cause analysis - use the "five whys" technique)
- Why was it not detected sooner? (monitoring and alerting gap)
- What slowed down the response? (process and tooling gap)
- What prevents recurrence? (action items)
Action item rules: Every action item must have an owner, a due date, a priority (P0/P1/P2), and a measurable definition of done. "Improve monitoring" is not an action item. "Add latency p99 alert for checkout-api with a 500ms threshold, owned by @alice, due 2026-04-01" is.
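The action-item rules above lend themselves to a lint check before a post-mortem is finalized. This is a hedged sketch; the field names are illustrative, not a real tracker schema.

```python
import re

# Sketch: every action item needs an owner, a due date, a priority, and a
# measurable definition of done.
REQUIRED_FIELDS = ("owner", "due_date", "priority", "definition_of_done")

def action_item_problems(item):
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not item.get(field)]
    if item.get("priority") and item["priority"] not in ("P0", "P1", "P2"):
        problems.append("priority must be P0, P1, or P2")
    if item.get("due_date") and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", item["due_date"]):
        problems.append("due_date must be YYYY-MM-DD")
    return problems  # empty list means the item passes
```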
See references/postmortem-template.md for the full template.
Set up a status page
Page structure:
Components:
- Group by user-facing service (API, Dashboard, Mobile App, Webhooks)
- Each component has a status: Operational | Degraded | Partial Outage | Major Outage
- Show uptime percentage over 90 days per component
Incidents:
- Title: clear, customer-facing description (not internal jargon)
- Updates: timestamped entries showing investigation progress
- Resolution: what was fixed and what customers need to do (if anything)
Maintenance:
- Scheduled windows with start/end times in customer's timezone
- Description of impact during the window

Communication cadence during incidents:
| Phase | Update frequency | Content |
|---|---|---|
| Investigating | Every 30 min | "We are aware and investigating" + symptoms |
| Identified | Every 30 min | Root cause identified, ETA if known |
| Monitoring | Every 60 min | Fix deployed, monitoring for stability |
| Resolved | Once | Summary of what happened and what was fixed |
Writing rules for status updates:
- Use plain language. No internal service names, error codes, or jargon.
- State the customer impact first, then what you are doing about it.
- Never say "no impact" if customers reported problems.
- Include timezone in all timestamps.
Run a war room
War room activation criteria: Any SEV1. Any SEV2 not mitigated within 30 minutes. Any incident affecting multiple services or teams.
War room protocol:
Minute 0-5: IC opens the war room (video call + shared channel).
IC states: incident summary, current severity, affected services.
IC assigns roles: Communications Lead, Technical Lead, Scribe.
Minute 5-15: Technical Lead drives initial investigation.
Scribe starts the timeline document.
Communications Lead posts first status page update.
Every 15 min: IC runs a checkpoint:
- "What do we know now?"
- "What are we trying next?"
- "Do we need to escalate or bring in more people?"
- "Is the status page current?"
Resolution: IC confirms the fix is deployed and metrics are recovering.
Communications Lead posts resolution update.
IC schedules the post-mortem and assigns an owner.
War room closed.

War room rules:
- One conversation at a time. IC moderates.
- No side investigations without telling the IC.
- All commands run against production are announced before execution.
- The scribe logs every significant action with a timestamp.
- If the war room exceeds 2 hours, IC rotates or brings a fresh IC.
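The 15-minute checkpoint cadence and the 2-hour IC rotation rule above can be pre-computed when a war room opens. This is an illustrative sketch; times and labels are assumptions, not a real tool's output.

```python
from datetime import datetime, timedelta

# Sketch: emit IC checkpoint times every 15 minutes, flagging the 2-hour mark
# where the IC should rotate out.
def checkpoint_schedule(start, duration_minutes):
    events = []
    for minute in range(15, duration_minutes + 1, 15):
        when = start + timedelta(minutes=minute)
        note = " - rotate or bring in a fresh IC" if minute == 120 else ""
        events.append(when.strftime("%H:%M") + " IC checkpoint" + note)
    return events
```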
Build an escalation policy
Escalation ladder:
Level 0: Automated response (auto-restart, auto-scale, circuit breaker)
Level 1: On-call engineer (primary)
Level 2: On-call engineer (secondary) + team lead
Level 3: Engineering manager + dependent service on-calls
Level 4: Director/VP + incident commander (SEV1 only)

Escalation triggers:
| Trigger | Action |
|---|---|
| Primary on-call does not ack within 5 min (SEV1) | Auto-page secondary |
| No mitigation progress after 30 min | Escalate one level |
| Customer-reported incident (not alert-detected) | Escalate one level immediately |
| Incident spans multiple services | Page all affected service on-calls |
| Data loss suspected | Immediate SEV1, escalate to Level 4 |
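The trigger table maps cleanly onto a set of checks over observed incident state, which is useful when wiring these rules into pager automation. This is a hedged sketch; the state key names are illustrative.

```python
# Sketch: evaluate the escalation triggers from the table against the
# current incident state and return the prescribed actions.
def escalation_actions(state):
    actions = []
    if state.get("severity") == "SEV1" and state.get("minutes_unacked", 0) >= 5:
        actions.append("auto-page secondary")
    if state.get("minutes_without_mitigation_progress", 0) >= 30:
        actions.append("escalate one level")
    if state.get("customer_reported"):
        actions.append("escalate one level immediately")
    if state.get("services_affected", 1) > 1:
        actions.append("page all affected service on-calls")
    if state.get("data_loss_suspected"):
        actions.append("declare SEV1 and escalate to Level 4")
    return actions
```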
Anti-patterns / common mistakes
| Mistake | Why it is wrong | What to do instead |
|---|---|---|
| No runbooks for alerts | Every page becomes an investigation from scratch; MTTR skyrockets | Treat "alert without runbook" as a blocking issue; write the runbook during the incident |
| Blameful post-mortems | Engineers hide mistakes, avoid risk, and stop reporting near-misses | Use a blameless template; explicitly ban naming individuals as root causes |
| Status page updates only at resolution | Customers assume you do not know or do not care; support tickets flood in | Update every 30 minutes minimum; assign a dedicated Communications Lead |
| On-call without compensation or rotation limits | Burnout, attrition, and degraded response quality | Cap rotations, provide comp time, track health metrics quarterly |
| War rooms without an Incident Commander | Multiple people investigate the same thing, no one communicates, chaos | Always assign an IC first; the IC's job is coordination, not debugging |
| Post-mortem action items with no owner or deadline | Items rot in a document; the same incident repeats | Every action item needs: owner, due date, priority, and definition of done |
Gotchas
Severity escalation delays compound MTTR - The most common cause of a 2-hour incident that should have taken 30 minutes is a 45-minute delay in escalating from SEV3 to SEV2. The escalation rule "if no mitigation progress after 30 minutes, escalate one level" is not optional - build it into your pager escalation policy as an automatic trigger, not a judgment call.
Post-mortem action items decay without a 30-day review - Action items written in the heat of post-mortem often get deprioritized as new features take over the sprint. Without a mandatory 30-day follow-up meeting with the IC and action item owners, the same incident repeats within 6 months. Treat action item review as a blocking ceremony, not a nice-to-have.
Status page updates that use internal jargon erode customer trust - Saying "the Kafka consumer group is lagging due to a partition rebalance" confuses customers and implies you don't know how to communicate. Customers need to know the symptom they're experiencing, whether you're aware, and when you expect resolution. Translate everything to user impact before posting.
War rooms without a single Incident Commander devolve into chaos - When multiple senior engineers simultaneously investigate, propose fixes, and run commands against production without coordination, changes step on each other and the true root cause gets masked by noise. The IC role is not debugging - it is traffic control. Assign an IC before anyone runs a single query.
Runbooks that haven't been tested under stress are not runbooks - A runbook that works when you write it (calm, familiar with the system, full context) may be unusable at 3am by a tired on-call engineer seeing the service for the first time. Run fire drills where engineers who didn't write the runbook follow it end-to-end. Gaps in instructions surface immediately.
References
For detailed guidance on specific incident management domains, load the relevant
file from references/:
- references/postmortem-template.md - full blameless post-mortem template with example entries, facilitation guide, and action item tracker format
- references/runbook-template.md - detailed runbook template with example investigation steps and mitigation procedures
- references/status-page-guide.md - status page setup guide with communication templates and incident update examples
- references/war-room-checklist.md - war room activation checklist, role cards, and checkpoint script
Only load a references file when the current task requires it.
References
postmortem-template.md
Post-mortem Template
Document header
Title: [SEV level] [Brief description of the incident]
Date: [Date of incident]
Duration: [Start time - End time, including timezone]
Authors: [Post-mortem owner]
Status: Draft | In Review | Final
Severity: SEV1 | SEV2 | SEV3
Services affected: [List of affected services]
Customer impact: [Brief description of user-facing impact]

1. Summary
Write 3-5 sentences covering: what broke, who was affected, how long it lasted, and how it was resolved. This should be readable by a non-engineer.
Example:
On 2026-03-10 between 14:22 and 15:47 UTC, the checkout service returned 500 errors for approximately 30% of payment requests. An estimated 2,400 customers were unable to complete purchases during this window. The root cause was a connection pool exhaustion triggered by a configuration change deployed at 14:15 UTC. The incident was resolved by rolling back the configuration change and increasing the connection pool size.
2. Timeline
Use UTC timestamps. Include both automated events (alerts, deploys) and human actions (who did what).
14:15 UTC - Deploy #4521 pushed to production (config change to DB pool settings)
14:22 UTC - Checkout-api error rate alert fires (threshold: 1%, observed: 8%)
14:24 UTC - On-call engineer @alice acks the page
14:27 UTC - @alice opens war room, assigns IC role to @bob
14:30 UTC - Status page updated: "Investigating increased errors on checkout"
14:35 UTC - @alice identifies connection pool exhaustion in service metrics
14:40 UTC - @alice correlates with deploy #4521 timeline
14:45 UTC - Decision: rollback deploy #4521
14:48 UTC - Rollback initiated
14:55 UTC - Rollback complete. Error rate dropping.
15:00 UTC - Status page updated: "Fix deployed, monitoring"
15:30 UTC - Error rate back to baseline (0.05%)
15:47 UTC - IC @bob declares incident resolved
15:47 UTC - Status page updated: "Resolved"

3. Root cause analysis
What happened
Describe the technical chain of events. Be specific about the failure mode.
Five whys
Why 1: Why did checkout fail?
-> Connection pool was exhausted; new requests could not get a DB connection.
Why 2: Why was the connection pool exhausted?
-> Deploy #4521 reduced max_connections from 100 to 10.
Why 3: Why was that configuration change deployed?
-> An engineer was tuning connection settings for a staging environment
and accidentally included the production config file.
Why 4: Why did the production config get included?
-> The staging and production configs are in the same directory with
similar names (db-config-staging.yaml, db-config-prod.yaml).
Why 5: Why was there no safeguard?
-> No automated validation checks connection pool size against a minimum
   threshold before deploy.

Contributing factors
List factors that did not cause the incident but made it worse or slower to resolve:
- No deployment diff review required for config-only changes
- Connection pool metric was not on the checkout service dashboard
- Runbook for this alert did not mention checking recent deploys
4. Detection analysis
| Question | Answer |
|---|---|
| How was the incident detected? | Automated alert on error rate |
| Time from cause to detection | 7 minutes |
| Could we have detected it sooner? | Yes - a config validation check at deploy time would have caught it instantly |
| Were there earlier signals we missed? | Connection pool utilization was at 95% for 5 minutes before errors started, but no alert was configured for pool saturation |
5. Response analysis
| Question | Answer |
|---|---|
| Time from detection to ack | 2 minutes |
| Time from ack to mitigation start | 21 minutes |
| Time from mitigation start to resolution | 62 minutes |
| What went well in the response? | Fast ack, war room opened quickly, status page updated promptly |
| What could have been faster? | Correlating the deploy with the outage took 13 minutes; an automated deploy correlation tool would have flagged it immediately |
6. Impact assessment
| Dimension | Measurement |
|---|---|
| Duration | 85 minutes |
| Users affected | ~2,400 (30% of checkout traffic) |
| Revenue impact | Estimated $18,000 in delayed purchases (95% recovered within 2 hours) |
| SLO budget consumed | 12% of monthly error budget |
| Support tickets | 47 tickets opened |
| Data loss | None |
7. Action items
Every action item must have: owner, due date, priority, and definition of done.
| ID | Action item | Owner | Priority | Due date | Status |
|---|---|---|---|---|---|
| AI-1 | Add config validation to deploy pipeline: reject connection pool size < 20 | @charlie | P0 | 2026-03-24 | Open |
| AI-2 | Separate staging and production config directories | @alice | P1 | 2026-04-07 | Open |
| AI-3 | Add connection pool utilization alert (threshold: 80%) to checkout-api | @alice | P1 | 2026-03-28 | Open |
| AI-4 | Update checkout-api runbook to include "check recent deploys" as step 2 | @bob | P2 | 2026-03-21 | Open |
| AI-5 | Evaluate automated deploy-correlation tool for the incident dashboard | @dave | P2 | 2026-04-14 | Open |
8. Lessons learned
What went well
- Alert fired quickly (7 minutes from cause)
- War room was organized and focused
- Status page was updated within 8 minutes of the page
- Rollback was clean and effective
What did not go well
- Config change had no validation gate
- It took 13 minutes to identify the deploy as the cause
- The runbook did not mention checking recent deployments
Where we got lucky
- The config change was easily rollbackable. A schema migration with the same type of error would have been much harder to reverse.
Facilitation guide
Before the meeting
- Post-mortem owner drafts sections 1-3 before the meeting
- Share the draft with all participants 2 hours before the meeting
- Remind everyone: this is a blameless review. We discuss systems, not individuals.
During the meeting (60-90 minutes)
- (5 min) IC reads the summary and timeline aloud
- (20 min) Walk through the root cause analysis. Ask: "Is there anything missing?"
- (15 min) Review detection and response. Ask: "Where could we have been faster?"
- (20 min) Draft action items as a group. For each: agree on owner and priority.
- (10 min) Capture lessons learned. Ask: "What went well? What did not?"
- (5 min) Agree on review date for action items (usually 30 days)
After the meeting
- Owner finalizes the document within 24 hours
- Publish to the team's incident archive
- Enter all action items in the issue tracker with the agreed due dates
- Schedule a 30-day follow-up to verify action item completion
runbook-template.md
Runbook Template
Standard runbook structure
Every runbook follows the same six-section format. Consistency across runbooks means on-call engineers can find information in the same place every time, even for services they have never seen before.
Template
# [Alert Name] - [Service Name] Runbook
**Last updated:** [YYYY-MM-DD]
**Owner:** [Team or individual responsible for this runbook]
**Alert source:** [Monitoring tool and alert ID/link]
**Related services:** [Upstream and downstream dependencies]
---
## 1. SYMPTOM
[Quote the alert condition verbatim. Include the metric, threshold, and window.]
Example:
> Alert: checkout-api-error-rate
> Condition: HTTP 5xx rate > 1% for 5 minutes
> Dashboard: [link to dashboard]
---
## 2. IMPACT
**Who is affected:** [Customer segment or internal users]
**How they are affected:** [Cannot checkout, see errors, experience slowness]
**Severity:** [SEV1/SEV2/SEV3 - reference the severity classification table]
**Business impact:** [Revenue, data integrity, compliance, reputation]
---
## 3. INVESTIGATION STEPS
Follow these steps in order. At each step, the result tells you where to go next.
### Step 1: Check the dashboard
- Open: [dashboard link]
- Normal: Error rate < 0.1%, latency p99 < 300ms
- Abnormal: If error rate is elevated, proceed to Step 2
### Step 2: Check recent deployments
- Run: `kubectl rollout history deployment/checkout-api -n production`
- Or check: [deploy tool link]
- If a deploy happened in the last 30 minutes, this is likely the cause.
Go to Mitigation Step A (Rollback).
### Step 3: Check downstream dependencies
- Open: [dependency dashboard link]
- Check: database connection pool, payment gateway status, cache hit rate
- If a dependency is degraded, the issue is upstream. Escalate to that
service's on-call (see Escalation section).
### Step 4: Check resource utilization
- Run: `kubectl top pods -n production -l app=checkout-api`
- Normal: CPU < 70%, Memory < 80%
- If resources are exhausted, go to Mitigation Step B (Scale).
### Step 5: Check application logs
- Run: `kubectl logs -l app=checkout-api -n production --tail=200 | grep ERROR`
- Or query: [log aggregator link with pre-built query]
- Look for: stack traces, connection refused, timeout errors
- If you see a new error pattern, document it and escalate.
---
## 4. MITIGATION STEPS
### Step A: Rollback the last deployment
```bash
kubectl rollout undo deployment/checkout-api -n production
```
Monitor error rate for 10 minutes. If it returns to baseline, the deploy was the cause. Document and proceed to post-mortem.
### Step B: Scale the service
```bash
kubectl scale deployment/checkout-api -n production --replicas=10
```
Monitor for 5 minutes. If the error rate drops, the issue is capacity-related. Investigate the traffic spike source.
### Step C: Restart pods (last resort)
```bash
kubectl rollout restart deployment/checkout-api -n production
```
Use only if Steps A and B did not help and you suspect a memory leak or a stuck process. This causes brief service disruption during the rolling restart.
### Step D: Toggle feature flag (if applicable)
- Open: [feature flag tool link]
- Disable: [flag name] for production environment
- This removes the most recent feature change without a full rollback.
## 5. ESCALATION
If the above steps do not resolve the issue within 30 minutes, escalate:
| Priority | Contact | How to reach |
|---|---|---|
| First | [Service team on-call] | Page via [pager tool] |
| Second | [Team lead / engineering manager] | Page via [pager tool] |
| Third | [Dependent service on-call] | Page via [pager tool] - use for dependency issues |
| SEV1 | [Director / VP Engineering] | Phone: [number] |
## 6. CONTEXT
- Architecture doc: [link]
- Service dashboard: [link]
- Dependency map: [link]
- Past incidents:
- [INC-1234] - Similar error spike caused by config change (2026-01)
- [INC-1189] - Database failover caused checkout errors (2025-11)
- On-call schedule: [link]
- Deployment pipeline: [link]
---
## Runbook quality checklist
Before publishing a runbook, verify:
- [ ] Alert condition is quoted verbatim with metric name and threshold
- [ ] Impact section states who is affected in plain language
- [ ] Every investigation step has a "normal" and "abnormal" result
- [ ] Mitigation steps include actual commands or tool links, not just descriptions
- [ ] Escalation contacts are current (review quarterly)
- [ ] Context links are not broken
- [ ] A new team member can follow this without prior service knowledge
## Runbook maintenance
- **Review cadence:** Every runbook must be reviewed every 90 days
- **Update triggers:** Any incident where the runbook was incomplete or wrong
- **Ownership:** The team that owns the service owns the runbook
- **Testing:** During on-call onboarding, have new engineers walk through runbooks
for the top 5 most-paged alerts as a tabletop exercise
## Common investigation commands reference
```bash
# Kubernetes - check pod status
kubectl get pods -n production -l app=SERVICE_NAME
# Kubernetes - check recent events
kubectl get events -n production --sort-by='.lastTimestamp' | head -20
# Kubernetes - check resource usage
kubectl top pods -n production -l app=SERVICE_NAME
# Kubernetes - check rollout status
kubectl rollout status deployment/SERVICE_NAME -n production
# Kubernetes - view recent logs
kubectl logs -l app=SERVICE_NAME -n production --tail=100 --since=10m
# Database - check active connections (PostgreSQL)
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
# Database - check long-running queries (PostgreSQL)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND now() - pg_stat_activity.query_start > interval '30 seconds'
ORDER BY duration DESC;
```

status-page-guide.md
Status Page Guide
Status page structure
Components
Organize by user-facing service, not internal architecture. Customers do not care which microservice is down - they care what they cannot do.
Good component names:
- Checkout & Payments
- User Dashboard
- API (v2)
- Mobile App
- Webhooks & Notifications
- Data Exports
Bad component names (internal jargon):
- payment-gateway-service
- redis-cache-cluster
- kafka-consumer-group-3
Component statuses
| Status | Meaning | When to use |
|---|---|---|
| Operational | Everything working normally | Default state |
| Degraded Performance | Slower than normal but functional | Elevated latency, partial slowdown |
| Partial Outage | Some users or features affected | Errors for a subset of requests |
| Major Outage | Service is unavailable | Complete failure of a core function |
| Under Maintenance | Planned downtime | Scheduled maintenance windows |
Incident update templates
Investigating
Title: Elevated error rates on [Component]
Update (HH:MM UTC):
We are investigating reports of [symptom in plain language]. Some customers
may experience [specific impact - e.g., "errors when attempting to check out"
or "slower page load times on the dashboard"].
We will provide an update within 30 minutes.

Identified
Update (HH:MM UTC):
We have identified the cause of [symptom]. [One sentence about the cause in
plain language - e.g., "A configuration change is causing connection issues
with our payment processor."]
Our engineering team is working on a fix. We expect to have this resolved
within [estimated time if known, or "the next 1-2 hours"].
We will provide another update within 30 minutes.
Monitoring
Update (HH:MM UTC):
We have deployed a fix for [symptom]. Our systems are recovering and we are
monitoring to confirm the issue is fully resolved.
[If applicable: "Some customers may continue to see intermittent errors for
the next 10-15 minutes as the fix propagates."]
We will provide a final update once we confirm full recovery.
Resolved
Update (HH:MM UTC):
This incident has been resolved. [Component] is operating normally.
Summary: Between [start time] and [end time] UTC, [brief description of what
happened and who was affected]. [One sentence on what was done to fix it.]
[If applicable: "No customer action is required." or "If you experienced
[specific issue], please [specific action - e.g., retry your request, contact
support at support@example.com]."]
We will be conducting a thorough post-incident review to prevent recurrence.
We apologize for the disruption.Maintenance notification templates
Scheduled maintenance (advance notice)
Title: Scheduled maintenance for [Component]
We will be performing scheduled maintenance on [Component] on [date] from
[start time] to [end time] UTC ([convert to major customer timezones]).
During this window:
- [Specific impact - e.g., "The API will return 503 errors"]
- [What will still work - e.g., "The dashboard will remain accessible in
read-only mode"]
- [Estimated duration of actual downtime within the window]
[If applicable: "We recommend scheduling any critical operations before or
after this maintenance window."]
We will update this notice when maintenance begins and when it is complete.
Communication cadence
| Incident phase | Update frequency | Who writes |
|---|---|---|
| Investigating | Every 30 minutes | Communications Lead |
| Identified | Every 30 minutes | Communications Lead |
| Monitoring | Every 60 minutes | Communications Lead |
| Resolved | Once (final update) | Communications Lead + IC review |
Rules:
- Never go more than 30 minutes without an update during an active incident
- If there is no new information, say so: "We are continuing to investigate. No new information at this time. Next update in 30 minutes."
- All timestamps in UTC with local timezone equivalents for major customer regions
- The IC reviews the resolved update before publishing
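The cadence table lends itself to a simple "next update due" reminder so the Communications Lead never has to track it mentally. A sketch, using the frequencies from the table above:

```python
from datetime import datetime, timedelta, timezone

# Update cadence per incident phase, in minutes (from the cadence table)
CADENCE_MIN = {
    "investigating": 30,
    "identified": 30,
    "monitoring": 60,
}

def next_update_due(phase: str, last_update: datetime) -> datetime:
    """Return the latest time the next status page update is due."""
    return last_update + timedelta(minutes=CADENCE_MIN[phase])

last = datetime(2026, 3, 14, 9, 0, tzinfo=timezone.utc)
print(next_update_due("monitoring", last).strftime("%H:%M UTC"))  # 10:00 UTC
```

In practice this would feed a bot reminder in the incident channel, but the calculation is the same.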
Writing guidelines
Do
- State the customer impact first, then what you are doing
- Use plain language a non-technical person can understand
- Be honest about what you know and do not know
- Include specific times and durations
- Acknowledge the disruption: "We apologize for the inconvenience"
Do not
- Use internal service names, error codes, or technical jargon
- Say "no impact" if customers are reporting problems
- Blame third parties without confirmation ("our cloud provider caused...")
- Promise specific resolution times unless you are confident
- Use passive voice to hide accountability ("errors were experienced")
Tone
- Professional but human
- Direct and factual
- Empathetic without being overly apologetic
- Confident about what you know, transparent about what you do not
Status page tool setup checklist
When setting up a new status page:
- Define components based on user-facing services (not internal architecture)
- Set up subscriber notifications (email, SMS, webhook, RSS)
- Configure automated status updates from monitoring (operational/degraded)
- Pre-draft incident templates (copy from this guide)
- Assign ownership: who can publish updates (on-call + Communications Lead)
- Test the notification flow end-to-end before going live
- Add the status page link to your application's footer and support docs
- Configure maintenance window scheduling
- Set up uptime metrics display (90-day rolling window per component)
- Review and update component list quarterly as services evolve
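For the uptime metrics item above, the 90-day rolling figure is simply the window minus recorded downtime, expressed as a percentage. A minimal sketch, assuming downtime is tracked as minutes per component:

```python
def uptime_percent(downtime_minutes: float, window_days: int = 90) -> float:
    """Uptime over a rolling window, as a percentage."""
    total_minutes = window_days * 24 * 60   # 90 days = 129,600 minutes
    return round(100 * (1 - downtime_minutes / total_minutes), 3)

# ~43 minutes of downtime over 90 days is roughly "three nines"
print(uptime_percent(43))     # 99.967
print(uptime_percent(1296))   # 99.0
```

Displaying the calculation per component (rather than a single site-wide number) matches the user-facing component structure recommended earlier.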
war-room-checklist.md
War Room Checklist
Activation criteria
Open a war room when any of these conditions are met:
- SEV1 incident declared
- SEV2 not mitigated within 30 minutes
- Incident affects multiple services or teams
- Incident has customer-visible impact and no clear cause after 15 minutes
- Incident commander requests a war room for any reason
War room activation checklist (first 5 minutes)
The person who activates the war room (usually the on-call engineer or IC) runs through this checklist:
- Open a dedicated video call (use the team's standing war room link)
- Create or identify the incident channel (e.g., #inc-2026-03-14-checkout)
- Post the incident summary in the channel:
  INCIDENT ACTIVE
  Severity: [SEV1/SEV2]
  Summary: [One sentence - what is broken]
  Impact: [Who is affected and how]
  Started: [HH:MM UTC]
  IC: [@name]
  War room: [video call link]
- Assign roles (see Role Cards below)
- Confirm all role holders have joined the war room
- Scribe creates the incident timeline document
Role cards
Incident Commander (IC)
Primary job: Coordinate the response. You are NOT debugging.
Checklist:
- Confirm severity classification
- Assign all roles (Communications Lead, Technical Lead, Scribe)
- Run 15-minute checkpoints (see Checkpoint Script)
- Make escalation decisions
- Approve rollback or mitigation actions
- Decide when to declare resolution
- Assign post-mortem owner before closing the war room
Rules:
- Do not investigate the issue yourself. Delegate.
- If you are also the most qualified person to debug, hand off IC to someone else.
- If the war room exceeds 2 hours, rotate IC or bring in a fresh one.
Communications Lead
Primary job: Keep customers, stakeholders, and support informed.
Checklist:
- Post first status page update within 10 minutes of war room opening
- Update status page every 30 minutes (or sooner if there is new information)
- Notify internal stakeholders (support team, account managers for affected customers)
- Draft the resolution update for IC review before publishing
- Compile a list of customer reports or support tickets for the post-mortem
Rules:
- Use the templates from references/status-page-guide.md
- Every update must be reviewed by IC before publishing (exception: "still investigating" updates)
- Never share internal details, blame, or unconfirmed root causes externally
Technical Lead
Primary job: Drive the investigation and fix.
Checklist:
- Follow the relevant runbook (if one exists)
- Announce investigation steps before executing them
- Report findings to IC at each checkpoint
- Propose mitigation options with trade-offs
- Execute the approved fix
- Confirm metrics are recovering after the fix
Rules:
- Announce all production commands before running them
- If the runbook does not cover this scenario, say so immediately
- If you need help, tell the IC. Do not silently struggle.
Scribe
Primary job: Maintain a real-time timeline of the incident.
Checklist:
- Create the timeline document (use the team's incident template)
- Log every significant action with a UTC timestamp
- Log who did what (not just what happened)
- Log decisions and the reasoning behind them
- Log things that were tried but did not work
- At resolution, hand the timeline to the post-mortem owner
Rules:
- Capture facts, not interpretations
- If something is unclear, ask for clarification and log the answer
- The timeline is the primary input for the post-mortem - completeness matters
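A timeline entry only needs a timestamp, an actor, and a fact. Most teams use a shared doc or a chat bot for this, but the shape of the record is the same either way; a minimal in-memory sketch:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    when: datetime
    who: str
    what: str   # facts, not interpretations

@dataclass
class IncidentTimeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def log(self, who: str, what: str) -> None:
        """Append a UTC-timestamped entry recording who did what."""
        self.entries.append(
            TimelineEntry(datetime.now(timezone.utc), who, what))

    def render(self) -> str:
        return "\n".join(
            f"{e.when:%H:%M} UTC  {e.who}: {e.what}" for e in self.entries)

tl = IncidentTimeline()
tl.log("@sam", "Rolled back deploy; error rate unchanged")
tl.log("@ana", "Found connection pool exhaustion in the payment service")
print(tl.render())
```

Recording who as well as what (per the checklist above) is what lets the post-mortem reconstruct decision points, not just events.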
Checkpoint script (every 15 minutes)
The IC runs this script at each checkpoint. Read it aloud:
CHECKPOINT - [HH:MM UTC]
1. STATUS CHECK
"Technical Lead: What do we know now that we didn't know 15 minutes ago?"
2. NEXT STEPS
"What are we trying next? Who is doing it?"
3. ESCALATION CHECK
"Do we need to bring in anyone else? Any dependent teams?"
4. COMMUNICATIONS CHECK
"Communications Lead: Is the status page current? When is the next update due?"
5. TIMELINE CHECK
"Scribe: Are we capturing everything? Anything to add?"
6. SEVERITY CHECK
"Has the severity changed? Should we escalate or de-escalate?"
War room rules
Post these rules in the incident channel at the start of every war room:
WAR ROOM RULES
1. One conversation at a time. IC moderates.
2. Announce all production commands BEFORE running them.
3. No side investigations without telling the IC.
4. If you join late, read the timeline first. Do not ask "what happened?"
5. Mute when not speaking (video call).
6. Keep the channel for incident discussion only. Use threads for tangents.
7. If you do not have a role, observe silently unless asked.
War room closure checklist
When the IC declares the incident resolved:
- Confirm metrics have returned to baseline for at least 15 minutes
- Communications Lead posts the resolution update on the status page
- IC assigns a post-mortem owner and sets a deadline (within 48 hours)
- Scribe finalizes the timeline and shares it with the post-mortem owner
- IC posts a summary in the incident channel:
  INCIDENT RESOLVED
  Duration: [X hours Y minutes]
  Root cause: [One sentence]
  Fix applied: [One sentence]
  Post-mortem owner: [@name]
  Post-mortem deadline: [date]
- IC thanks everyone who participated
- Close the war room video call
- Archive the incident channel (do not delete - it is a historical record)
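"Metrics back to baseline for at least 15 minutes" can be a mechanical check rather than a judgment call. A sketch that treats recovery as every recent sample sitting inside a tolerance band around the baseline; the one-sample-per-minute assumption and the 10% tolerance are illustrative, not prescriptive:

```python
def is_recovered(samples: list[float], baseline: float,
                 tolerance: float = 0.10, required: int = 15) -> bool:
    """True if the last `required` samples (one per minute) are all
    within `tolerance` of the baseline value."""
    if len(samples) < required:
        return False
    recent = samples[-required:]
    return all(abs(s - baseline) <= tolerance * baseline for s in recent)

# Latency spiked, then settled near the 310ms baseline for 15 minutes
latency = [950, 700, 400] + [300 + i % 20 for i in range(15)]
print(is_recovered(latency, baseline=310))  # True
```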
War room anti-patterns
| Anti-pattern | Why it is harmful | What to do instead |
|---|---|---|
| IC also debugging | Coordination stops, chaos increases | IC delegates all investigation |
| No scribe | Post-mortem has no accurate timeline | Always assign a scribe, even for SEV2 |
| Side conversations | IC loses track of what is happening | Enforce "one conversation" rule |
| Heroic solo debugging | Others cannot help or learn; single point of failure | Announce all actions; pair on investigation |
| No checkpoints | Investigation drifts; people work on the wrong thing | IC runs checkpoint script every 15 minutes |
| War room stays open after resolution | Fatigue, wasted time | Close promptly once metrics are stable |
Frequently Asked Questions
What is incident-management?
Use this skill when managing production incidents, designing on-call rotations, writing runbooks, conducting post-mortems, setting up status pages, or running war rooms. Triggers on incident response, incident commander, on-call schedule, pager escalation, runbook authoring, post-incident review, blameless retro, status page updates, war room coordination, severity classification, and any task requiring structured incident lifecycle management.
How do I install incident-management?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill incident-management in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support incident-management?
incident-management works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.