site-reliability
Use this skill when implementing SRE practices, defining error budgets, reducing toil, planning capacity, or improving service reliability. Triggers on SRE, error budgets, SLOs, SLAs, toil automation, incident management, postmortems, on-call rotation, capacity planning, chaos engineering, and any task requiring reliability engineering decisions.
site-reliability is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It helps with implementing SRE practices, defining error budgets, reducing toil, planning capacity, and improving service reliability.
Quick Facts
| Field | Value |
|---|---|
| Category | engineering |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill site-reliability
- The site-reliability skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
SRE is the discipline of applying software engineering to operations problems. It replaces ad-hoc ops work with principled systems: reliability targets backed by error budgets, toil replaced by automation, and incidents treated as system failures rather than human ones. This skill covers the full SRE lifecycle - from defining SLOs through capacity planning and progressive delivery - as practiced by teams operating production systems at scale. Designed for engineers moving from "keep the lights on" to systematic reliability ownership.
Tags
sre reliability error-budgets toil capacity incident-management
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is site-reliability?
Use this skill when implementing SRE practices, defining error budgets, reducing toil, planning capacity, or improving service reliability. Triggers on SRE, error budgets, SLOs, SLAs, toil automation, incident management, postmortems, on-call rotation, capacity planning, chaos engineering, and any task requiring reliability engineering decisions.
How do I install site-reliability?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill site-reliability in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support site-reliability?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Site Reliability Engineering
SRE is the discipline of applying software engineering to operations problems. It replaces ad-hoc ops work with principled systems: reliability targets backed by error budgets, toil replaced by automation, and incidents treated as system failures rather than human ones. This skill covers the full SRE lifecycle - from defining SLOs through capacity planning and progressive delivery - as practiced by teams operating production systems at scale. Designed for engineers moving from "keep the lights on" to systematic reliability ownership.
When to use this skill
Trigger this skill when the user:
- Needs to define or revise SLOs, SLIs, or SLAs for a service
- Is calculating or acting on an error budget
- Wants to identify, measure, or automate toil
- Is running or writing a postmortem
- Is designing or improving an on-call rotation
- Is forecasting capacity needs or planning a load test
- Is designing a rollout strategy (canary, blue/green, progressive)
Do NOT trigger this skill for:
- Pure infrastructure provisioning without a reliability framing (use a Docker/K8s skill)
- Application performance optimization without an SLO context (use a performance-engineering skill)
Key principles
Embrace risk with error budgets - 100% reliability is neither achievable nor desirable. Every extra nine of availability comes at a cost: slower feature velocity, more complex systems, higher operational burden. An error budget makes the trade-off explicit: spend budget on risk-taking (deploys, experiments), save it when reliability is threatened.
Eliminate toil - Toil is work that is manual, repetitive, automatable, reactive, and scales with service growth without producing lasting value. Every hour of toil is an hour not spent on reliability improvements. The goal is not zero toil (some is unavoidable) but continuous reduction.
SLOs are the contract - SLOs align engineering and business on what reliability is worth. They prevent both over-engineering ("five nines or nothing") and under-investing ("it mostly works"). Write SLOs before writing on-call runbooks; the SLO defines what warrants waking someone up.
Blameless postmortems - Systems fail, not people. Blaming individuals creates an environment where engineers hide problems and avoid risk. Blameless postmortems surface systemic issues and produce durable fixes. The goal is learning, not accountability theater.
Automate yourself out of a job - The SRE charter is to automate operations work until the team's operational load is below 50% of their time. The remaining capacity is reserved for reliability engineering that makes the next incident less likely or less severe.
Core concepts
SLI / SLO / SLA hierarchy
SLA (Service Level Agreement)
- External contract with customers. Breach triggers penalties.
- Set conservatively: your internal SLO must be tighter than your SLA.
SLO (Service Level Objective)
- Internal target. Drives alerting, error budgets, and engineering decisions.
- Typically the SLO is set 0.5 to 1 percentage point tighter than the SLA (e.g., a 99.0% SLA backed by a 99.5% internal SLO).
SLI (Service Level Indicator)
- The actual measurement. A ratio: good events / total events.
- Example: (requests completing < 300ms) / (all requests)

Rule of thumb: Define one availability SLI and one latency SLI per user-facing service. Add correctness SLIs for data pipelines or financial systems.
Error budget mechanics
Error budget = 1 - SLO target
99.9% SLO -> 0.1% budget -> 43.8 min/month at risk
99.5% SLO -> 0.5% budget -> 3.65 hours/month at risk
Budget consumed = (bad events this window) / (total events this window)
Budget remaining = budget_total - budget_consumed

Burn rate = observed error rate / allowed error rate. A burn rate of 1 means you are spending budget at exactly the expected pace. A burn rate of 14.4 on a 30-day window means the budget is gone in 50 hours.
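As a hedged sketch, the arithmetic above can be expressed in Python (function names are illustrative, not part of any tooling):

```python
# Illustrative error-budget arithmetic over a rolling 30-day window.
# A calendar month of ~30.44 days gives the slightly larger 43.8 min figure.
WINDOW_MINUTES = 30 * 24 * 60

def error_budget_minutes(slo_target: float) -> float:
    """Minutes of full downtime the SLO allows per 30-day window."""
    return (1 - slo_target) * WINDOW_MINUTES

def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    observed = bad_events / total_events
    allowed = 1 - slo_target
    return observed / allowed

print(error_budget_minutes(0.999))      # -> ~43.2 minutes
print(burn_rate(1440, 100_000, 0.999))  # 1.44% observed / 0.1% allowed -> ~14.4
```

A burn rate of 14.4 exhausts the 30-day budget in 30 / 14.4 days, i.e. the 50 hours quoted above.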
Budget policy (what to do when budget is threatened):
| Budget remaining | Action |
|---|---|
| > 50% | Normal feature velocity, deploys allowed |
| 25-50% | Review recent changes, increase monitoring |
| 10-25% | Freeze non-essential deploys, focus on stability |
| < 10% | Feature freeze, all hands on reliability work |
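The budget policy table can be encoded directly so the decision is mechanical rather than debated mid-incident; the thresholds below come from the table, while the function name and return strings are illustrative:

```python
# Hypothetical encoding of the budget policy table above.

def budget_policy(budget_remaining: float) -> str:
    """Return the policy action for a remaining-budget fraction (0.0-1.0)."""
    if budget_remaining > 0.50:
        return "normal feature velocity, deploys allowed"
    if budget_remaining > 0.25:
        return "review recent changes, increase monitoring"
    if budget_remaining > 0.10:
        return "freeze non-essential deploys, focus on stability"
    return "feature freeze, all hands on reliability work"
```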
Toil definition
Toil has all of these properties - if even one is missing, it may be legitimate work:
- Manual: A human is in the loop doing repetitive keystrokes
- Repetitive: Done more than once with the same steps
- Automatable: A script or system could do it
- Reactive: Triggered by a system event, not proactive engineering
- No lasting value: Executing it does not improve the system; it just holds it steady
- Scales with load: More traffic, more toil (a danger sign)
Incident severity levels
| Severity | Customer impact | Response | Example |
|---|---|---|---|
| SEV1 | Complete outage or data loss | Immediate page, war room | Payment service down |
| SEV2 | Degraded core functionality | Page on-call | 20% of requests erroring |
| SEV3 | Minor degradation, workaround exists | Ticket, next business day | Slow dashboard loads |
| SEV4 | Cosmetic issue or internal tool | Backlog | Wrong label in admin UI |
On-call best practices
- Rotate weekly; never longer than two weeks without a break
- Guarantee engineers sleep: no P1 pages between 10pm-8am without escalation
- Track on-call load: pages per shift, time-to-ack, total hours interrupted
- Every on-call shift ends with a handoff: active incidents, lingering alerts, context
- Budget 20-30% of the next sprint for on-call follow-up work
Common tasks
Define SLOs for a service
Step 1: Choose the right SLIs. Start from user journeys, not technical metrics.
| User journey | SLI type | Measurement |
|---|---|---|
| "Page loads fast" | Latency | requests_under_300ms / total_requests |
| "API calls succeed" | Availability | non_5xx_responses / total_responses |
| "Data is correct" | Correctness | correct_outputs / total_outputs |
| "Writes persist" | Durability | successful_writes_verified / total_writes |
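As an illustrative sketch, the availability and latency SLIs from the table could be computed from raw request records like this (the record shape and the 300 ms threshold are assumptions for the example):

```python
# Hypothetical SLI computation from (status_code, latency_ms) request records.

def slis(requests: list[tuple[int, float]], latency_slo_ms: float = 300.0) -> dict:
    total = len(requests)
    good_avail = sum(1 for code, _ in requests if code < 500)      # non-5xx
    good_fast = sum(1 for _, ms in requests if ms < latency_slo_ms)
    return {"availability": good_avail / total, "latency": good_fast / total}

reqs = [(200, 120.0), (200, 450.0), (503, 90.0), (200, 280.0)]
print(slis(reqs))  # availability 3/4 = 0.75, latency 3/4 = 0.75
```

Both SLIs follow the same good-events / total-events shape, which is what makes them composable into error budgets.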
Step 2: Set targets using historical data.
1. Pull 30 days of your current SLI measurements
2. Find your current actual performance (e.g., 99.85% availability)
3. Set SLO slightly below current performance (e.g., 99.7%)
4. Tighten over time as you improve reliability

Never set an SLO tighter than your best recent 30-day window without a corresponding reliability investment plan.
Step 3: Choose the window. Rolling 30-day windows are standard. They smooth spikes but respond to sustained degradation. Avoid calendar month windows - they reset budgets on the 1st regardless of what happened on the 31st.
Step 4: Define measurement exclusions. Planned maintenance, dependencies outside your control, and client errors (4xx) are typically excluded from SLI calculations.
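The target-setting in Step 2 can be sketched as follows, assuming you have the window's SLI measurements as fractions; the 0.1-point headroom default is illustrative, not prescriptive:

```python
# Illustrative Step 2: set the SLO slightly below observed performance.

def suggest_slo(daily_slis: list[float], headroom: float = 0.001) -> float:
    """Suggest an SLO target just below the window's mean SLI."""
    current = sum(daily_slis) / len(daily_slis)  # observed performance
    return round(current - headroom, 4)

# 30 days at 99.85% availability suggests a 99.75% SLO with this headroom.
print(suggest_slo([0.9985] * 30))
```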
Calculate and track error budgets
Burn rate alerting (recommended over threshold alerting):
Fast burn alert (page immediately):
Condition: burn_rate > 14.4 for 5 minutes
Meaning: At this rate, 30-day budget exhausted in ~50 hours
Severity: SEV2, page on-call
Slow burn alert (ticket, investigate):
Condition: burn_rate > 3 for 60 minutes
Meaning: Budget exhausted in ~10 days if trend continues
Severity: SEV3, create ticket
Budget depletion alert (SEV1 escalation trigger):
Condition: budget_remaining < 10%
Action: Feature freeze, reliability sprint

Multi-window alerting catches both fast spikes and slow degradation:
- 5-minute window: catches fast burns (major incident)
- 1-hour window: catches slow burns (creeping degradation)
- Both windows alerting together = high-confidence page
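One possible encoding of this multi-window rule; the burn-rate inputs would come from your metrics system, and the function name is illustrative:

```python
# Hedged sketch of the multi-window burn-rate alerting policy above.

def page_decision(burn_rate_5m: float, burn_rate_1h: float) -> str:
    # Fast burn: page only when BOTH windows are elevated, which filters
    # transient spikes that self-recover within minutes.
    if burn_rate_5m > 14.4 and burn_rate_1h > 14.4:
        return "page"    # SEV2: 30-day budget gone in ~50 hours at this rate
    # Slow burn: sustained elevation on the long window alone.
    if burn_rate_1h > 3:
        return "ticket"  # SEV3: budget gone in ~10 days if the trend continues
    return "ok"
```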
Budget depletion actions:
- Stop all non-essential deploys
- Pull toil-reduction and reliability items from the backlog
- Review the postmortem queue for unresolved action items
- Document the decision with date and budget percentage in your incident tracker
Identify and reduce toil
Toil taxonomy - classify before automating:
| Category | Examples | Priority |
|---|---|---|
| Interrupt-driven | Restarting crashed pods, clearing queues | High - on-call tax |
| Regular manual ops | Weekly capacity checks, certificate renewals | Medium - scheduled work |
| Deploy ceremony | Manual release steps, environment promotion | High - blocks velocity |
| Data cleanup | Fixing bad records, reconciliation jobs | Medium - correctness risk |
| Access management | Provisioning accounts, rotating credentials | High - security risk |
Automation prioritization matrix:
                  HIGH FREQUENCY
                        |
      Quick to          |          Slow to
      automate          |          automate
                        |
   AUTOMATE FIRST ------+------ SCHEDULE: PLAN PROJECT
                        |
   AUTOMATE WHEN  ------+------ ACCEPT OR ELIMINATE
   CONVENIENT           |
                        |
                  LOW FREQUENCY

Measure toil before and after automation: track hours/week per category per engineer. If toil is growing, the automation is not keeping pace with service growth.
Run a blameless postmortem
When to hold one: Every SEV1. Every SEV2 with customer-visible impact. Any incident that consumed more than 4 hours of on-call time. Recurring SEV3s from the same root cause.
Timeline (24-48 hours after resolution):
Day 0 (during incident): Designate incident commander, keep a timeline in a shared doc
Day 1 (next morning): Assign postmortem owner, schedule meeting within 48 hours
Day 2 (postmortem): 60-90 min facilitated session
Day 3: Draft published internally for 24-hour comment period
Day 5: Final version published, action items entered in tracker

The five questions that drive every postmortem:
- What happened and when? (timeline)
- Why did it happen? (root cause - ask "why" five times)
- Why did we not detect it sooner? (detection gap)
- What slowed down the response? (mitigation gap)
- What prevents recurrence? (action items)
Action item rules: Each item must have an owner, a due date, and a measurable definition of done. "Improve monitoring" is not an action item. "Add burn-rate alert for payments-api availability SLO by 2025-Q3" is.
See references/postmortem-template.md for the full template with example entries
and facilitation guide.
Design on-call rotation
Rotation structure:
Primary on-call: First responder. Acks within 15 min, mitigates or escalates.
Secondary on-call: Backup if primary misses ack within 15 min.
Escalation path: Engineering manager -> Director -> Incident commander (for SEV1 only)

Runbook requirements (every alert must have one):
- Symptom: what the alert is telling you
- Impact: who is affected and how severely
- Steps: numbered investigation and mitigation steps
- Escalation: who to call if steps do not resolve it
- Context: links to dashboards, service documentation, past incidents
Handoff process (end of each on-call rotation):
- Document any open or lingering issues
- List any alerts that fired but did not page (worth reviewing)
- Share known fragile areas or upcoming risky changes
- Review toil hours and open action items with incoming on-call
Health metrics for on-call load:
| Metric | Target | Alert threshold |
|---|---|---|
| Pages per on-call week | < 5 | > 10 |
| Pages outside business hours | < 2/week | > 5/week |
| Time-to-ack (P1) | < 5 min | > 15 min |
| Toil percentage of on-call time | < 50% | > 70% |
Plan capacity
Demand forecasting approach:
1. Baseline: measure current peak RPS, CPU, memory, storage
2. Growth rate: calculate month-over-month traffic growth (last 6 months)
3. Project forward: apply growth rate to 6-month and 12-month horizons
4. Add headroom: 30-50% above projected peak for burst capacity
5. Trigger threshold: the utilization level that kicks off provisioning

Load testing before capacity decisions:
- Define the traffic shape (ramp, steady state, spike)
- Test to 150% of expected peak - find the breaking point before users do
- Measure: latency distribution at load, error rate at load, resource utilization
- Identify the bottleneck (CPU, DB connections, memory) before scaling the wrong thing
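The demand-forecasting steps can be sketched as follows, assuming growth compounds month over month; numbers and names are illustrative:

```python
# Illustrative capacity projection: compound growth plus burst headroom.

def projected_capacity(current_peak_rps: float, monthly_growth: float,
                       months_ahead: int, headroom: float = 0.4) -> float:
    """Projected peak demand plus 30-50% burst headroom (default 40%)."""
    projected_peak = current_peak_rps * (1 + monthly_growth) ** months_ahead
    return projected_peak * (1 + headroom)

# 1000 RPS peak, 5% monthly growth, 12-month horizon -> ~2514 RPS to provision.
print(projected_capacity(1000, 0.05, 12))
```

The same projection at the 6-month horizon is what drives the trigger thresholds in the headroom table: provision when sustained utilization would cross them before capacity can land.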
Headroom planning table:
| Component | Trigger utilization | Target utilization | Action |
|---|---|---|---|
| Compute (CPU) | > 70% sustained | 40-60% | Horizontal scale |
| Memory | > 80% | 50-70% | Vertical scale or tune GC |
| Database (connections) | > 80% pool use | 50-70% | Connection pooler, scale up |
| Storage | > 75% | < 60% | Provision more, archive old data |
| Network throughput | > 70% | < 50% | Scale or upgrade links |
Cost vs reliability trade-off: Headroom is expensive. Justify each component's target with an SLO - a 99.9% availability SLO for a stateless service does not require the same headroom as a 99.99% SLO for a payment processor.
Implement progressive rollouts
Rollout ladder:
0.1% canary (10 min)
-> 1% (30 min, review metrics)
-> 5% (1 hour)
-> 25% (1 hour)
-> 50% (1 hour)
-> 100%

Canary analysis - automatic promotion/rollback criteria:
| Signal | Rollback if | Promote if |
|---|---|---|
| Error rate | Canary > baseline + 0.5% | Canary <= baseline + 0.1% |
| p99 latency | Canary > baseline * 1.2 | Canary <= baseline * 1.05 |
| SLO burn rate | Canary burn rate > 5x | Canary burn rate <= 2x |
| CPU/Memory | Canary > baseline * 1.3 | Within 10% of baseline |
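A minimal sketch of the promote/rollback decision using the error-rate and p99-latency thresholds from the table (the function and parameter names are hypothetical; a real implementation would evaluate all four signals):

```python
# Hypothetical canary analysis decision on two of the table's signals.

def canary_decision(canary_err: float, base_err: float,
                    canary_p99: float, base_p99: float) -> str:
    # Rollback if ANY signal breaches its rollback threshold.
    if canary_err > base_err + 0.005 or canary_p99 > base_p99 * 1.2:
        return "rollback"
    # Promote only if ALL signals are within their promote thresholds.
    if canary_err <= base_err + 0.001 and canary_p99 <= base_p99 * 1.05:
        return "promote"
    return "hold"  # between thresholds: keep traffic level, keep observing
```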
Automated rollback triggers: Instrument your CD pipeline to roll back automatically when error rate or latency breaches the canary threshold. Do not rely on humans to catch canary regressions - the whole point is to automate the decision. If your deployment tool does not support automated rollback, treat that as a toil item to fix.
Feature flags vs canary: Canary deploys test infrastructure changes (binary, container, config). Feature flags test product changes (code paths). Use both. Separate the risk of deploying new infrastructure from the risk of activating new behavior.
Gotchas
SLO window reset on the 1st creates budget gaming - Calendar month windows reset error budget on the 1st regardless of what happened on the 31st. Teams learn to push risky deploys right after reset. Use rolling 30-day windows which are always live and cannot be gamed.
Burn rate alerts with a single window produce too much noise - A 5-minute burn rate alert alone generates pages for transient spikes that self-recover. Multi-window alerting (5-minute AND 1-hour both elevated) dramatically reduces false positives while keeping sensitivity to real incidents.
Toil metrics without a reduction target are just bookkeeping - Measuring toil hours without committing to a reduction target and a sprint allocation to address it creates awareness without action. The measure only has value if it gates a quarterly automation investment.
Canary rollout with no automated rollback is manual canary - A canary that requires a human to notice the error rate spike and manually roll back is not a canary - it is a staged rollout with extra steps. Automated rollback on threshold breach is the defining property; without it, the safety benefit is largely absent.
On-call runbooks that say "escalate to engineering" - A runbook whose resolution step is "page someone else" does not reduce on-call burden; it just shifts it. Every runbook must include at least one concrete mitigation step the on-call can take before escalating.
Anti-patterns / common mistakes
| Mistake | Why it is wrong | What to do instead |
|---|---|---|
| Setting SLOs without historical data | Targets become aspirational fiction, not engineering constraints | Measure current performance first, set SLO at or slightly below it |
| Alerting on resource utilization not SLOs | CPU at 90% may not affect users; 1% error rate definitely does | Alert on SLO burn rate; use resource metrics for capacity planning only |
| Blameful postmortems | Engineers hide problems, avoid risky-but-necessary changes | Explicitly state "no blame" in the template; focus every question on systems |
| Counting toil in hours but not automating it | Creates awareness without action | Budget one sprint per quarter specifically for toil reduction |
| Infinite error budget freezes | Teams freeze deploys forever, killing velocity | Define explicit budget policy with percentage thresholds and time-bounded freezes |
| On-call without runbooks | Every incident requires heroics; knowledge stays in individuals | Treat "alert without runbook" as a blocker; write the runbook during the incident |
References
For detailed guidance on specific domains, load the relevant file from references/:
references/postmortem-template.md- full postmortem template with example entries, facilitation guide, and action item tracker
Only load a references file when the current task requires it.
References
postmortem-template.md
Postmortem Template
A blameless postmortem is a structured learning exercise, not an accountability hearing. The goal is to understand what happened, improve the system, and prevent recurrence. Every question in this template focuses on systems, processes, and tooling - not on individuals.
Facilitation Guide
Before the meeting:
- Designate one facilitator (neutral; preferably not the incident commander)
- Designate one scribe (takes verbatim notes, not the facilitator)
- Share the draft timeline 24 hours before so attendees can correct it
- Invite: incident responders, on-call, service owners, one representative from affected teams
- Block 60-90 minutes; complex incidents need 90
During the meeting:
- Open with: "This is a learning session. There is no blame here."
- Use "the system did X" not "you did X"
- When discussion gets heated: redirect to "what could the system have done differently?"
- Time-box root cause discussion to 30 minutes; spend remaining time on action items
- If someone is defensive, ask: "What would have had to be true for anyone in that position to have done differently?"
After the meeting:
- Publish draft within 24 hours for comment
- Finalize and share within 5 days of incident
- Enter all action items into your issue tracker with owners and due dates
- Review open action items at the start of every sprint retrospective
Postmortem Document
Incident metadata
| Field | Value |
|---|---|
| Incident ID | INC-YYYY-NNN |
| Date and time detected | YYYY-MM-DD HH:MM UTC |
| Date and time resolved | YYYY-MM-DD HH:MM UTC |
| Total duration | X hours Y minutes |
| Severity | SEV1 / SEV2 / SEV3 |
| Incident commander | Name |
| Postmortem owner | Name |
| Postmortem date | YYYY-MM-DD |
| Services affected | List each service |
| Customer impact | Description (e.g., "100% of checkout requests failed") |
| SLO impact | Error budget consumed (e.g., "0.08% of 30-day availability budget") |
Example:
| Field | Value |
|---|---|
| Incident ID | INC-2025-047 |
| Date and time detected | 2025-03-14 14:32 UTC |
| Date and time resolved | 2025-03-14 16:18 UTC |
| Total duration | 1 hour 46 minutes |
| Severity | SEV1 |
| Incident commander | Priya Sharma |
| Postmortem owner | Jordan Lee |
| Postmortem date | 2025-03-16 |
| Services affected | payments-api, order-service |
| Customer impact | 100% of payment attempts failed for 1h 46m |
| SLO impact | Consumed 87% of monthly error budget in one incident |
Summary
One paragraph. What happened, when, who was affected, and how it was resolved. Written for someone who was not involved. Avoid jargon.
Example:
On 2025-03-14 at 14:32 UTC, the payments-api began returning 503 errors for all requests. This was caused by database connection pool exhaustion triggered by a configuration change deployed at 14:15 UTC that reduced the connection pool max size from 100 to 10. All checkout attempts failed for 1 hour 46 minutes until the configuration was reverted at 16:18 UTC. Approximately 14,000 customers were unable to complete purchases during the window.
Timeline
List events in chronological order with UTC timestamps. Include the detection, escalation, diagnosis, and resolution events. Include near-misses and things that helped recovery - not just failures.
| Time (UTC) | Event | Actor |
|---|---|---|
| YYYY-MM-DD HH:MM | | |
| YYYY-MM-DD HH:MM | | |
Example:
| Time (UTC) | Event | Actor |
|---|---|---|
| 2025-03-14 14:15 | Config change deployed: pool_max_connections reduced from 100 to 10 | CI/CD pipeline |
| 2025-03-14 14:32 | SLO burn rate alert fires: 14.4x burn rate on payments-api availability | Alerting system |
| 2025-03-14 14:35 | Primary on-call acks alert, begins investigation | Kenji Tanaka |
| 2025-03-14 14:41 | Error rate confirmed at 100%; incident declared SEV1; incident commander assigned | Kenji Tanaka |
| 2025-03-14 14:48 | Traces show all requests timing out at DB layer | Kenji Tanaka |
| 2025-03-14 15:02 | DB team joins call; connection pool exhaustion confirmed | DB on-call |
| 2025-03-14 15:20 | Root cause identified: pool_max_connections=10 in recent deploy | Priya Sharma |
| 2025-03-14 15:45 | Config revert prepared and reviewed | Kenji Tanaka, Priya Sharma |
| 2025-03-14 16:15 | Config revert deployed | CD pipeline |
| 2025-03-14 16:18 | Error rate returns to < 0.1%; incident resolved | Kenji Tanaka |
| 2025-03-14 16:30 | All-clear communication sent to customer support | Priya Sharma |
Root cause analysis
Ask "why" five times. Each answer becomes the input to the next question. Stop when you reach a systemic cause - something about processes, tooling, or design - not a person.
The five-why chain:
Why did the service fail?
-> Connection pool was exhausted; all DB requests timed out
Why was the connection pool exhausted?
-> pool_max_connections was set to 10, far below the 100 connections needed at peak load
Why was pool_max_connections set to 10?
-> A config change in the deploy reduced it from 100 to 10
Why did the config change ship with an incorrect value?
-> The config value was changed to 10 (not 100) when cleaning up a test environment config,
and no automated validation checked the value range before deploy
Why was there no validation?
-> The configuration system has no schema enforcement or range validation on pool settings

Root cause (systemic): The configuration deployment pipeline lacks validation that enforces minimum and maximum bounds on critical infrastructure parameters. There was no automated guard between an incorrect configuration value and production.
Detection
How was the incident detected? How long after it started? Could it have been caught sooner?
Questions to answer:
- Was the incident detected by an alert, a customer report, or manual discovery?
- How long was the service degraded before detection?
- Did the alert fire at the right severity?
- Was the runbook link in the alert? Was the runbook accurate?
- What could have detected this sooner?
Example:
The burn rate alert fired 17 minutes after the config was deployed. Detection was automated and appropriate. However, the runbook linked from the alert did not include steps for diagnosing connection pool issues - the responder had to search Slack history to find the DB team's contact. The 17-minute window could be reduced: a canary analysis check during deploy could have caught the pool exhaustion before reaching 100% of traffic.
Response
How long did mitigation take? What slowed it down? What helped?
Questions to answer:
- From alert to mitigation: what was the elapsed time? Was that acceptable?
- What information was missing at the start of the investigation?
- Were the right people in the incident call quickly enough?
- Was communication to stakeholders/customers timely?
- What tools or runbooks saved time? What was missing?
Example:
Time-to-mitigate was 1 hour 43 minutes from detection. The longest delay was 35 minutes identifying the root cause, because:
- The connection pool metrics were not on the default service dashboard
- The DB team had to be manually pulled in; no automated escalation path existed for DB-layer incidents
What helped: the incident commander used the standard war room template, which kept communication structured. The config diff was easy to find because all deploys are tagged with a commit hash in the config store.
Impact assessment
Quantify the impact across customer experience, business metrics, and reliability targets.
| Dimension | Impact |
|---|---|
| Users affected | Estimated or exact count |
| Requests failed | Total failed / total expected |
| Revenue impact | Estimate (if applicable) |
| SLO budget consumed | % of monthly budget |
| Secondary systems affected | List |
| Data integrity impact | Any data loss or corruption? |
Example:
| Dimension | Impact |
|---|---|
| Users affected | ~14,200 unique users (based on session counts during window) |
| Requests failed | 847,000 / 847,000 checkout requests (100%) |
| Revenue impact | ~$340,000 GMV blocked (not permanently lost; retried after resolution) |
| SLO budget consumed | 87% of monthly error budget consumed in one incident |
| Secondary systems affected | Order-service (dependent on payments-api); failed gracefully |
| Data integrity impact | No data loss; all failed transactions rolled back cleanly |
Contributing factors
Not the root cause - but conditions that made the incident more likely, more severe, or harder to detect and resolve. Each factor is a separate improvement opportunity.
List each factor as a sentence describing the systemic condition:
- The configuration deployment pipeline had no validation for parameter bounds
- Connection pool metrics were absent from the service's primary dashboard
- The runbook for availability alerts did not cover DB layer diagnosis
- There was no automated canary analysis step in the config deployment pipeline
- The escalation path for DB-layer incidents was not documented
Action items
Each action item must have: a description, an owner (person, not team), a due date, and a clear definition of done. "Improve X" is not an action item.
| ID | Action | Owner | Due date | Status | Definition of done |
|---|---|---|---|---|---|
| AI-001 | | | | Open | |
| AI-002 | | | | Open | |
Example:
| ID | Action | Owner | Due date | Status | Definition of done |
|---|---|---|---|---|---|
| AI-001 | Add schema validation with min/max bounds to config deployment pipeline for all DB connection pool parameters | Jordan Lee | 2025-04-04 | Open | CI pipeline rejects any config with pool_max_connections < 20 or > 500; tested with a deliberately bad config |
| AI-002 | Add connection pool utilization panel (current/max, wait time) to payments-api service dashboard | Kenji Tanaka | 2025-03-28 | Open | Dashboard panel live in Grafana, verified against staging traffic |
| AI-003 | Update availability alert runbook to include DB connection pool diagnosis steps | Kenji Tanaka | 2025-03-21 | Open | Runbook has a "Check DB connection pool" section with commands and expected output |
| AI-004 | Add canary analysis step to config deploy pipeline checking error rate before promoting to 100% | Priya Sharma | 2025-04-18 | Open | Config deploys pause at 5% traffic for 5 minutes; auto-rollback if error rate > baseline + 1% |
| AI-005 | Document DB team escalation path in on-call handbook and link from all DB-related alerts | Priya Sharma | 2025-03-21 | Open | On-call handbook has "DB layer incidents" section; all DB alerts have escalation contact |
What went well
Explicitly document what worked. Reinforcing good practices is as important as fixing gaps. This section prevents the meeting from becoming purely negative.
Example:
- Burn rate alerting fired within 17 minutes of incident start - fast enough for automatic detection
- The incident commander kept a clear timeline in real-time, which made this postmortem significantly easier to write
- All failed transactions rolled back cleanly - no data integrity work required after resolution
- Customer support was notified within 20 minutes of incident declaration and had accurate status updates throughout
Lessons learned
High-level principles the team is taking away. Not action items - these are insights that change how the team thinks. Useful for sharing across teams.
Example:
- Configuration changes are code changes: they need the same validation, review, and canary deployment treatment as binary changes
- Dashboard completeness is an on-call SLA: if it is not on the dashboard, it will not be checked during an incident under pressure
- The postmortem process worked: having a designated incident commander and real-time timeline shortened the postmortem meeting by an estimated 30 minutes
Follow-up review date
Set a date to review the status of action items. Default: 30 days after postmortem.
Next review: YYYY-MM-DD Review owner: Name (verify all action items are complete or have updated owners/dates)
Quick-reference: postmortem checklist
During the incident:
- Designate an incident commander
- Start a shared timeline document immediately
- Note every significant event with a timestamp
Within 24 hours of resolution:
- Draft timeline shared with participants for corrections
- Postmortem owner assigned
- Meeting scheduled within 48 hours
At the meeting:
- Facilitator opens with blameless framing
- Timeline reviewed and finalized
- Root cause chain completed (5 whys)
- Contributing factors listed
- Action items have owners and due dates
Within 5 days:
- Final postmortem published internally
- All action items entered in issue tracker
- Summary shared with affected stakeholders
- SLO impact documented in reliability dashboard
30-day review:
- All action items complete or rescheduled with explanation
- Lessons learned shared with broader engineering organization