site-reliability
Use this skill when implementing SRE practices, defining error budgets, reducing toil, planning capacity, or improving service reliability. Triggers on SRE, error budgets, SLOs, SLAs, toil automation, incident management, postmortems, on-call rotation, capacity planning, chaos engineering, and any task requiring reliability engineering decisions.
site-reliability is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It helps with implementing SRE practices, defining error budgets, reducing toil, planning capacity, and improving service reliability.
Quick Facts
| Field | Value |
|---|---|
| Category | engineering |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill site-reliability
- The site-reliability skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
SRE is the discipline of applying software engineering to operations problems. It replaces ad-hoc ops work with principled systems: reliability targets backed by error budgets, toil replaced by automation, and incidents treated as system failures rather than human ones. This skill covers the full SRE lifecycle - from defining SLOs through capacity planning and progressive delivery - as practiced by teams operating production systems at scale. Designed for engineers moving from "keep the lights on" to systematic reliability ownership.
Tags
sre reliability error-budgets toil capacity incident-management
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is site-reliability?
Use this skill when implementing SRE practices, defining error budgets, reducing toil, planning capacity, or improving service reliability. Triggers on SRE, error budgets, SLOs, SLAs, toil automation, incident management, postmortems, on-call rotation, capacity planning, chaos engineering, and any task requiring reliability engineering decisions.
How do I install site-reliability?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill site-reliability in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support site-reliability?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Site Reliability Engineering
SRE is the discipline of applying software engineering to operations problems. It replaces ad-hoc ops work with principled systems: reliability targets backed by error budgets, toil replaced by automation, and incidents treated as system failures rather than human ones. This skill covers the full SRE lifecycle - from defining SLOs through capacity planning and progressive delivery - as practiced by teams operating production systems at scale. Designed for engineers moving from "keep the lights on" to systematic reliability ownership.
When to use this skill
Trigger this skill when the user:
- Needs to define or revise SLOs, SLIs, or SLAs for a service
- Is calculating or acting on an error budget
- Wants to identify, measure, or automate toil
- Is running or writing a postmortem
- Is designing or improving an on-call rotation
- Is forecasting capacity needs or planning a load test
- Is designing a rollout strategy (canary, blue/green, progressive)
Do NOT trigger this skill for:
- Pure infrastructure provisioning without a reliability framing (use a Docker/K8s skill)
- Application performance optimization without an SLO context (use a performance-engineering skill)
Key principles
Embrace risk with error budgets - 100% reliability is neither achievable nor desirable. Every extra nine of availability comes at a cost: slower feature velocity, more complex systems, higher operational burden. An error budget makes the trade-off explicit: spend budget on risk-taking (deploys, experiments), save it when reliability is threatened.
Eliminate toil - Toil is work that is manual, repetitive, automatable, reactive, and scales with service growth without producing lasting value. Every hour of toil is an hour not spent on reliability improvements. The goal is not zero toil (some is unavoidable) but continuous reduction.
SLOs are the contract - SLOs align engineering and business on what reliability is worth. They prevent both over-engineering ("five nines or nothing") and under-investing ("it mostly works"). Write SLOs before writing on-call runbooks; the SLO defines what warrants waking someone up.
Blameless postmortems - Systems fail, not people. Blaming individuals creates an environment where engineers hide problems and avoid risk. Blameless postmortems surface systemic issues and produce durable fixes. The goal is learning, not accountability theater.
Automate yourself out of a job - The SRE charter is to automate operations work until the team's operational load is below 50% of their time. The remaining capacity is reserved for reliability engineering that makes the next incident less likely or less severe.
Core concepts
SLI / SLO / SLA hierarchy
SLA (Service Level Agreement)
- External contract with customers. Breach triggers penalties.
- Set conservatively: your internal SLO must be tighter than your SLA.
SLO (Service Level Objective)
- Internal target. Drives alerting, error budgets, and engineering decisions.
- Typically the SLO is set 0.5 to 1 percentage point tighter than the SLA (e.g., a 99.0% SLA backed by a 99.5% internal SLO).
SLI (Service Level Indicator)
- The actual measurement. A ratio: good events / total events.
- Example: (requests completing < 300ms) / (all requests)

Rule of thumb: Define one availability SLI and one latency SLI per user-facing service. Add correctness SLIs for data pipelines or financial systems.
Error budget mechanics
Error budget = 1 - SLO target
99.9% SLO -> 0.1% budget -> 43.8 min/month at risk
99.5% SLO -> 0.5% budget -> 3.65 hours/month at risk
Budget consumed = (bad events this window) / (total events this window)
Budget remaining = budget_total - budget_consumed

Burn rate = observed error rate / allowed error rate. A burn rate of 1 means you are spending budget at exactly the expected pace. A burn rate of 14.4 on a 30-day window means the budget is gone in 50 hours.
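As a hedged sketch, the arithmetic above can be expressed in Python (function names are illustrative, not part of any tooling):

```python
# Illustrative error-budget arithmetic over a rolling 30-day window.
# A calendar month of ~30.44 days gives the slightly larger 43.8 min figure.
WINDOW_MINUTES = 30 * 24 * 60

def error_budget_minutes(slo_target: float) -> float:
    """Minutes of full downtime the SLO allows per 30-day window."""
    return (1 - slo_target) * WINDOW_MINUTES

def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    observed = bad_events / total_events
    allowed = 1 - slo_target
    return observed / allowed

print(error_budget_minutes(0.999))      # -> ~43.2 minutes
print(burn_rate(1440, 100_000, 0.999))  # 1.44% observed / 0.1% allowed -> ~14.4
```

A burn rate of 14.4 exhausts the 30-day budget in 30 / 14.4 days, i.e. the 50 hours quoted above.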
Budget policy (what to do when budget is threatened):
| Budget remaining | Action |
|---|---|
| > 50% | Normal feature velocity, deploys allowed |
| 25-50% | Review recent changes, increase monitoring |
| 10-25% | Freeze non-essential deploys, focus on stability |
| < 10% | Feature freeze, all hands on reliability work |
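The budget policy table can be encoded directly so the decision is mechanical rather than debated mid-incident; the thresholds below come from the table, while the function name and return strings are illustrative:

```python
# Hypothetical encoding of the budget policy table above.

def budget_policy(budget_remaining: float) -> str:
    """Return the policy action for a remaining-budget fraction (0.0-1.0)."""
    if budget_remaining > 0.50:
        return "normal feature velocity, deploys allowed"
    if budget_remaining > 0.25:
        return "review recent changes, increase monitoring"
    if budget_remaining > 0.10:
        return "freeze non-essential deploys, focus on stability"
    return "feature freeze, all hands on reliability work"
```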
Toil definition
Toil has all of these properties - if even one is missing, it may be legitimate work:
- Manual: A human is in the loop doing repetitive keystrokes
- Repetitive: Done more than once with the same steps
- Automatable: A script or system could do it
- Reactive: Triggered by a system event, not proactive engineering
- No lasting value: Executing it does not improve the system; it just holds it steady
- Scales with load: More traffic, more toil (a danger sign)
Incident severity levels
| Severity | Customer impact | Response | Example |
|---|---|---|---|
| SEV1 | Complete outage or data loss | Immediate page, war room | Payment service down |
| SEV2 | Degraded core functionality | Page on-call | 20% of requests erroring |
| SEV3 | Minor degradation, workaround exists | Ticket, next business day | Slow dashboard loads |
| SEV4 | Cosmetic issue or internal tool | Backlog | Wrong label in admin UI |
On-call best practices
- Rotate weekly; never longer than two weeks without a break
- Guarantee engineers sleep: no P1 pages between 10pm-8am without escalation
- Track on-call load: pages per shift, time-to-ack, total hours interrupted
- Every on-call shift ends with a handoff: active incidents, lingering alerts, context
- Budget 20-30% of the next sprint for on-call follow-up work
Common tasks
Define SLOs for a service
Step 1: Choose the right SLIs. Start from user journeys, not technical metrics.
| User journey | SLI type | Measurement |
|---|---|---|
| "Page loads fast" | Latency | requests_under_300ms / total_requests |
| "API calls succeed" | Availability | non_5xx_responses / total_responses |
| "Data is correct" | Correctness | correct_outputs / total_outputs |
| "Writes persist" | Durability | successful_writes_verified / total_writes |
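As an illustrative sketch, the availability and latency SLIs from the table could be computed from raw request records like this (the record shape and the 300 ms threshold are assumptions for the example):

```python
# Hypothetical SLI computation from (status_code, latency_ms) request records.

def slis(requests: list[tuple[int, float]], latency_slo_ms: float = 300.0) -> dict:
    total = len(requests)
    good_avail = sum(1 for code, _ in requests if code < 500)      # non-5xx
    good_fast = sum(1 for _, ms in requests if ms < latency_slo_ms)
    return {"availability": good_avail / total, "latency": good_fast / total}

reqs = [(200, 120.0), (200, 450.0), (503, 90.0), (200, 280.0)]
print(slis(reqs))  # availability 3/4 = 0.75, latency 3/4 = 0.75
```

Both SLIs follow the same good-events / total-events shape, which is what makes them composable into error budgets.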
Step 2: Set targets using historical data.
1. Pull 30 days of your current SLI measurements
2. Find your current actual performance (e.g., 99.85% availability)
3. Set SLO slightly below current performance (e.g., 99.7%)
4. Tighten over time as you improve reliability

Never set an SLO tighter than your best recent 30-day window without a corresponding reliability investment plan.
Step 3: Choose the window. Rolling 30-day windows are standard. They smooth spikes but respond to sustained degradation. Avoid calendar month windows - they reset budgets on the 1st regardless of what happened on the 31st.
Step 4: Define measurement exclusions. Planned maintenance, dependencies outside your control, and client errors (4xx) are typically excluded from SLI calculations.
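The target-setting in Step 2 can be sketched as follows, assuming you have the window's SLI measurements as fractions; the 0.1-point headroom default is illustrative, not prescriptive:

```python
# Illustrative Step 2: set the SLO slightly below observed performance.

def suggest_slo(daily_slis: list[float], headroom: float = 0.001) -> float:
    """Suggest an SLO target just below the window's mean SLI."""
    current = sum(daily_slis) / len(daily_slis)  # observed performance
    return round(current - headroom, 4)

# 30 days at 99.85% availability suggests a 99.75% SLO with this headroom.
print(suggest_slo([0.9985] * 30))
```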
Calculate and track error budgets
Burn rate alerting (recommended over threshold alerting):
Fast burn alert (page immediately):
Condition: burn_rate > 14.4 for 5 minutes
Meaning: At this rate, 30-day budget exhausted in ~50 hours
Severity: SEV2, page on-call
Slow burn alert (ticket, investigate):
Condition: burn_rate > 3 for 60 minutes
Meaning: Budget exhausted in ~10 days if trend continues
Severity: SEV3, create ticket
Budget depletion alert (SEV1 escalation trigger):
Condition: budget_remaining < 10%
Action: Feature freeze, reliability sprint

Multi-window alerting catches both fast spikes and slow degradation:
- 5-minute window: catches fast burns (major incident)
- 1-hour window: catches slow burns (creeping degradation)
- Both windows alerting together = high-confidence page
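One possible encoding of this multi-window rule; the burn-rate inputs would come from your metrics system, and the function name is illustrative:

```python
# Hedged sketch of the multi-window burn-rate alerting policy above.

def page_decision(burn_rate_5m: float, burn_rate_1h: float) -> str:
    # Fast burn: page only when BOTH windows are elevated, which filters
    # transient spikes that self-recover within minutes.
    if burn_rate_5m > 14.4 and burn_rate_1h > 14.4:
        return "page"    # SEV2: 30-day budget gone in ~50 hours at this rate
    # Slow burn: sustained elevation on the long window alone.
    if burn_rate_1h > 3:
        return "ticket"  # SEV3: budget gone in ~10 days if the trend continues
    return "ok"
```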
Budget depletion actions:
- Stop all non-essential deploys
- Pull toil-reduction and reliability items from the backlog
- Review the postmortem queue for unresolved action items
- Document the decision with date and budget percentage in your incident tracker
Identify and reduce toil
Toil taxonomy - classify before automating:
| Category | Examples | Priority |
|---|---|---|
| Interrupt-driven | Restarting crashed pods, clearing queues | High - on-call tax |
| Regular manual ops | Weekly capacity checks, certificate renewals | Medium - scheduled work |
| Deploy ceremony | Manual release steps, environment promotion | High - blocks velocity |
| Data cleanup | Fixing bad records, reconciliation jobs | Medium - correctness risk |
| Access management | Provisioning accounts, rotating credentials | High - security risk |
Automation prioritization matrix:
                  HIGH FREQUENCY
                        |
      Quick to          |          Slow to
      automate          |          automate
                        |
   AUTOMATE FIRST ------+------ SCHEDULE: PLAN PROJECT
                        |
   AUTOMATE WHEN  ------+------ ACCEPT OR ELIMINATE
   CONVENIENT           |
                        |
                  LOW FREQUENCY

Measure toil before and after automation: track hours/week per category per engineer. If toil is growing, the automation is not keeping pace with service growth.
Run a blameless postmortem
When to hold one: Every SEV1. Every SEV2 with customer-visible impact. Any incident that consumed more than 4 hours of on-call time. Recurring SEV3s from the same root cause.
Timeline (24-48 hours after resolution):
Day 0 (during incident): Designate incident commander, keep a timeline in a shared doc
Day 1 (next morning): Assign postmortem owner, schedule meeting within 48 hours
Day 2 (postmortem): 60-90 min facilitated session
Day 3: Draft published internally for 24-hour comment period
Day 5: Final version published, action items entered in tracker

The five questions that drive every postmortem:
- What happened and when? (timeline)
- Why did it happen? (root cause - ask "why" five times)
- Why did we not detect it sooner? (detection gap)
- What slowed down the response? (mitigation gap)
- What prevents recurrence? (action items)
Action item rules: Each item must have an owner, a due date, and a measurable definition of done. "Improve monitoring" is not an action item. "Add burn-rate alert for payments-api availability SLO by 2025-Q3" is.
See references/postmortem-template.md for the full template with example entries
and facilitation guide.
Design on-call rotation
Rotation structure:
Primary on-call: First responder. Acks within 15 min, mitigates or escalates.
Secondary on-call: Backup if primary misses ack within 15 min.
Escalation path: Engineering manager -> Director -> Incident commander (for SEV1 only)

Runbook requirements (every alert must have one):
- Symptom: what the alert is telling you
- Impact: who is affected and how severely
- Steps: numbered investigation and mitigation steps
- Escalation: who to call if steps do not resolve it
- Context: links to dashboards, service documentation, past incidents
Handoff process (end of each on-call rotation):
- Document any open or lingering issues
- List any alerts that fired but did not page (worth reviewing)
- Share known fragile areas or upcoming risky changes
- Review toil hours and open action items with incoming on-call
Health metrics for on-call load:
| Metric | Target | Alert threshold |
|---|---|---|
| Pages per on-call week | < 5 | > 10 |
| Pages outside business hours | < 2/week | > 5/week |
| Time-to-ack (P1) | < 5 min | > 15 min |
| Toil percentage of on-call time | < 50% | > 70% |
Plan capacity
Demand forecasting approach:
1. Baseline: measure current peak RPS, CPU, memory, storage
2. Growth rate: calculate month-over-month traffic growth (last 6 months)
3. Project forward: apply growth rate to 6-month and 12-month horizons
4. Add headroom: 30-50% above projected peak for burst capacity
5. Trigger threshold: the utilization level that kicks off provisioning

Load testing before capacity decisions:
- Define the traffic shape (ramp, steady state, spike)
- Test to 150% of expected peak - find the breaking point before users do
- Measure: latency distribution at load, error rate at load, resource utilization
- Identify the bottleneck (CPU, DB connections, memory) before scaling the wrong thing
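The demand-forecasting steps can be sketched as follows, assuming growth compounds month over month; numbers and names are illustrative:

```python
# Illustrative capacity projection: compound growth plus burst headroom.

def projected_capacity(current_peak_rps: float, monthly_growth: float,
                       months_ahead: int, headroom: float = 0.4) -> float:
    """Projected peak demand plus 30-50% burst headroom (default 40%)."""
    projected_peak = current_peak_rps * (1 + monthly_growth) ** months_ahead
    return projected_peak * (1 + headroom)

# 1000 RPS peak, 5% monthly growth, 12-month horizon -> ~2514 RPS to provision.
print(projected_capacity(1000, 0.05, 12))
```

The same projection at the 6-month horizon is what drives the trigger thresholds in the headroom table: provision when sustained utilization would cross them before capacity can land.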
Headroom planning table:
| Component | Trigger utilization | Target utilization | Action |
|---|---|---|---|
| Compute (CPU) | > 70% sustained | 40-60% | Horizontal scale |
| Memory | > 80% | 50-70% | Vertical scale or tune GC |
| Database (connections) | > 80% pool use | 50-70% | Connection pooler, scale up |
| Storage | > 75% | < 60% | Provision more, archive old data |
| Network throughput | > 70% | < 50% | Scale or upgrade links |
Cost vs reliability trade-off: Headroom is expensive. Justify each component's target with an SLO - a 99.9% availability SLO for a stateless service does not require the same headroom as a 99.99% SLO for a payment processor.
Implement progressive rollouts
Rollout ladder:
0.1% canary (10 min)
-> 1% (30 min, review metrics)
-> 5% (1 hour)
-> 25% (1 hour)
-> 50% (1 hour)
-> 100%

Canary analysis - automatic promotion/rollback criteria:
| Signal | Rollback if | Promote if |
|---|---|---|
| Error rate | Canary > baseline + 0.5% | Canary <= baseline + 0.1% |
| p99 latency | Canary > baseline * 1.2 | Canary <= baseline * 1.05 |
| SLO burn rate | Canary burn rate > 5x | Canary burn rate <= 2x |
| CPU/Memory | Canary > baseline * 1.3 | Within 10% of baseline |
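A minimal sketch of the promote/rollback decision using the error-rate and p99-latency thresholds from the table (the function and parameter names are hypothetical; a real implementation would evaluate all four signals):

```python
# Hypothetical canary analysis decision on two of the table's signals.

def canary_decision(canary_err: float, base_err: float,
                    canary_p99: float, base_p99: float) -> str:
    # Rollback if ANY signal breaches its rollback threshold.
    if canary_err > base_err + 0.005 or canary_p99 > base_p99 * 1.2:
        return "rollback"
    # Promote only if ALL signals are within their promote thresholds.
    if canary_err <= base_err + 0.001 and canary_p99 <= base_p99 * 1.05:
        return "promote"
    return "hold"  # between thresholds: keep traffic level, keep observing
```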
Automated rollback triggers: Instrument your CD pipeline to roll back automatically when error rate or latency breaches the canary threshold. Do not rely on humans to catch canary regressions - the whole point is to automate the decision. If your deployment tool does not support automated rollback, treat that as a toil item to fix.
Feature flags vs canary: Canary deploys test infrastructure changes (binary, container, config). Feature flags test product changes (code paths). Use both. Separate the risk of deploying new infrastructure from the risk of activating new behavior.
Gotchas
SLO window reset on the 1st creates budget gaming - Calendar month windows reset error budget on the 1st regardless of what happened on the 31st. Teams learn to push risky deploys right after reset. Use rolling 30-day windows which are always live and cannot be gamed.
Burn rate alerts with a single window produce too much noise - A 5-minute burn rate alert alone generates pages for transient spikes that self-recover. Multi-window alerting (5-minute AND 1-hour both elevated) dramatically reduces false positives while keeping sensitivity to real incidents.
Toil metrics without a reduction target are just bookkeeping - Measuring toil hours without committing to a reduction target and a sprint allocation to address it creates awareness without action. The measure only has value if it gates a quarterly automation investment.
Canary rollout with no automated rollback is manual canary - A canary that requires a human to notice the error rate spike and manually roll back is not a canary - it is a staged rollout with extra steps. Automated rollback on threshold breach is the defining property; without it, the safety benefit is largely absent.
On-call runbooks that say "escalate to engineering" - A runbook whose resolution step is "page someone else" does not reduce on-call burden; it just shifts it. Every runbook must include at least one concrete mitigation step the on-call can take before escalating.
Anti-patterns / common mistakes
| Mistake | Why it is wrong | What to do instead |
|---|---|---|
| Setting SLOs without historical data | Targets become aspirational fiction, not engineering constraints | Measure current performance first, set SLO at or slightly below it |
| Alerting on resource utilization not SLOs | CPU at 90% may not affect users; 1% error rate definitely does | Alert on SLO burn rate; use resource metrics for capacity planning only |
| Blameful postmortems | Engineers hide problems, avoid risky-but-necessary changes | Explicitly state "no blame" in the template; focus every question on systems |
| Counting toil in hours but not automating it | Creates awareness without action | Budget one sprint per quarter specifically for toil reduction |
| Infinite error budget freezes | Teams freeze deploys forever, killing velocity | Define explicit budget policy with percentage thresholds and time-bounded freezes |
| On-call without runbooks | Every incident requires heroics; knowledge stays in individuals | Treat "alert without runbook" as a blocker; write the runbook during the incident |
References
For detailed guidance on specific domains, load the relevant file from references/:
references/postmortem-template.md- full postmortem template with example entries, facilitation guide, and action item tracker
Only load a references file when the current task requires it.
References
postmortem-template.md
Postmortem Template
A blameless postmortem is a structured learning exercise, not an accountability hearing. The goal is to understand what happened, improve the system, and prevent recurrence. Every question in this template focuses on systems, processes, and tooling - not on individuals.
Facilitation Guide
Before the meeting:
- Designate one facilitator (neutral; preferably not the incident commander)
- Designate one scribe (takes verbatim notes, not the facilitator)
- Share the draft timeline 24 hours before so attendees can correct it
- Invite: incident responders, on-call, service owners, one representative from affected teams
- Block 60-90 minutes; complex incidents need 90
During the meeting:
- Open with: "This is a learning session. There is no blame here."
- Use "the system did X" not "you did X"
- When discussion gets heated: redirect to "what could the system have done differently?"
- Time-box root cause discussion to 30 minutes; spend remaining time on action items
- If someone is defensive, ask: "What would have had to be true for anyone in that position to have done differently?"
After the meeting:
- Publish draft within 24 hours for comment
- Finalize and share within 5 days of incident
- Enter all action items into your issue tracker with owners and due dates
- Review open action items at the start of every sprint retrospective
Postmortem Document
Incident metadata
| Field | Value |
|---|---|
| Incident ID | INC-YYYY-NNN |
| Date and time detected | YYYY-MM-DD HH:MM UTC |
| Date and time resolved | YYYY-MM-DD HH:MM UTC |
| Total duration | X hours Y minutes |
| Severity | SEV1 / SEV2 / SEV3 |
| Incident commander | Name |
| Postmortem owner | Name |
| Postmortem date | YYYY-MM-DD |
| Services affected | List each service |
| Customer impact | Description (e.g., "100% of checkout requests failed") |
| SLO impact | Error budget consumed (e.g., "0.08% of 30-day availability budget") |
Example:
| Field | Value |
|---|---|
| Incident ID | INC-2025-047 |
| Date and time detected | 2025-03-14 14:32 UTC |
| Date and time resolved | 2025-03-14 16:18 UTC |
| Total duration | 1 hour 46 minutes |
| Severity | SEV1 |
| Incident commander | Priya Sharma |
| Postmortem owner | Jordan Lee |
| Postmortem date | 2025-03-16 |
| Services affected | payments-api, order-service |
| Customer impact | 100% of payment attempts failed for 1h 46m |
| SLO impact | Consumed 87% of monthly error budget in one incident |
Summary
One paragraph. What happened, when, who was affected, and how it was resolved. Written for someone who was not involved. Avoid jargon.
Example:
On 2025-03-14 at 14:32 UTC, the payments-api began returning 503 errors for all requests. This was caused by database connection pool exhaustion triggered by a configuration change deployed at 14:15 UTC that reduced the connection pool max size from 100 to 10. All checkout attempts failed for 1 hour 46 minutes until the configuration was reverted at 16:18 UTC. Approximately 14,000 customers were unable to complete purchases during the window.
Timeline
List events in chronological order with UTC timestamps. Include the detection, escalation, diagnosis, and resolution events. Include near-misses and things that helped recovery - not just failures.
| Time (UTC) | Event | Actor |
|---|---|---|
| YYYY-MM-DD HH:MM | | |
| YYYY-MM-DD HH:MM | | |
Example:
| Time (UTC) | Event | Actor |
|---|---|---|
| 2025-03-14 14:15 | Config change deployed: pool_max_connections reduced from 100 to 10 | CI/CD pipeline |
| 2025-03-14 14:32 | SLO burn rate alert fires: 14.4x burn rate on payments-api availability | Alerting system |
| 2025-03-14 14:35 | Primary on-call acks alert, begins investigation | Kenji Tanaka |
| 2025-03-14 14:41 | Error rate confirmed at 100%; incident declared SEV1; incident commander assigned | Kenji Tanaka |
| 2025-03-14 14:48 | Traces show all requests timing out at DB layer | Kenji Tanaka |
| 2025-03-14 15:02 | DB team joins call; connection pool exhaustion confirmed | DB on-call |
| 2025-03-14 15:20 | Root cause identified: pool_max_connections=10 in recent deploy | Priya Sharma |
| 2025-03-14 15:45 | Config revert prepared and reviewed | Kenji Tanaka, Priya Sharma |
| 2025-03-14 16:15 | Config revert deployed | CD pipeline |
| 2025-03-14 16:18 | Error rate returns to < 0.1%; incident resolved | Kenji Tanaka |
| 2025-03-14 16:30 | All-clear communication sent to customer support | Priya Sharma |
Root cause analysis
Ask "why" five times. Each answer becomes the input to the next question. Stop when you reach a systemic cause - something about processes, tooling, or design - not a person.
The five-why chain:
Why did the service fail?
-> Connection pool was exhausted; all DB requests timed out
Why was the connection pool exhausted?
-> pool_max_connections was set to 10, far below the 100 connections needed at peak load
Why was pool_max_connections set to 10?
-> A config change in the deploy reduced it from 100 to 10
Why did the config change ship with an incorrect value?
-> The config value was changed to 10 (not 100) when cleaning up a test environment config,
and no automated validation checked the value range before deploy
Why was there no validation?
-> The configuration system has no schema enforcement or range validation on pool settings

Root cause (systemic): The configuration deployment pipeline lacks validation that enforces minimum and maximum bounds on critical infrastructure parameters. There was no automated guard between an incorrect configuration value and production.
Detection
How was the incident detected? How long after it started? Could it have been caught sooner?
Questions to answer:
- Was the incident detected by an alert, a customer report, or manual discovery?
- How long was the service degraded before detection?
- Did the alert fire at the right severity?
- Was the runbook link in the alert? Was the runbook accurate?
- What could have detected this sooner?
Example:
The burn rate alert fired 17 minutes after the config was deployed. Detection was automated and appropriate. However, the runbook linked from the alert did not include steps for diagnosing connection pool issues - the responder had to search Slack history to find the DB team's contact. The 17-minute window could be reduced: a canary analysis check during deploy could have caught the pool exhaustion before reaching 100% of traffic.
Response
How long did mitigation take? What slowed it down? What helped?
Questions to answer:
- From alert to mitigation: what was the elapsed time? Was that acceptable?
- What information was missing at the start of the investigation?
- Were the right people in the incident call quickly enough?
- Was communication to stakeholders/customers timely?
- What tools or runbooks saved time? What was missing?
Example:
Time-to-mitigate was 1 hour 43 minutes from detection. The longest delay was 35 minutes identifying the root cause, because:
- The connection pool metrics were not on the default service dashboard
- The DB team had to be manually pulled in; no automated escalation path existed for DB-layer incidents
What helped: the incident commander used the standard war room template, which kept communication structured. The config diff was easy to find because all deploys are tagged with a commit hash in the config store.
Impact assessment
Quantify the impact across customer experience, business metrics, and reliability targets.
| Dimension | Impact |
|---|---|
| Users affected | Estimated or exact count |
| Requests failed | Total failed / total expected |
| Revenue impact | Estimate (if applicable) |
| SLO budget consumed | % of monthly budget |
| Secondary systems affected | List |
| Data integrity impact | Any data loss or corruption? |
Example:
| Dimension | Impact |
|---|---|
| Users affected | ~14,200 unique users (based on session counts during window) |
| Requests failed | 847,000 / 847,000 checkout requests (100%) |
| Revenue impact | ~$340,000 GMV blocked (not permanently lost; retried after resolution) |
| SLO budget consumed | 87% of monthly error budget consumed in one incident |
| Secondary systems affected | Order-service (dependent on payments-api); failed gracefully |
| Data integrity impact | No data loss; all failed transactions rolled back cleanly |
Contributing factors
Not the root cause - but conditions that made the incident more likely, more severe, or harder to detect and resolve. Each factor is a separate improvement opportunity.
List each factor as a sentence describing the systemic condition:
- The configuration deployment pipeline had no validation for parameter bounds
- Connection pool metrics were absent from the service's primary dashboard
- The runbook for availability alerts did not cover DB layer diagnosis
- There was no automated canary analysis step in the config deployment pipeline
- The escalation path for DB-layer incidents was not documented
Action items
Each action item must have: a description, an owner (person, not team), a due date, and a clear definition of done. "Improve X" is not an action item.
| ID | Action | Owner | Due date | Status | Definition of done |
|---|---|---|---|---|---|
| AI-001 | | | | Open | |
| AI-002 | | | | Open | |
Example:
| ID | Action | Owner | Due date | Status | Definition of done |
|---|---|---|---|---|---|
| AI-001 | Add schema validation with min/max bounds to config deployment pipeline for all DB connection pool parameters | Jordan Lee | 2025-04-04 | Open | CI pipeline rejects any config with pool_max_connections < 20 or > 500; tested with a deliberately bad config |
| AI-002 | Add connection pool utilization panel (current/max, wait time) to payments-api service dashboard | Kenji Tanaka | 2025-03-28 | Open | Dashboard panel live in Grafana, verified against staging traffic |
| AI-003 | Update availability alert runbook to include DB connection pool diagnosis steps | Kenji Tanaka | 2025-03-21 | Open | Runbook has a "Check DB connection pool" section with commands and expected output |
| AI-004 | Add canary analysis step to config deploy pipeline checking error rate before promoting to 100% | Priya Sharma | 2025-04-18 | Open | Config deploys pause at 5% traffic for 5 minutes; auto-rollback if error rate > baseline + 1% |
| AI-005 | Document DB team escalation path in on-call handbook and link from all DB-related alerts | Priya Sharma | 2025-03-21 | Open | On-call handbook has "DB layer incidents" section; all DB alerts have escalation contact |
What went well
Explicitly document what worked. Reinforcing good practices is as important as fixing gaps. This section prevents the meeting from becoming purely negative.
Example:
- Burn rate alerting fired within 17 minutes of incident start - fast enough for automatic detection
- The incident commander kept a clear timeline in real-time, which made this postmortem significantly easier to write
- All failed transactions rolled back cleanly - no data integrity work required after resolution
- Customer support was notified within 20 minutes of incident declaration and had accurate status updates throughout
Lessons learned
High-level principles the team is taking away. Not action items - these are insights that change how the team thinks. Useful for sharing across teams.
Example:
- Configuration changes are code changes: they need the same validation, review, and canary deployment treatment as binary changes
- Dashboard completeness is an on-call SLA: if it is not on the dashboard, it will not be checked during an incident under pressure
- The postmortem process worked: having a designated incident commander and real-time timeline shortened the postmortem meeting by an estimated 30 minutes
Follow-up review date
Set a date to review the status of action items. Default: 30 days after postmortem.
Next review: YYYY-MM-DD Review owner: Name (verify all action items are complete or have updated owners/dates)
Quick-reference: postmortem checklist
During the incident:
- Designate an incident commander
- Start a shared timeline document immediately
- Note every significant event with a timestamp
Within 24 hours of resolution:
- Draft timeline shared with participants for corrections
- Postmortem owner assigned
- Meeting scheduled within 48 hours
At the meeting:
- Facilitator opens with blameless framing
- Timeline reviewed and finalized
- Root cause chain completed (5 whys)
- Contributing factors listed
- Action items have owners and due dates
Within 5 days:
- Final postmortem published internally
- All action items entered in issue tracker
- Summary shared with affected stakeholders
- SLO impact documented in reliability dashboard
30-day review:
- All action items complete or rescheduled with explanation
- Lessons learned shared with broader engineering organization