internal-docs
Use this skill when writing, reviewing, or improving internal engineering documents - RFCs, design docs, post-mortems, runbooks, and knowledge base articles. Triggers on drafting a design proposal, writing an RFC, creating a post-mortem after an incident, building an operational runbook, organizing team knowledge, or improving existing documentation for clarity and completeness.
internal-docs is a production-ready AI agent skill for claude-code, gemini-cli, openai-codex, and mcp. It helps you write, review, and improve internal engineering documents - RFCs, design docs, post-mortems, runbooks, and knowledge base articles.
Quick Facts
| Field | Value |
|---|---|
| Category | writing |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex, mcp |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
```
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill internal-docs
```
- The internal-docs skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
Internal documentation is the connective tissue of engineering organizations. It captures decisions (RFCs, design docs), preserves operational knowledge (runbooks), extracts lessons from failure (post-mortems), and makes institutional knowledge discoverable (knowledge management). This skill gives an agent the ability to draft, review, and improve internal documents that are clear, actionable, and structured for their specific audience - from a 2-page RFC to a detailed incident post-mortem.
Tags
rfc design-docs post-mortem runbook knowledge-management documentation
Platforms
- claude-code
- gemini-cli
- openai-codex
- mcp
Frequently Asked Questions
What is internal-docs?
internal-docs is a skill for writing, reviewing, and improving internal engineering documents - RFCs, design docs, post-mortems, runbooks, and knowledge base articles. It triggers when you draft a design proposal, write an RFC, create a post-mortem after an incident, build an operational runbook, organize team knowledge, or improve existing documentation.
How do I install internal-docs?
Run `npx skills add AbsolutelySkilled/AbsolutelySkilled --skill internal-docs` in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support internal-docs?
This skill works with claude-code, gemini-cli, openai-codex, and mcp. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Internal Docs
Internal documentation is the connective tissue of engineering organizations. It captures decisions (RFCs, design docs), preserves operational knowledge (runbooks), extracts lessons from failure (post-mortems), and makes institutional knowledge discoverable (knowledge management). This skill gives an agent the ability to draft, review, and improve internal documents that are clear, actionable, and structured for their specific audience - from a 2-page RFC to a detailed incident post-mortem.
When to use this skill
Trigger this skill when the user:
- Wants to write or draft an RFC or design document
- Needs to create a post-mortem or incident review document
- Asks to build an operational runbook or playbook
- Wants to organize or structure a team knowledge base
- Needs to review an existing internal doc for completeness or clarity
- Asks about documentation templates, formats, or best practices
- Wants to write an ADR (Architecture Decision Record)
- Needs to create onboarding documentation or team guides
Do NOT trigger this skill for:
- Public-facing API documentation or developer docs (use api-design skill)
- README files or open-source project documentation (use code-level docs conventions)
Key principles
Write for the reader, not the writer - Every document exists to transfer knowledge to someone else. Identify who will read it (decision-makers, on-call engineers, new hires) and structure for their needs, not your thought process.
Decisions over descriptions - The most valuable internal docs capture the "why" behind choices. A design doc that only describes the solution without explaining alternatives considered and tradeoffs made is incomplete.
Actionability is everything - A runbook that says "investigate the issue" is worthless. A post-mortem without concrete action items is theater. Every document should leave the reader knowing exactly what to do next.
Living documents decay - Docs that aren't maintained become dangerous. Every document needs an owner and a review cadence, or it should be marked with an explicit expiration date.
Structure enables skimming - Engineers don't read docs linearly. Use headers, TL;DRs, tables, and callouts so readers can find what they need in under 30 seconds.
Core concepts
Internal docs fall into four categories, each with a distinct lifecycle and audience:
Decision documents (RFCs, design docs, ADRs) propose a change, gather feedback,
and record the final decision. They flow through draft, review, approved/rejected
states. The audience is peers and stakeholders who need to evaluate the proposal.
See references/rfcs-and-design-docs.md.
Incident documents (post-mortems, incident reviews) are written after something
goes wrong. They reconstruct the timeline, identify root causes, and produce action
items. The audience is the broader engineering org learning from failure. Blamelessness
is non-negotiable. See references/post-mortems.md.
Operational documents (runbooks, playbooks, SOPs) provide step-by-step procedures
for recurring tasks or incident response. The audience is the on-call engineer at
3 AM who needs to fix something fast. See references/runbooks.md.
Knowledge documents (wikis, guides, onboarding docs, team pages) preserve
institutional knowledge. The audience varies but typically includes new team members
and cross-team collaborators. See references/knowledge-management.md.
Common tasks
Draft an RFC
An RFC proposes a significant technical change and invites structured feedback. Use this template structure:
```markdown
# RFC: <Title>
**Author:** <name> **Status:** Draft | In Review | Approved | Rejected
**Created:** <date> **Last updated:** <date>
**Reviewers:** <list> **Decision deadline:** <date>

## TL;DR
<2-3 sentences: what you propose and why>

## Motivation
<What problem does this solve? Why now? What happens if we do nothing?>

## Proposal
<The detailed solution. Include diagrams, data models, API contracts as needed.>

## Alternatives considered
<At least 2 alternatives with honest pros/cons for each>

## Tradeoffs and risks
<What are we giving up? What could go wrong? How do we mitigate?>

## Rollout plan
<How will this be implemented incrementally? Feature flags? Migration?>

## Open questions
<Unresolved items that need input from reviewers>
```

Always include at least two genuine alternatives. A single-option RFC signals the decision was made before the review process started.
Write a post-mortem
Post-mortems extract organizational learning from incidents. Follow a blameless approach - focus on systems and processes, never on individuals.
```markdown
# Post-Mortem: <Incident title>
**Date of incident:** <date> **Severity:** SEV-1 | SEV-2 | SEV-3
**Author:** <name> **Status:** Draft | Review | Final
**Time to detect:** <duration> **Time to resolve:** <duration>

## Summary
<3-4 sentences: what happened, who was affected, and the impact>

## Timeline
| Time (UTC) | Event |
|---|---|
| HH:MM | <what happened> |

## Root cause
<The deepest "why" - use the 5 Whys technique to go beyond symptoms>

## Contributing factors
<Other conditions that made the incident possible or worse>

## What went well
<Things that worked during response - detection, communication, tooling>

## What went poorly
<Process or system gaps exposed by the incident>

## Action items
| Action | Owner | Priority | Due date | Status |
|---|---|---|---|---|
| <specific action> | <name> | P0/P1/P2 | <date> | Open |
```

Every action item must be specific, assigned, and dated. "Improve monitoring" is not an action item. "Add latency p99 alert on checkout service at 500ms threshold" is.
Create a runbook
Runbooks provide step-by-step procedures for operational tasks. Write them for the worst case: an engineer who has never seen this system, at 3 AM, under stress.
````markdown
# Runbook: <Procedure name>
**Owner:** <team> **Last verified:** <date>
**Estimated time:** <duration> **Risk level:** Low | Medium | High

## When to use
<Trigger conditions - what alert, symptom, or request leads here>

## Prerequisites
- [ ] Access to <system>
- [ ] Permissions: <specific roles or credentials needed>

## Steps
### Step 1: <Action>
<Exact command or UI action. No ambiguity.>
```bash
kubectl get pods -n production -l app=checkout
```
Expected output: <what the output should look like>

### Step 2: <Action>
...

## Rollback
<How to undo the changes if the procedure fails partway>

## Escalation
<Who to contact if the runbook doesn't resolve the issue>
````

> Test every runbook by having someone unfamiliar with the system follow it.
> If they get stuck, the runbook is incomplete.
Write an Architecture Decision Record (ADR)
ADRs are lightweight, immutable records of a single architectural decision.
```markdown
# ADR-<NNN>: <Decision title>
**Status:** Proposed | Accepted | Deprecated | Superseded by ADR-<NNN>
**Date:** <date> **Deciders:** <names>
## Context
<What forces are at play? What constraint or opportunity triggered this decision?>
## Decision
<The change we are making. State it clearly in one paragraph.>
## Consequences
<What becomes easier? What becomes harder? What are the risks?>
```

ADRs are append-only. If a decision is reversed, write a new ADR that supersedes the old one. Never edit a finalized ADR.
Review an existing document for quality
Walk through the doc checking these dimensions in order:
- Audience - Is it clear who this is for? Does the depth match their expertise?
- Structure - Can a reader find what they need by skimming headers?
- Completeness - Are there gaps that will generate questions?
- Actionability - Does the reader know what to do after reading?
- Freshness - Is the information current? Are there stale references?
- Conciseness - Can anything be cut without losing meaning?
Organize a knowledge base
Structure team knowledge around these four categories (adapted from Divio):
| Category | Purpose | Example |
|---|---|---|
| Tutorials | Learning-oriented, step-by-step | "Setting up local dev environment" |
| How-to guides | Task-oriented, problem-solving | "How to deploy a canary release" |
| Reference | Information-oriented, accurate | "API rate limits by tier" |
| Explanation | Understanding-oriented, context | "Why we chose event sourcing" |
Avoid dumping all docs into a flat wiki. Tag documents by category, team, and system so they remain discoverable as the org scales.
Anti-patterns / common mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Wall of text | No headers, no TL;DR, no structure - nobody will read it | Add TL;DR upfront, use headers every 3-5 paragraphs, use tables for structured data |
| Blame in post-mortems | Naming individuals creates fear and suppresses honest reporting | Focus on system and process failures. "The deploy pipeline lacked a canary step" not "Bob deployed without checking" |
| Runbook with "use judgment" | On-call engineers under stress cannot exercise judgment on unfamiliar systems | Provide explicit decision trees with concrete thresholds |
| RFC without alternatives | Signals the decision is already made and review is theater | Always include 2+ genuine alternatives with honest tradeoffs |
| Stale documentation | Outdated docs are worse than no docs - they build false confidence | Set review dates, assign owners, archive aggressively |
| Copy-paste templates | Filling a template mechanically without adapting to context | Templates are starting points - remove irrelevant sections, add context-specific ones |
| No action items | Post-mortems and reviews that identify problems but assign no follow-up | Every identified gap must produce a specific, assigned, dated action item |
Gotchas
RFCs without a decision deadline stay in "review" forever - An RFC without a deadline becomes a perpetual discussion that blocks implementation. Always set a concrete decision deadline (typically 1-2 weeks) in the frontmatter, and explicitly close the RFC as Approved or Rejected on that date even if not everyone has commented.
Post-mortems written more than a week after the incident lose critical detail - Memory degrades fast. Timelines reconstructed from memory a week later miss key decision points and often misattribute causality. The IC should assign a post-mortem owner and require a draft timeline within 24 hours of resolution, even if the full document takes 5 days.
ADRs edited retroactively destroy the historical record - An ADR is only valuable as a record of what was decided and why at a specific point in time. If you update an ADR to reflect a changed decision, future readers can't distinguish the original context from the revision. Write a new ADR that supersedes the old one; mark the old one "Superseded by ADR-NNN".
Runbooks with "check the dashboard" as a step fail at 3 AM - "Check the monitoring dashboard" is not a runbook step. A runbook step specifies which dashboard, which panel, what a normal reading looks like, and what to do if it's abnormal. Vague steps require context the on-call engineer won't have. Every step needs a specific action, an expected result, and a failure path.
Wiki pages without owners decay into organizational memory holes - A wiki page written once and never reviewed will be confidently wrong within 6-12 months for any actively developed system. Every page needs a named owner and a "Last verified" date. Unmaintained pages should be archived, not left as false ground truth.
References
For detailed content on specific document types, read the relevant file from
references/:
- `references/rfcs-and-design-docs.md` - Deep guide on RFC lifecycle, review processes, and design doc patterns
- `references/post-mortems.md` - Blameless post-mortem methodology, 5 Whys technique, and severity frameworks
- `references/runbooks.md` - Runbook authoring patterns, testing procedures, and maintenance workflows
- `references/knowledge-management.md` - Knowledge base organization, documentation culture, and tooling strategies
Only load a references file if the current task requires deep detail on that topic.
References
knowledge-management.md
Knowledge Management
The documentation quadrant
Adapt the Divio documentation framework to organize internal knowledge into four distinct categories, each with its own purpose, style, and audience:
| Category | Orientation | Written like | Example |
|---|---|---|---|
| Tutorials | Learning | A lesson (do this, then this) | "Your first week: setting up local dev" |
| How-to guides | Task | A recipe (to accomplish X, do Y) | "How to deploy a canary release" |
| Reference | Information | An encyclopedia (X is defined as...) | "Service catalog with owners and SLAs" |
| Explanation | Understanding | An essay (the reason we chose X is...) | "Why we migrated from monolith to microservices" |
Why this matters
The most common documentation problem is mixing categories. A tutorial that stops to explain architectural history loses the reader. A reference page that tries to teach concepts becomes unusable for quick lookups. Keep categories separate and link between them.
Information architecture
Organizing by system, not by team
Teams change. Systems persist. Organize documentation around systems and domains, not org chart structure.
Bad structure:
```
/docs
  /backend-team
  /frontend-team
  /platform-team
```

Good structure:
```
/docs
  /services
    /checkout-service
    /user-service
    /payment-gateway
  /infrastructure
    /kubernetes
    /databases
    /ci-cd
  /guides
    /onboarding
    /deployment
    /incident-response
  /decisions
    /rfcs
    /adrs
```

Naming conventions
- Use lowercase with hyphens: `checkout-service-runbook.md` not `CheckoutServiceRunbook.md`
- Prefix with document type when browsing matters: `rfc-002-new-auth-flow.md`
- Include dates for time-sensitive docs: `2024-03-postmortem-checkout-outage.md`
- Use index files for directories: every folder gets a `README.md` or `index.md`
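These conventions are easy to enforce in CI before a doc change merges. A minimal Python sketch checking the lowercase-hyphen rule (the regex is an assumption inferred from the examples above, not a standard):

```python
import re

# Lowercase alphanumeric words separated by single hyphens, ending in .md
# (matches the conventions above, e.g. checkout-service-runbook.md).
FILENAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*\.md$")

def check_doc_filename(name: str) -> bool:
    """Return True if a doc filename follows the lowercase-hyphen convention."""
    return FILENAME_PATTERN.fullmatch(name) is not None
```

Dated and prefixed names pass unchanged, since `2024`, `03`, and `rfc` are all valid lowercase-alphanumeric segments: `check_doc_filename("2024-03-postmortem-checkout-outage.md")` and `check_doc_filename("rfc-002-new-auth-flow.md")` both return True, while `check_doc_filename("CheckoutServiceRunbook.md")` returns False.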
Search and discoverability
Documentation that can't be found doesn't exist. Improve discoverability with:
- Tags/labels - Consistent tagging taxonomy across all docs (system, team, type)
- Cross-links - Every doc links to related docs. Post-mortems link to runbooks. RFCs link to ADRs. How-to guides link to reference pages.
- A single entry point - One "documentation home" page that links to all categories with brief descriptions
- Search optimization - Use descriptive titles, include synonyms in the first paragraph, use standard terminology
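A consistent tagging taxonomy only pays off if it can be queried. One way to make tags searchable is to invert a per-doc tag list into a tag-to-docs index; a minimal sketch (the doc paths and tags here are hypothetical examples):

```python
from collections import defaultdict

def build_tag_index(docs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert a {doc_path: [tags]} mapping into {tag: [doc_paths]}."""
    index: dict[str, list[str]] = defaultdict(list)
    for path, tags in docs.items():
        for tag in tags:
            index[tag].append(path)
    return dict(index)

# Hypothetical docs tagged by system and type:
index = build_tag_index({
    "services/checkout-service/runbook.md": ["checkout", "runbook"],
    "decisions/2024-03-postmortem.md": ["checkout", "post-mortem"],
})
# index["checkout"] now lists both docs touching the checkout system.
```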
Documentation culture
The documentation-as-code approach
Treat docs like code:
- Version controlled - Store in git alongside the code they describe
- Reviewed - Documentation changes go through PR review
- Tested - Links are checked, code examples are validated
- Deployed - Published automatically via CI/CD to a docs site
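The "links are checked" step can be automated in a few lines. A minimal sketch that finds relative markdown links and verifies the targets exist on disk (the regex is simplified; a real checker would also validate anchors and external URLs):

```python
import re
from pathlib import Path

# Matches [text](target) markdown links; deliberately simple.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)]+)\)")

def find_broken_links(doc: Path) -> list[str]:
    """Return relative link targets in a markdown doc that don't resolve to a file."""
    broken = []
    for target in LINK_PATTERN.findall(doc.read_text()):
        if target.startswith(("http://", "https://", "#", "mailto:")):
            continue  # external links and in-page anchors need a different check
        path = target.split("#")[0]  # drop any fragment
        if path and not (doc.parent / path).exists():
            broken.append(target)
    return broken
```

Running this over every `.md` file in CI and failing the build on a non-empty result turns link rot from a silent decay into a visible review comment.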
Making documentation a habit
Documentation doesn't happen by default. Build it into workflows:
| Trigger | Documentation action |
|---|---|
| New feature merged | Update or create how-to guide |
| Architecture decision made | Write an ADR |
| Incident resolved | Write post-mortem within 48 hours |
| New team member joins | Note gaps in onboarding docs and fix them |
| Quarterly review | Audit and archive stale docs |
Ownership model
Every document needs an owner. Without ownership, docs rot.
| Ownership model | How it works | Best for |
|---|---|---|
| Individual owner | One person responsible for keeping it current | ADRs, RFCs, post-mortems |
| Team owner | A team collectively maintains a set of docs | Service docs, runbooks |
| Rotating owner | Ownership rotates on a schedule | Knowledge base sections, onboarding |
Reducing friction
The biggest enemy of documentation is friction. Reduce it with:
- Templates - Pre-built templates for every doc type (RFC, post-mortem, runbook, ADR)
- Automation - Auto-generate reference docs from code (API specs, config schemas)
- Low ceremony - A rough doc today is better than a perfect doc never written
- Visible wins - Celebrate when documentation prevents an incident or speeds up onboarding
Onboarding documentation
The 30-60-90 day guide
Structure onboarding docs around what a new hire needs at each stage:
Week 1 (Day 1-5): Setup and orientation
- Local dev environment setup (step-by-step, tested monthly)
- Key tools and access: list every tool with how to request access
- Team norms: communication channels, meeting schedule, PR conventions
- Architecture overview: one-page system diagram with brief descriptions
Month 1 (Day 6-30): First contributions
- "Good first issues" labeled in the issue tracker
- Code walkthrough of the main service the team owns
- Key contacts: who to ask about what
- Common workflows: how to deploy, how to run tests, how to debug
Month 2-3 (Day 31-90): Independence
- On-call training and runbook orientation
- Deep dives into complex subsystems
- Cross-team collaboration guides
- Contributing to documentation themselves (closing the loop)
Managing documentation debt
Documentation audit checklist
Run quarterly:
- Identify docs with "Last updated" > 6 months ago
- Check all links for 404s (automate this)
- Verify code examples still compile/run
- Remove docs for decommissioned systems
- Merge duplicate docs covering the same topic
- Update ownership for docs owned by people who left
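The first checklist item can be automated. A minimal sketch that flags stale docs, assuming each doc carries a `**Last verified:** YYYY-MM-DD` line as in the runbook template (the 180-day threshold approximates the "6 months" rule above):

```python
import re
from datetime import date, timedelta

VERIFIED_PATTERN = re.compile(r"\*\*Last verified:\*\*\s*(\d{4}-\d{2}-\d{2})")

def is_stale(doc_text: str, today: date, max_age_days: int = 180) -> bool:
    """Flag a doc whose 'Last verified' date is older than max_age_days.

    Docs with no date at all are treated as stale - an undated doc
    cannot prove it is current.
    """
    m = VERIFIED_PATTERN.search(doc_text)
    if m is None:
        return True
    verified = date.fromisoformat(m.group(1))
    return (today - verified) > timedelta(days=max_age_days)
```

A quarterly audit script can walk the docs tree with this check and open a ticket per stale page, assigned to the page's owner.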
The archive decision
Not everything needs to be kept current. Use this framework:
| Condition | Action |
|---|---|
| Doc describes a current system/process | Keep and maintain |
| Doc describes a deprecated system still in use | Mark as "legacy" with migration pointer |
| Doc describes a decommissioned system | Archive (move to /archive, keep for history) |
| Doc is a finalized decision record (RFC, ADR) | Keep as-is, never edit (it's a historical record) |
| Doc is a duplicate of another, better doc | Redirect to the canonical version and delete |
Tooling recommendations
Choose tools that reduce friction and support the documentation-as-code approach:
| Need | Recommended approach |
|---|---|
| Engineering docs | Markdown in git repo, rendered via static site generator |
| Runbooks | Markdown in git, linked from alert definitions |
| RFCs/ADRs | Markdown in a dedicated /decisions directory in the main repo |
| Knowledge base | Wiki tool (Notion, Confluence) or git-based wiki |
| API reference | Auto-generated from OpenAPI/GraphQL schema |
| Diagrams | Mermaid or PlantUML in markdown (version-controlled, no binary files) |
post-mortems.md
Post-Mortems
Blameless culture
Blameless post-mortems are the foundation of a learning organization. The principle is simple: people make mistakes because systems allow those mistakes to happen. Blaming individuals creates fear, suppresses reporting, and prevents the organization from fixing systemic issues.
Blameless does NOT mean:
- Nobody is accountable
- No changes are needed
- The incident was acceptable
Blameless DOES mean:
- We focus on system and process failures, not individual errors
- We assume everyone acted with the best information available at the time
- We create an environment where people report honestly without fear
Language guide
| Instead of this | Write this |
|---|---|
| "Engineer X deployed the bad config" | "A configuration change was deployed that..." |
| "The team failed to catch the bug" | "The testing pipeline did not include a check for..." |
| "Nobody noticed the alert" | "The alert was not routed to the on-call channel" |
| "The developer should have known" | "The documentation did not cover this failure mode" |
Severity framework
Use a consistent severity scale so post-mortems are prioritized appropriately:
| Severity | Criteria | Post-mortem required? | Review meeting? |
|---|---|---|---|
| SEV-1 | Complete service outage, data loss, or security breach affecting all users | Yes, within 48 hours | Yes, org-wide |
| SEV-2 | Partial outage or degradation affecting >10% of users for >30 minutes | Yes, within 1 week | Yes, team + stakeholders |
| SEV-3 | Minor degradation, brief outage, or issue affecting <10% of users | Optional but recommended | Team-only if written |
| SEV-4 | Near-miss or caught before user impact | Optional - consider a brief writeup | No |
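The criteria in this table can be approximated in code, for example to auto-suggest a severity when an incident ticket is filed. A simplified sketch (thresholds taken from the table above; the inputs are a deliberate simplification, and final classification still needs human judgment):

```python
def classify_severity(pct_users_affected: float, duration_minutes: float,
                      complete_outage: bool = False, user_impact: bool = True) -> str:
    """Suggest a SEV level from simplified impact inputs, per the table above."""
    if complete_outage:
        return "SEV-1"  # full outage, data loss, or breach
    if not user_impact:
        return "SEV-4"  # near-miss caught before users noticed
    if pct_users_affected > 10 and duration_minutes > 30:
        return "SEV-2"  # partial outage, >10% of users, >30 minutes
    return "SEV-3"      # minor degradation or brief/limited impact
```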
The 5 Whys technique
The 5 Whys is a root cause analysis method. Start with the symptom and ask "why" repeatedly until you reach a systemic cause.
Example:
- Why did users see 500 errors? Because the checkout service was returning errors.
- Why was the checkout service returning errors? Because it couldn't connect to the database.
- Why couldn't it connect? Because the connection pool was exhausted.
- Why was the pool exhausted? Because a long-running query was holding connections.
- Why was there a long-running query? Because the new report feature had no query timeout configured, and the ORM generated an unindexed full-table scan.
Root cause: Missing query timeout configuration and no index on the report query.
5 Whys pitfalls
- Stopping too early - "The config was wrong" is a symptom, not a root cause. Ask why the wrong config was possible.
- Single-threading - Most incidents have multiple contributing causes. Branch your "whys" when you hit a fork.
- Going too deep - "Why do humans make mistakes?" is too philosophical. Stop when you reach something you can fix with a concrete action.
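The branching advice can be made concrete by modeling the analysis as a tree rather than a single chain: each "why" can fan out into several answers, and the leaves are the candidate root causes. A minimal sketch (the dict-based tree shape is an illustrative convention, not a standard format):

```python
def leaf_causes(node: dict) -> list[str]:
    """Collect the deepest answers in a branched 5 Whys tree.

    A node is {"answer": str, "whys": [child nodes]}; leaves (empty
    "whys") are the candidate root causes to fix with action items.
    """
    if not node.get("whys"):
        return [node["answer"]]
    causes = []
    for child in node["whys"]:
        causes.extend(leaf_causes(child))
    return causes
```

Applied to the checkout example, the "pool exhausted" node would branch into both "no query timeout configured" and "unindexed full-table scan", and `leaf_causes` would surface both, matching the two-part root cause above.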
Timeline writing
The timeline is the backbone of a post-mortem. Write it in UTC with this format:
## Timeline (all times UTC)
| Time | Event |
|---|---|
| 14:00 | Deploy of checkout-service v2.4.1 begins via CI/CD pipeline |
| 14:03 | Deploy completes. No alerts triggered |
| 14:15 | Monitoring shows p99 latency increase from 200ms to 1200ms |
| 14:18 | PagerDuty alert fires: "checkout-service latency > 1000ms" |
| 14:20 | On-call engineer acknowledges alert, begins investigation |
| 14:25 | Engineer identifies database connection pool exhaustion in metrics |
| 14:32 | Decision to rollback deploy. Rollback initiated |
| 14:38 | Rollback complete. Latency returns to normal within 2 minutes |
| 14:40 | Incident declared resolved |

Timeline tips
- Include detection time, investigation steps, decision points, and resolution
- Note who was involved at each step (by role, not name, in a blameless doc)
- Include things that didn't work ("Attempted restart, no improvement")
- Be precise about timing - round to the nearest minute
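The "Time to detect" and "Time to resolve" fields in the post-mortem header can be computed directly from the timeline's HH:MM entries. A minimal sketch, assuming all timestamps fall on the same UTC day (incidents spanning midnight would need full datetimes):

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# From the example timeline: deploy began 14:00, alert fired 14:18,
# incident declared resolved 14:40.
time_to_detect = minutes_between("14:00", "14:18")
time_to_resolve = minutes_between("14:18", "14:40")
```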
Action items that stick
The most common failure of post-mortems is action items that never get completed. Make them stick with this framework:
The SMART action item
| Component | Rule | Bad example | Good example |
|---|---|---|---|
| Specific | Describes exactly what to build/change | "Improve monitoring" | "Add p99 latency alert at 500ms on /checkout endpoint" |
| Measurable | Has a clear definition of done | "Better testing" | "Add integration test covering the N+1 query path" |
| Assigned | Has a single owner (not a team) | "Backend team" | "Alice (backend)" |
| Realistic | Achievable within the given timeframe | "Rewrite the service" | "Add query timeout of 30s to report queries" |
| Time-bound | Has a due date | "Soon" | "By 2024-03-15" |
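These checks can be partially automated as a lint pass over action items before the review meeting. A heuristic sketch (the vague-word list is illustrative and deliberately incomplete; it catches obvious offenders like "Improve monitoring", not every weak item):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    description: str
    owner: str            # a single person, not a team
    priority: str         # "P0", "P1", or "P2"
    due: Optional[date]   # concrete due date

VAGUE_WORDS = {"improve", "better", "enhance", "investigate", "look into"}

def smart_violations(item: ActionItem) -> list[str]:
    """Return reasons an action item fails the SMART checks above."""
    problems = []
    if any(w in item.description.lower() for w in VAGUE_WORDS):
        problems.append("not specific: contains vague wording")
    if not item.owner or " team" in item.owner.lower():
        problems.append("not assigned to a single owner")
    if item.priority not in {"P0", "P1", "P2"}:
        problems.append("missing or invalid priority")
    if item.due is None:
        problems.append("not time-bound: no due date")
    return problems
```

An empty result doesn't prove an item is good, but a non-empty one reliably flags items that would otherwise be rubber-stamped in the meeting.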
Priority categories
- P0 - Immediate (within 1 week): Prevents recurrence of this exact incident
- P1 - Soon (within 1 month): Reduces blast radius or improves detection
- P2 - Planned (within 1 quarter): Systemic improvements and tech debt
Tracking action items
- Create tickets in the team's issue tracker immediately after the post-mortem meeting
- Link tickets back to the post-mortem document
- Review completion status in weekly team standup until all P0/P1 items are done
- Include action item completion rate in quarterly engineering metrics
Post-mortem review meeting
Agenda (45-60 minutes)
- Summary and timeline walkthrough (10 min) - Author presents the incident
- Root cause discussion (15 min) - Group validates or challenges the analysis
- Action item review (15 min) - Assign owners and priorities for each item
- Process improvements (5 min) - Meta-discussion on the incident response itself
- Closing (5 min) - Confirm action items and document owner
Facilitation rules
- The facilitator is NOT the author - get a neutral party
- Redirect any blame to systems: "Let's focus on what the system allowed to happen"
- Time-box tangents: "Great point, let's capture that as an action item"
- End with appreciation: thank the responders and the author
rfcs-and-design-docs.md
RFCs and Design Docs
RFC vs Design Doc vs ADR
These terms are often used interchangeably but serve different purposes:
| Document | Scope | Lifespan | Decision weight |
|---|---|---|---|
| RFC (Request for Comments) | Cross-team or org-wide changes | Weeks of review | High - needs broad consensus |
| Design doc | Single team or feature | Days to a week of review | Medium - team lead approval |
| ADR (Architecture Decision Record) | One specific decision | Written once, never edited | Low - records what was decided |
Rule of thumb: If the change affects more than one team's codebase or introduces a new technology, it's an RFC. If it's a complex feature within one team, it's a design doc. If it's a focused architectural choice, it's an ADR.
RFC lifecycle
```
Draft -> In Review -> Approved / Rejected / Withdrawn
      \-> Needs Revision -> In Review (loop)
```

Draft phase
- Author writes the initial proposal using the RFC template
- Share early with 1-2 trusted reviewers for a "pre-review" before formal circulation
- Include a decision deadline (typically 1-2 weeks from circulation)
Review phase
- Circulate to all stakeholders via the team's standard channel (email, Slack, doc comments)
- Reviewers leave inline comments on specific sections
- Author responds to every comment - either incorporate feedback or explain why not
- Schedule a synchronous review meeting only if async comments reveal fundamental disagreements
Decision phase
- The designated decision-maker (tech lead, architect, or committee) makes the call
- Document the decision and reasoning at the top of the RFC
- If rejected, explain why clearly - rejected RFCs are valuable institutional knowledge
Writing effective motivation sections
The motivation section is the most important part of an RFC. It must answer three questions:
What problem exists today? Describe the pain concretely with data if possible. "API latency p99 has increased from 200ms to 800ms over the last quarter due to N+1 queries in the order service" is better than "performance is degrading."
Why does it matter? Connect the problem to business or engineering outcomes. "This latency increase has caused a 12% drop in checkout completion rate."
Why now? Explain the urgency. Is there a deadline, a scaling cliff, or a dependency that makes this the right time?
Alternatives section best practices
The alternatives section proves you've done your homework. For each alternative:
- Name it clearly - "Alternative A: Migrate to GraphQL" not "Another option"
- Give it a fair shot - Describe it as if you were proposing it
- List honest pros and cons - If an alternative has no pros, you haven't thought hard enough
- Explain why you didn't choose it - Be specific about the deciding factor
Minimum: 2 alternatives. If you can only think of one alternative ("do nothing"), you haven't explored the solution space enough.
Design doc specifics
Design docs are lighter than RFCs. Key differences:
- Shorter review cycle - 2-5 days instead of 1-2 weeks
- Narrower audience - Team members and direct stakeholders
- More implementation detail - Include API schemas, data models, sequence diagrams
- Less process - No formal approval committee, team lead signs off
Design doc template additions (beyond RFC template)
```markdown
## API design
<Endpoint definitions, request/response schemas>

## Data model
<Schema changes, new tables/collections, migration plan>

## Sequence diagram
<Key flows showing component interactions>

## Testing strategy
<How will this be tested? Unit, integration, E2E coverage plan>

## Observability
<What metrics, logs, and alerts will be added?>
```

Review etiquette
For reviewers
- Be specific - "This doesn't handle the case where X" is useful. "I don't like this" is not.
- Distinguish blocking vs non-blocking - Prefix with "Blocking:" or "Nit:" or "Question:"
- Suggest, don't prescribe - "Have you considered X?" not "You should do X"
- Focus on the proposal, not the person - "This approach has a scalability risk" not "You didn't think about scale"
- Respond within the deadline - No response is implicit approval in most RFC processes
For authors
- Respond to every comment - Even if just "Acknowledged, updated section 3"
- Don't be defensive - Reviewers are improving the proposal, not attacking you
- Update the doc, not just the comment thread - The doc is the source of truth
- Call out material changes - If review feedback significantly changes the proposal, re-notify reviewers
Common RFC anti-patterns
| Anti-pattern | Problem | Fix |
|---|---|---|
| The Novel | 20+ page RFC that nobody reads | Keep to 3-5 pages. Move detail to appendices |
| The Fait Accompli | RFC written after implementation started | Write the RFC first. If urgent, be transparent that implementation is underway |
| The Straw Man | Alternatives listed are obviously terrible to make the proposal look good | Include genuinely competitive alternatives |
| The Infinite Review | RFC stays "In Review" for months | Set a hard deadline. No decision by deadline = author's proposal wins |
| The Ghost RFC | Approved but never referenced again | Link to RFC from implementation PRs and ADRs |
| Missing constraints | No mention of timeline, budget, team capacity | Include a "Constraints" section covering real-world limits |
# Runbooks (`runbooks.md`)
## What makes a good runbook
A runbook is a set of step-by-step instructions for performing an operational task. The golden test: can an engineer who has never seen this system follow the runbook at 3 AM and resolve the issue without calling anyone?
### The three rules
1. **No ambiguity** - Every step has one interpretation. "Check the logs" fails. "Run `kubectl logs -n production deployment/checkout --tail=100`" succeeds.
2. **No assumed knowledge** - Don't assume the reader knows where things are, what tools to use, or what "normal" looks like. Show expected output.
3. **No dead ends** - Every step has a "what if this doesn't work" path. The reader should never be stuck without a next action.
## Runbook anatomy

````markdown
# Runbook: <Descriptive title>

**Owner:** <team name>
**Last verified:** <date someone actually ran through this>
**Estimated time:** <realistic duration>
**Risk level:** Low | Medium | High
**Related alerts:** <list PagerDuty/Datadog alert names that lead here>

## When to use
<Specific trigger conditions. What alert fires? What symptom appears? What request
comes in? Be explicit about when this runbook applies AND when it does not.>

## Prerequisites
- [ ] Access to <specific system> via <specific tool>
- [ ] Permissions: <specific IAM role, kubectl context, or credentials>
- [ ] Tools installed: <specific CLI tools with version requirements>

## Steps

### Step 1: <Verb phrase describing the action>
<Exact command or UI navigation path>

```bash
<command to run>
```

Expected output:
<what the output should look like when things are working>

If this fails:
- If you see <error message>: <next action>
- If the command hangs: <next action>
- If output differs from expected: <next action>

### Step 2: ...

## Verification
<How to confirm the procedure worked. Specific checks, not "verify it's working.">

## Rollback
<How to undo the change if the procedure makes things worse.>

## Escalation
| Condition | Contact | Channel |
|---|---|---|
| Steps don't resolve the issue | <PagerDuty/Slack> | <channel> |
| Data loss suspected | <team lead + data team> | <phone + Slack> |
| Customer-facing for >30 min | <contact> | <channel> |
````
## Writing effective steps
### Command formatting
Always provide complete, copy-pasteable commands:
**Bad:** "Check the pod status"
**Good:**
```bash
# Check pod status in the production namespace
kubectl get pods -n production -l app=checkout-service
# Expected: All pods should show STATUS=Running, READY=1/1
# If any pod shows CrashLoopBackOff, proceed to Step 3
```

### Decision points
When a step requires judgment, provide explicit decision trees:
### Step 3: Assess memory usage
```bash
kubectl top pods -n production -l app=checkout-service
```

Decision tree:
- Memory usage < 80%: This is not a memory issue. Skip to Step 5.
- Memory usage 80-95%: Proceed to Step 4 (graceful restart).
- Memory usage > 95%: Proceed to Step 4 with the `--force` flag.
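A decision tree like this can also be encoded directly, which removes all judgment at 3 AM. A minimal sketch (the `next_step` helper is hypothetical; the thresholds are taken from the tree above):

```shell
# Map observed memory usage (integer percent) to the runbook's next action.
next_step() {
  local pct=$1
  if [ "$pct" -lt 80 ]; then
    echo "Step 5: not a memory issue"
  elif [ "$pct" -le 95 ]; then
    echo "Step 4: graceful restart"
  else
    echo "Step 4: restart with --force"
  fi
}

next_step 92   # -> Step 4: graceful restart
```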
### Variable substitution
When commands need environment-specific values, use clear placeholders:
```bash
# Replace <ENVIRONMENT> with: production | staging
# Replace <POD_NAME> with the output of Step 1
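# A hypothetical alternative (a sketch, not part of the original template):
# resolve the placeholders up front so the command below can be pasted as-is
# with "$POD_NAME" / "$ENVIRONMENT". Assumes the app=checkout-service label
# selector from Step 1.
ENVIRONMENT=staging    # or: production
POD_NAME=$(kubectl get pods -n "$ENVIRONMENT" -l app=checkout-service \
  -o jsonpath='{.items[0].metadata.name}')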
kubectl exec -it <POD_NAME> -n <ENVIRONMENT> -- /bin/sh
```

## Runbook categories
| Category | Purpose | Example |
|---|---|---|
| Incident response | Fix a specific failure mode | "Database connection pool exhausted" |
| Maintenance | Perform scheduled operations | "Monthly certificate rotation" |
| Provisioning | Set up new resources | "Onboard a new microservice" |
| Troubleshooting | Diagnose an unknown issue | "High latency investigation flowchart" |
| Recovery | Restore from a failure state | "Restore database from backup" |
## Testing and maintenance
### Verification schedule
Runbooks must be tested regularly. An untested runbook is a liability.
| Risk level | Verification frequency | Method |
|---|---|---|
| High (incident response) | Monthly | Dry run or game day exercise |
| Medium (maintenance) | Quarterly | Execute during scheduled maintenance |
| Low (provisioning) | Every use | Verify steps match current tooling |
### Testing methods
- **Dry run** - Walk through the runbook without executing destructive steps. Verify all commands are valid and outputs match expectations.
- **Shadow execution** - Run the procedure in a staging environment that mirrors production.
- **Game day** - Schedule a simulated incident where a team member follows the runbook under realistic conditions.
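Part of a dry run can be automated: syntax-check every `bash` block in a runbook without executing anything. A sketch (the `dry_run_check` helper is hypothetical; it assumes standard triple-backtick fences):

```shell
# Extract all ```bash blocks from a runbook and parse-check them with bash -n.
# Nothing is executed; a nonzero exit means at least one command won't parse.
dry_run_check() {
  local runbook=$1
  awk '/^```bash$/{f=1; next} /^```$/{f=0} f' "$runbook" > /tmp/runbook_cmds.sh
  bash -n /tmp/runbook_cmds.sh
}
```

This catches typos introduced by edits (a broken quote, a missing `fi`) long before anyone runs the runbook in anger.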
### Maintenance workflow
```
Runbook created -> Author verifies -> Peer review -> Published
      ^                                                  |
      |                                                  v
      +-- Update needed <-- Quarterly review <-- In use <-+
```

After each use, the executor should:
- Note any steps that were unclear, incorrect, or missing
- Update the runbook immediately (while context is fresh)
- Update the "Last verified" date
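The last of these steps is easy to forget, so it is worth scripting. A sketch (the `bump_last_verified` helper is hypothetical; it assumes the `**Last verified:**` header line from the anatomy template and GNU `sed -i`):

```shell
# Rewrite the "**Last verified:**" header line to today's date, in place.
bump_last_verified() {
  sed -i "s/^\*\*Last verified:\*\* .*/**Last verified:** $(date +%F)/" "$1"
}
```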
### Staleness indicators
Flag a runbook for review if:
- "Last verified" is more than 3 months old
- The system it covers has had a major version change
- An incident occurred where the runbook was followed but didn't resolve the issue
- A team member reports confusion during execution
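The first indicator can be checked mechanically. A sketch (the `is_stale` helper is hypothetical; it assumes ISO `YYYY-MM-DD` dates, which compare correctly as plain strings, GNU `date`, and 90 days as the "3 months" threshold):

```shell
# Succeeds (exit 0) when the given "Last verified" date is over ~3 months old.
is_stale() {
  local last_verified=$1                 # e.g. 2024-01-15
  local cutoff
  cutoff=$(date -d '90 days ago' +%F)    # GNU date; on macOS: date -v-90d +%F
  [ "$last_verified" \< "$cutoff" ]
}
```

Run over a runbook directory in CI, this turns the staleness policy into an automated nag rather than a quarterly memory test.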
## Linking runbooks to alerts
Every production alert should link to a runbook. In the alert definition:
```yaml
# PagerDuty / Datadog / Grafana alert annotation
annotations:
  runbook_url: "https://wiki.internal/runbooks/checkout-high-latency"
  summary: "Checkout service p99 latency > 1000ms for 5 minutes"
```

This ensures the on-call engineer sees the runbook link immediately when paged, reducing mean time to resolution (MTTR).
## Frequently Asked Questions

### What is internal-docs?
Use this skill when writing, reviewing, or improving internal engineering documents - RFCs, design docs, post-mortems, runbooks, and knowledge base articles. Triggers on drafting a design proposal, writing an RFC, creating a post-mortem after an incident, building an operational runbook, organizing team knowledge, or improving existing documentation for clarity and completeness.
### How do I install internal-docs?
Run `npx skills add AbsolutelySkilled/AbsolutelySkilled --skill internal-docs` in your terminal. The skill will be immediately available in your AI coding agent.
### What AI agents support internal-docs?
internal-docs works with claude-code, gemini-cli, openai-codex, mcp. Install it once and use it across any supported AI coding agent.