backend-engineering
Use this skill when designing backend systems, databases, APIs, or services. Triggers on schema design, database migrations, indexing strategies, distributed systems architecture, microservices, caching, message queues, observability setup, logging, metrics, tracing, SLO/SLI definition, performance optimization, query tuning, security hardening, authentication, authorization, API design (REST, GraphQL, gRPC), rate limiting, pagination, and failure handling patterns. Acts as a senior backend engineering advisor for mid-level engineers leveling up.
backend-engineering
backend-engineering is a production-ready AI agent skill for claude-code, gemini-cli, openai-codex, and mcp. It assists with designing backend systems, databases, APIs, and services.
Quick Facts
| Field | Value |
|---|---|
| Category | engineering |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex, mcp |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill backend-engineering
- The backend-engineering skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
A senior backend engineer's decision-making framework for building production systems. This skill covers the six pillars of backend engineering - schema design, scalable systems, observability, performance, security, and API design - with an emphasis on when to use each pattern, not just how. Designed for mid-level engineers (3-5 years) who know the basics and need opinionated guidance on trade-offs.
Tags
backend databases api-design distributed-systems security observability
Platforms
- claude-code
- gemini-cli
- openai-codex
- mcp
Frequently Asked Questions
What is backend-engineering?
Use this skill when designing backend systems, databases, APIs, or services. Triggers on schema design, database migrations, indexing strategies, distributed systems architecture, microservices, caching, message queues, observability setup, logging, metrics, tracing, SLO/SLI definition, performance optimization, query tuning, security hardening, authentication, authorization, API design (REST, GraphQL, gRPC), rate limiting, pagination, and failure handling patterns. Acts as a senior backend engineering advisor for mid-level engineers leveling up.
How do I install backend-engineering?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill backend-engineering in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support backend-engineering?
This skill works with claude-code, gemini-cli, openai-codex, mcp. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Backend Engineering
A senior backend engineer's decision-making framework for building production systems. This skill covers the six pillars of backend engineering - schema design, scalable systems, observability, performance, security, and API design - with an emphasis on when to use each pattern, not just how. Designed for mid-level engineers (3-5 years) who know the basics and need opinionated guidance on trade-offs.
When to use this skill
Trigger this skill when the user:
- Designs a database schema or plans a migration
- Chooses between monolith vs microservices or evaluates scaling strategies
- Sets up logging, metrics, tracing, or alerting
- Diagnoses a performance issue (slow queries, high latency, memory pressure)
- Implements authentication, authorization, or secrets management
- Designs a REST, GraphQL, or gRPC API
- Needs retry, circuit breaker, or idempotency patterns
- Plans data consistency across services (sagas, outbox, eventual consistency)
Do NOT trigger this skill for:
- Frontend-only concerns (CSS, React components, browser APIs)
- DevOps/infra provisioning (use a Terraform/Docker/K8s skill instead)
Key principles
Design for failure, not just success - Every network call can fail. Every disk can fill. Every dependency can go down. The question is not "will it fail" but "how does it degrade?" Design graceful degradation paths before writing the happy path.
Observe before you optimize - Never guess where the bottleneck is. Instrument first, measure second, optimize third. A 10ms query called 1000 times matters more than a 500ms query called once.
Simple until proven otherwise - Start with a monolith, a single database, and synchronous calls. Add complexity (microservices, queues, caches) only when you have evidence the simple approach fails. Every architectural boundary is a new failure mode.
Secure by default, not by afterthought - Auth, input validation, and encryption are not features to add later. They are constraints to build within from day one. Use established libraries. Never roll your own crypto.
APIs are contracts, not implementation details - Once published, an API is a promise. Design from the consumer's perspective inward. Version explicitly. Break nothing silently.
Core concepts
Backend engineering is the discipline of building reliable, performant, and secure server-side systems. The six pillars form a hierarchy:
Schema design is the foundation - get the data model wrong and everything built on top inherits that debt. Scalable systems define how components communicate and grow. Observability gives you eyes into what's actually happening in production. Performance is the art of making it fast after you've made it correct. Security is the set of constraints that keep the system trustworthy. API design is the surface area through which consumers interact with all of the above.
These pillars are not independent. A bad schema creates performance problems. Poor observability makes security incidents invisible. A poorly designed API forces clients into patterns that break your scaling strategy. Think of them as a connected system, not a checklist.
Common tasks
Design a database schema
Start from access patterns, not entity relationships. Ask: "What queries will this serve?" before drawing a single table.
Decision framework:
- Read-heavy, predictable queries -> Normalize (3NF), add targeted indexes
- Write-heavy, high throughput -> Consider denormalization, append-only tables
- Complex relationships with traversals -> Consider a graph model
- Unstructured/evolving data -> Document store (but think twice)
Indexing rule of thumb: Index columns that appear in WHERE, JOIN, and ORDER BY.
A composite index on (a, b, c) serves queries on (a), (a, b), and (a, b, c)
but NOT (b, c). See references/schema-design.md for detailed indexing strategies.
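The leftmost-prefix rule can be observed directly with a query planner. A minimal sketch using SQLite (table and index names are made up, and other engines word their EXPLAIN output differently, but the principle is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE INDEX idx_abc ON t (a, b, c)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(i % 10, i % 7, i) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether the planner seeks (SEARCH) or scans.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

print(plan("SELECT * FROM t WHERE a = 1"))           # seeks via idx_abc (SEARCH)
print(plan("SELECT * FROM t WHERE b = 1 AND c = 2")) # no leading a -> full scan
```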
Always plan migration rollbacks. A deploy that adds a column is safe. A deploy that drops a column is a one-way door. Use expand-contract migrations for breaking changes.
Choose a scaling strategy
Is a single server sufficient?
YES -> Stay there. Optimize vertically first.
NO -> Is the bottleneck compute or data?
COMPUTE -> Horizontal scale with stateless services + load balancer
DATA -> Is it read-heavy or write-heavy?
READ -> Add read replicas, then caching layer
WRITE -> Partition/shard the database

Only introduce microservices when you have: (a) independent deployment needs, (b) different scaling profiles per component, or (c) team boundaries that demand it.
Never split a monolith along technical layers (API service, data service). Split along business domains (orders, payments, inventory).
Set up observability
Implement the three pillars with correlation:
| Pillar | What it answers | Tool examples |
|---|---|---|
| Logs | What happened? | Structured JSON logs with correlation IDs |
| Metrics | How is the system performing? | RED metrics (Rate, Errors, Duration) |
| Traces | Where did time go? | Distributed traces across service boundaries |
Define SLOs before writing alerts. An SLO like "99.9% of requests complete in <200ms" gives you an error budget. Alert when the burn rate threatens the budget, not on every spike.
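The burn-rate arithmetic is simple enough to sketch (illustrative numbers; real alerting systems like Prometheus compute this over multiple windows):

```python
# Burn rate = observed error rate / error rate the SLO allows.
# 1.0 means the budget lasts exactly the SLO window; 14.4 means a
# 30-day budget is exhausted in roughly two days - a common page threshold.
def burn_rate(error_rate: float, slo_target: float) -> float:
    allowed = 1.0 - slo_target  # e.g. 99.9% SLO allows a 0.1% error rate
    return error_rate / allowed

print(burn_rate(0.001, 0.999))   # ~1.0  - burning exactly on plan
print(burn_rate(0.0144, 0.999))  # ~14.4 - budget gone in ~2 days, alert
```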
Diagnose a performance issue
Follow this checklist in order:
- Check metrics - is it CPU, memory, I/O, or network?
- Check slow query logs - are there N+1 patterns or full table scans?
- Check connection pools - are connections exhausted or leaking?
- Check external dependencies - is a downstream service slow?
- Profile the code - only after ruling out infrastructure causes
The fix for "the database is slow" is almost never "add more database." It's usually: add an index, fix an N+1, or cache a hot read path.
Secure a service
Minimum security checklist for any backend service:
- Authentication: Use OAuth 2.0 / OIDC for user-facing, API keys + HMAC for service-to-service. Never store plain-text passwords (bcrypt/argon2 minimum).
- Authorization: Implement at the middleware level. Default deny. Check permissions on every request, not just at the edge.
- Input validation: Validate at system boundaries. Use allowlists, not blocklists. Parameterize all SQL queries.
- Secrets: Use a secrets manager (Vault, AWS Secrets Manager). Never commit secrets to git. Rotate regularly.
- Transport: TLS everywhere. No exceptions.
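A minimal salted-hash sketch of the password rule above. The checklist calls for bcrypt/argon2; `hashlib.pbkdf2_hmac` is used here only because it needs no third-party package - swap in bcrypt or argon2-cffi in production:

```python
import hashlib
import hmac
import os

def hash_password(password: str, *, iterations: int = 600_000) -> bytes:
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt + digest  # store the salt alongside the hash

def verify_password(password: str, stored: bytes, *, iterations: int = 600_000) -> bool:
    salt, digest = stored[:16], stored[16:]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, digest)  # constant-time compare

stored = hash_password("hunter2")
print(verify_password("hunter2", stored))  # True
print(verify_password("wrong", stored))    # False
```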
Design an API
REST decision table:
| Need | Pattern |
|---|---|
| Simple CRUD | REST with standard HTTP verbs |
| Complex queries with flexible fields | GraphQL |
| High-performance internal service calls | gRPC |
| Real-time bidirectional | WebSockets |
| Event notification to external consumers | Webhooks |
Pagination: Use cursor-based for large/changing datasets, offset-based only for
small/static datasets. Always include a next_cursor field.
Versioning: URL path versioning (/v1/) for public APIs, header versioning for
internal. Never break existing consumers silently.
Rate limiting: Token bucket for user-facing, fixed window for internal. Always
return Retry-After headers with 429 responses.
Handle partial failures
When services depend on other services, failures cascade. Use these patterns:
- Retry with exponential backoff + jitter - for transient failures (network blips, 503s). Cap at 3-5 retries.
- Circuit breaker - stop calling a failing dependency. States: closed (normal) -> open (failing, fast-fail) -> half-open (testing recovery).
- Idempotency keys - make retries safe. Every mutating operation should accept an idempotency key so duplicate requests produce the same result.
- Timeouts - always set them. A missing timeout is an unbounded resource leak.
Plan data consistency
For distributed data across services:
- Strong consistency needed? -> Single database, ACID transactions
- Can tolerate eventual consistency? -> Event-driven with outbox pattern
- Multi-step business process? -> Saga pattern (prefer choreography over orchestration for simple flows, orchestration for complex ones)
The outbox pattern: write the event to a local "outbox" table in the same transaction as the data change. A separate process publishes outbox events to the message broker. This guarantees at-least-once delivery without 2PC.
Anti-patterns / common mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Premature microservices | Creates distributed monolith, adds network failure modes | Start monolith, extract services when domain boundaries are proven |
| Missing indexes on query columns | Full table scans under load, cascading timeouts | Profile queries with EXPLAIN, add indexes for WHERE/JOIN/ORDER BY |
| Logging everything, alerting on nothing | Alert fatigue, real incidents get buried | Structured logs with levels, SLO-based alerting on burn rate |
| N+1 queries in loops | Linear query growth per record, kills DB under load | Batch fetches, eager loading, or dataloader pattern |
| Rolling your own auth/crypto | Subtle security bugs that go unnoticed for months | Use battle-tested libraries (bcrypt, passport, OIDC providers) |
| Designing APIs from the database out | Leaks internal structure, painful to evolve | Design from consumer needs inward, then map to storage |
| Destructive migrations without rollback | One-way door that can cause downtime | Expand-contract pattern, backward-compatible migrations |
| Caching without invalidation strategy | Stale data, cache-database drift, inconsistency | Define TTL, invalidation triggers, and cache-aside pattern upfront |
Gotchas
Expand-contract is the only safe way to remove a column - Deploying code that removes a column before the column is dropped from the database causes immediate errors. Deploying a migration that drops a column while old code still reads it causes the same. The only safe path: deploy new code that ignores the old column, then deploy the migration that drops it, then optionally clean up the code.
Connection pool exhaustion looks like a slow database - When all connections in the pool are in use, new queries queue up indefinitely. Profiling shows slow queries; the real problem is too many concurrent requests or a connection leak. Check pool metrics (active, idle, waiting) before blaming the database.
Outbox pattern requires an idempotent consumer - The outbox pattern guarantees at-least-once delivery. If your message consumer isn't idempotent, it will process the same event twice after a crash and a restart. Every consumer must be able to handle duplicate messages safely.
N+1 queries in ORM code are invisible until production load - Fetching a list of 50 orders and then calling .customer on each in a loop generates 51 queries. In development with 5 rows it's imperceptible; under production load it causes cascading timeouts. Always check query counts in integration tests and use eager loading for related data.
Circuit breakers need a half-open timeout - A circuit that opens on failure and never closes traps a service in permanent degraded mode even after the downstream dependency recovers. Always configure a half-open probe interval so the breaker tests recovery and transitions back to closed state automatically.
References
For detailed patterns and implementation guidance on specific domains, read the
relevant file from the references/ folder:
- references/schema-design.md - normalization, indexing strategies, migration patterns
- references/scalable-systems.md - distributed patterns, caching, queues, load balancing
- references/observability.md - logging, metrics, tracing, SLOs, alerting setup
- references/performance.md - profiling, query optimization, connection pooling, async
- references/security.md - auth flows, encryption, OWASP top 10, secrets management
- references/api-design.md - REST/GraphQL/gRPC conventions, versioning, pagination
- references/failure-patterns.md - circuit breakers, retries, idempotency, sagas
Only load a references file if the current task requires it - they are long and will consume context.
References
api-design.md
API Design Reference
Opinionated defaults for mid-level backend engineers. When in doubt, pick the simpler option.
1. REST Conventions
Resource Naming
| Do | Don't |
|---|---|
| GET /users | GET /getUsers |
| GET /users/42 | GET /user/42 |
| POST /users | POST /createUser |
| GET /users/42/orders | GET /getUserOrders?userId=42 |
Rules: Plural nouns always. No verbs in URLs. Lowercase, hyphen-separated (/order-items). Max two levels of nesting - flatten beyond that.
HTTP Method Semantics
| Method | Semantics | Idempotent | Safe | Body |
|---|---|---|---|---|
| GET | Read | Yes | Yes | No |
| POST | Create / trigger action | No | No | Yes |
| PUT | Full replace | Yes | No | Yes |
| PATCH | Partial update | Yes* | No | Yes |
| DELETE | Remove | Yes | No | No |

*PATCH is idempotent only when the patch document sets absolute values ("set status to shipped"), not relative ones ("increment quantity").
Default: Use PATCH for updates, not PUT. Clients rarely have the full resource.
Status Codes
Success: 200 GET/PATCH ok | 201 POST created (+ Location header) | 202 async accepted | 204 DELETE ok, no body
Client errors:
| Code | Use | Common Mistake |
|---|---|---|
| 400 | Malformed JSON, missing required field | - |
| 401 | No auth / expired token ("I don't know who you are") | Using 403 |
| 403 | Authenticated but no permission ("I know you, and no") | Using 401 |
| 404 | Resource doesn't exist | - |
| 409 | Duplicate creation, version conflict | - |
| 410 | Permanently deleted (was here, now gone) | Using 404 |
| 422 | Valid JSON but fails business rules | Using 400 |
| 429 | Rate limited (always include Retry-After) | Forgetting header |
400 vs 422: 400 = structural (bad JSON). 422 = business logic (email taken).
Nested vs Flat
Default: Flat with filters. Nest only when child can't exist without parent.
GET /users/42/orders # OK - order belongs to user
GET /orders?user_id=42 # Better for most cases

HATEOAS
Skip it. Exception: public APIs for third parties where discoverability matters.
2. GraphQL
When It Makes Sense
| Good Fit | Poor Fit |
|---|---|
| Mobile clients with varied data needs | Simple CRUD |
| BFF pattern, multiple frontend teams | Server-to-server (use gRPC) |
| Reducing over-fetching on slow networks | File uploads, HTTP caching needed |
Schema Design
Use Relay connection spec for lists. Non-nullable by default. Prefix mutations: createUser, updateUser. Return mutated object from mutations.
type User {
id: ID!
email: String!
orders(first: Int, after: String): OrderConnection!
}
type OrderConnection { edges: [OrderEdge!]! pageInfo: PageInfo! }
type OrderEdge { cursor: String! node: Order! }
type PageInfo { hasNextPage: Boolean! endCursor: String }

N+1 and DataLoader
# BAD - called once per user = N+1
resolve User.orders(user):
return db.query("SELECT * FROM orders WHERE user_id = ?", user.id)
# GOOD - DataLoader batches all keys into one query
resolve User.orders(user):
return orderLoader.load(user.id)
batch_load_orders(user_ids):
return db.query("SELECT * FROM orders WHERE user_id IN (?)", user_ids)

Query Complexity Limits
Assign cost per field. Reject queries exceeding max complexity (1000) or max depth (10). Use persisted queries (APQ) in production - prevents arbitrary execution, smaller payloads, auditable.
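A toy depth check over a nested selection tree (field names and the dict shape are hypothetical; real servers use the validation rules shipped with their GraphQL library, e.g. graphql-core, rather than hand-rolled walkers):

```python
# Recursively find the deepest selection path in a toy query AST.
def max_depth(selection: dict, depth: int = 1) -> int:
    children = selection.get("selections", [])
    if not children:
        return depth
    return max(max_depth(child, depth + 1) for child in children)

query = {"field": "user", "selections": [
    {"field": "orders", "selections": [
        {"field": "items", "selections": [{"field": "sku"}]}]}]}

MAX_DEPTH = 10
d = max_depth(query)   # user -> orders -> items -> sku
print(d)               # 4
assert d <= MAX_DEPTH, "reject query before execution"
```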
3. gRPC
When gRPC Shines
| Strength | Why |
|---|---|
| Internal service-to-service | Strict protobuf contracts, fast serialization |
| High throughput / low latency | Binary protocol, HTTP/2 multiplexing |
| Polyglot environments | Auto-generated clients for any language |
| Streaming data | First-class bidirectional streaming |
Protobuf Schema Design
syntax = "proto3";
package orders.v1;
service OrderService {
rpc GetOrder(GetOrderRequest) returns (GetOrderResponse);
rpc ListOrders(ListOrdersRequest) returns (ListOrdersResponse);
rpc StreamOrderUpdates(StreamOrderUpdatesRequest) returns (stream OrderUpdate);
}
message GetOrderRequest { string order_id = 1; }
message GetOrderResponse { Order order = 1; }
message Order {
string id = 1; string user_id = 2; OrderStatus status = 3;
repeated OrderItem items = 4; google.protobuf.Timestamp created_at = 5;
}
enum OrderStatus {
ORDER_STATUS_UNSPECIFIED = 0; ORDER_STATUS_PENDING = 1;
ORDER_STATUS_CONFIRMED = 2; ORDER_STATUS_SHIPPED = 3;
}

Rules: Always wrap in request/response messages. UNSPECIFIED = 0 first enum. Package as <service>.v1. Use google.protobuf.Timestamp.
Unary vs Streaming
Default: Unary. Server streaming for live feeds/large results. Client streaming for uploads/batching. Bidirectional for chat/collab. Only stream when you need it.
Error Model
| gRPC Code | HTTP | Use |
|---|---|---|
| INVALID_ARGUMENT | 400 | Bad input |
| NOT_FOUND | 404 | Missing resource |
| ALREADY_EXISTS | 409 | Duplicate |
| PERMISSION_DENIED | 403 | Forbidden |
| UNAUTHENTICATED | 401 | No/bad auth |
| RESOURCE_EXHAUSTED | 429 | Rate limited |
| INTERNAL | 500 | Server bug |
| UNAVAILABLE | 503 | Temporary, retry |
Browser Clients
gRPC doesn't work natively in browsers. Use gRPC-Web proxy (Envoy) or Connect protocol (Buf). Or just use REST/GraphQL for frontend, gRPC internally.
4. Pagination
Cursor vs Offset Decision
| Factor | Cursor-Based | Offset-Based |
|---|---|---|
| Large dataset perf | O(1) seek | O(n) scan |
| Consistency under writes | Stable | Items shift |
| Random page access | No | Yes (?page=5) |
| Default for APIs | Yes | Admin UIs only |
Cursor Implementation
# Cursor = base64(json({"id": last_id, "sort": sort_value}))
GET /orders?limit=20&cursor=eyJpZCI6NDIsInNvcnQiOiIyMDI0LTAxLTE1In0=
# Server:
decode cursor -> {id: 42, sort: "2024-01-15"}
SELECT * FROM orders WHERE (created_at, id) < ('2024-01-15', 42)
ORDER BY created_at DESC, id DESC LIMIT 21; -- +1 to check hasNextPage

Always include primary key in cursor alongside sort key to handle ties.
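The opaque-cursor encoding from the pseudocode above can be sketched in a few lines (the `id`/`sort` field names are the ones assumed above; adapt to your schema):

```python
import base64
import json

def encode_cursor(last_id: int, sort_value: str) -> str:
    # Serialize the position, then base64 it so clients treat it as opaque.
    raw = json.dumps({"id": last_id, "sort": sort_value}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))

c = encode_cursor(42, "2024-01-15")
print(decode_cursor(c))  # {'id': 42, 'sort': '2024-01-15'}
```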
Page Size & Response
Enforce server-side max: limit = min(request.limit || 20, 100)
{ "data": [...], "pagination": { "next_cursor": "...", "has_next_page": true, "page_size": 20 } }

5. Versioning
Strategy
| Strategy | Pros | Cons | Use When |
|---|---|---|---|
| URL path /v1/ | Obvious, easy routing/caching | URL pollution | Default |
| Header Accept-Version | Clean URLs | Hidden, hard to debug | Internal w/ sophisticated clients |
| Query param | Easy browser testing | Caching issues | Almost never |
Breaking vs Non-Breaking
Safe: Add response field, add optional param, add endpoint, widen validation.
Breaking (new version): Remove/rename field, change type, change URL, tighten validation.
Sunset Process
HTTP/1.1 200 OK
Deprecation: true
Sunset: Sat, 01 Mar 2025 00:00:00 GMT
Link: <https://api.example.com/v2/users>; rel="successor-version"

External APIs: 6+ months notice. Internal: 1-3 months. Never support more than 2 active versions.
Multiple Versions
Route at controller layer, share service/domain logic. V1 and V2 controllers call same service, apply different serializers.
6. Rate Limiting
Algorithms
| Algorithm | Behavior | Pros | Cons |
|---|---|---|---|
| Token Bucket | Fills at steady rate, request takes token | Allows bursts | Slightly complex |
| Sliding Window | Rolling time window count | Accurate | Memory cost |
| Fixed Window | Count per interval | Simple | 2x spike at boundaries |
Default: Token bucket. Most gateways (Kong, Envoy) use it.
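A minimal in-memory token-bucket sketch (single process, illustrative only; production limiters live at the gateway or in a shared store like Redis). The injectable clock is an assumption added here to make the behavior deterministic:

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Frozen clock: a 3-token burst is allowed, then requests are rejected.
t = [0.0]
bucket = TokenBucket(capacity=3, refill_per_sec=1, clock=lambda: t[0])
print([bucket.allow() for _ in range(5)])  # [True, True, True, False, False]
t[0] += 2.0                                # two seconds later: two tokens refilled
print(bucket.allow())                      # True
```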
Headers and 429 Response
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1710432000
# On 429:
Retry-After: 30 # ALWAYS include this
Content-Type: application/problem+json
{"type":"https://api.example.com/errors/rate-limited","title":"Rate limit exceeded","status":429}

Scopes and Tiers
| Scope | Use Case |
|---|---|
| Per API key / user | Default, fairest |
| Per IP | Unauthenticated endpoints (login, signup) |
| Per endpoint | Protect expensive operations |
| Global | Infrastructure protection |
Tier limits in API key/user record. Enforce at gateway, not application code. Example: free=100/hr, pro=5000/hr, enterprise=50000/hr.
7. API Design Checklist
Before Shipping Any Endpoint
- Resource names are plural nouns, lowercase, hyphen-separated
- Consistent field naming (snake_case or camelCase - pick one, stick with it)
- Dates in ISO 8601 UTC (2024-01-15T09:30:00Z)
- Errors follow RFC 7807 format with machine-readable code per field
- 5xx errors never expose internals (no stack traces, SQL, IPs)
- All list endpoints paginated with enforced max page size
- Filtering: ?status=shipped&created_after=2024-01-01
- Sorting: ?sort=-created_at,+total (prefix for direction)
- Field selection: ?fields=id,name,email
- Bulk endpoints cap at 100 items, return per-item status (207)
- POST endpoints accept Idempotency-Key header (UUIDv4, stored 24h)
- Auth on all endpoints except health checks
- Authorization at resource level, not just endpoint level
- Rate limiting configured per scope
- Request body size limits enforced
- No secrets in URLs - use headers or body
- OpenAPI spec auto-generated from code
- Every endpoint has request/response examples documented
RFC 7807 Error Format
{"type":"https://api.example.com/errors/validation-error","title":"Validation Error",
"status":422,"detail":"Email is invalid.","errors":[{"field":"email","code":"invalid_format"}]}

Quick Decision Defaults
| Decision | Default |
|---|---|
| Frontend API style | REST (GraphQL if multiple clients with varied needs) |
| Service-to-service | gRPC |
| Pagination | Cursor-based |
| Versioning | URL path /v1/ |
| Rate limiting | Token bucket |
| Error format | RFC 7807 |
| Field naming | snake_case for JSON |
| Timestamps | ISO 8601, always UTC |
| IDs | UUIDs (not auto-increment in public APIs) |
| Auth | Bearer token in Authorization header |
failure-patterns.md
Failure Handling Patterns
Distributed systems fail in partial, unpredictable ways. These patterns help services degrade gracefully instead of cascading failures.
1. Retry Patterns
| Strategy | Delay Formula | Best For |
|---|---|---|
| Simple retry | Fixed delay (e.g., 1s) | Quick transient blips |
| Exponential backoff | base * 2^attempt | Sustained outages |
| Backoff + jitter | random(0, base * 2^attempt) | Production default |
Retry (transient): 503, 429 (with backoff), connection reset, timeout, 502.
Do NOT retry (permanent): 400, 401/403, 404, 409, 422.
Retry budgets: Cap retries at 10-20% of total requests to prevent retry amplification.
import random, time

class RetryableError(Exception):
    """Marker for errors worth retrying (e.g. 503, timeout, connection reset)."""

def retry_with_backoff(op, max_retries=3, base_ms=100, max_ms=10000):
    for attempt in range(max_retries + 1):
        try:
            return op()
        except RetryableError:
            if attempt == max_retries:
                raise
            delay_ms = min(base_ms * (2 ** attempt), max_ms)
            time.sleep(random.uniform(0, delay_ms) / 1000)  # Full jitter

2. Circuit Breaker
Stops calling a dependency known to be failing. Three states:
CLOSED --[failure threshold]--> OPEN --[reset timeout]--> HALF-OPEN
HALF-OPEN --[trial succeeds]--> CLOSED | --[trial fails]--> OPEN

| Parameter | Typical Default |
|---|---|
| Failure threshold | 5 failures in 60s or 50% failure rate |
| Reset timeout | 30s |
| Half-open trial count | 3 |
When to use: External APIs, databases under load, any remote dependency with expected failures. NOT for local in-process calls.
class CircuitBreaker:
    def __init__(self, threshold=5, reset_ms=30000):
        self.threshold, self.reset_ms = threshold, reset_ms
        self.state, self.failures, self.last_fail = CLOSED, 0, None
def call(self, op):
if self.state == OPEN:
if now() - self.last_fail > self.reset_ms:
self.state = HALF_OPEN
else:
raise CircuitOpenError("Failing fast")
try:
result = op()
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
if self.state == HALF_OPEN: self.state = CLOSED
self.failures = 0
def _on_failure(self):
self.failures += 1
self.last_fail = now()
if self.failures >= self.threshold: self.state = OPEN

Circuit breaker + retry are complementary: Retries handle transient blips; circuit breaker handles sustained outages. Retries run inside the circuit breaker: Request -> CB -> Retry -> Downstream.
3. Idempotency
Retries without idempotency cause duplicate side effects (e.g., double charges).
| Type | Description | Example |
|---|---|---|
| Natural | Inherently idempotent | SET balance = 100, DELETE WHERE id = X |
| Artificial | Made idempotent via keys | CHARGE $50 with key txn-abc-123 |
Prefer natural idempotency (absolute values over deltas). Use idempotency keys when the operation is inherently non-idempotent.
def process_payment(request):
key = request.headers["Idempotency-Key"] # Client-generated UUID
existing = db.query("SELECT result FROM idempotency_store WHERE key = ?", key)
if existing:
return existing.result # Return cached response
result = charge_payment(request.amount)
db.execute("INSERT INTO idempotency_store (key, result, created_at) VALUES (?, ?, now())", key, result)
return result

Database-level: Use INSERT ... ON CONFLICT DO NOTHING with idempotency key as primary/unique key.
TTL: Expire idempotency keys after 24-72 hours to prevent unbounded storage growth.
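The database-level variant can be demonstrated with SQLite's dialect (table shape follows the example above; Postgres spells the same thing `INSERT ... ON CONFLICT DO NOTHING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idempotency_store (key TEXT PRIMARY KEY, result TEXT)")

def record_once(key: str, result: str) -> bool:
    # The unique key makes the duplicate insert a silent no-op.
    cur = conn.execute(
        "INSERT OR IGNORE INTO idempotency_store (key, result) VALUES (?, ?)",
        (key, result))
    return cur.rowcount == 1  # True only the first time this key is seen

print(record_once("txn-abc-123", "charged"))  # True
print(record_once("txn-abc-123", "charged"))  # False - duplicate ignored
```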
4. Timeouts
| Type | Controls | Typical Default |
|---|---|---|
| Connect timeout | TCP connection establishment | 1-3s |
| Read timeout | Waiting for response data | 5-30s |
| Total timeout | End-to-end including retries | 30-60s |
Without timeouts: Slow dependency blocks threads, pool exhausts, your service stops responding, callers cascade fail. A single missing timeout can take down an entire service mesh.
Deadline Propagation
In chains (A -> B -> C), propagate a shrinking deadline so downstream knows time remaining:
def handle_request(request):
deadline = request.header("X-Deadline") or (now() + DEFAULT_TIMEOUT)
remaining = deadline - now()
if remaining <= 0:
return error(504, "Deadline exceeded")
downstream_timeout = remaining - LOCAL_BUFFER_MS
return call_service_b(timeout=downstream_timeout, headers={"X-Deadline": deadline})

| Scenario | Suggested Timeout |
|---|---|
| Internal service RPC | 3-5s |
| Database query | 5-10s |
| External third-party API | 10-30s |
| User-facing API total | 10-30s |
Always set timeouts explicitly. Never rely on library defaults.
5. Bulkhead Pattern
Isolates failures so one degraded dependency does not exhaust shared resources.
# BAD: Shared pool - slow payment service blocks all 100 threads
executor = ThreadPoolExecutor(max_workers=100)
# GOOD: Isolated pools per dependency
payment_pool = ThreadPoolExecutor(max_workers=30)
email_pool = ThreadPoolExecutor(max_workers=20)
inventory_pool = ThreadPoolExecutor(max_workers=30)

Pool sizing: target_throughput * avg_latency_seconds + buffer (e.g., 100 req/s at 200ms = 30 threads).
Also apply to DB connection pools: separate pools for critical (order processing) vs non-critical (analytics) paths.
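The sizing rule above is Little's law plus headroom; a quick sketch with the document's numbers (the buffer of 10 is an assumed default):

```python
# pool = throughput * latency (concurrent requests at steady state) + headroom
def pool_size(throughput_rps: float, avg_latency_s: float, buffer: int = 10) -> int:
    in_flight = throughput_rps * avg_latency_s
    return int(in_flight) + buffer

print(pool_size(100, 0.200))  # 100 req/s * 200ms = 20 in flight, +10 buffer = 30
```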
6. Saga Pattern
Manages distributed transactions across services where ACID is not possible.
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coordination | Services react to events | Central coordinator directs |
| Best for | Simple sagas (2-4 steps) | Complex sagas (5+ steps) |
| Visibility | Hard to trace | Easy to see full flow |
Use when: Multi-service business processes, cross-database operations. NOT for single-database operations.
class OrderSaga:
steps = [
SagaStep(action=reserve_inventory, compensate=release_inventory),
SagaStep(action=charge_payment, compensate=refund_payment),
SagaStep(action=schedule_shipping, compensate=cancel_shipping),
]
def execute(self, order):
completed = []
for step in self.steps:
try:
step.action(order)
completed.append(step)
except Exception as e:
for s in reversed(completed): # Compensate in reverse
try:
s.compensate(order)
except Exception:
alert_ops_team(order, s) # Manual intervention needed
raise SagaFailed(e)

Key rules: Compensations must be idempotent. Persist saga state so it survives coordinator crashes.
7. Outbox Pattern
Solves the dual-write problem: updating a database AND publishing an event atomically.
def create_order(order):
with db.transaction(): # Same transaction guarantees consistency
db.execute("INSERT INTO orders (id, status) VALUES (?, ?)", order.id, 'created')
db.execute("INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
uuid4(), 'OrderCreated', serialize(order))

Publishing approaches:
| Approach | How | Trade-off |
|---|---|---|
| Polling publisher | Query outbox for unpublished rows | Simple; adds DB load |
| CDC (e.g., Debezium) | Read DB transaction log | Near real-time; more infra |
def publish_outbox(): # Polling publisher
events = db.query("SELECT * FROM outbox WHERE published_at IS NULL ORDER BY created_at LIMIT 100")
for event in events:
message_broker.publish(event.event_type, event.payload)
db.execute("UPDATE outbox SET published_at = now() WHERE id = ?", event.id)

Provides at-least-once delivery - consumers must be idempotent.
8. Graceful Degradation
Shed non-critical work under pressure to preserve core functionality.
| Pressure Level | Response |
|---|---|
| Normal | Full functionality |
| Elevated | Disable recommendations, analytics |
| High | Return cached/fallback responses |
| Critical | Read-only mode, reject new work |
def get_product(product_id):
    product = product_service.get(product_id)  # Always execute
    if feature_flags.enabled("recommendations"):
        product.recs = recommendation_service.get(product_id)
    else:
        product.recs = []  # Fallback
    return product

# Load shedding
def handle(request):
    if active_requests.get() >= MAX_CONCURRENT:
        return Response(503, headers={"Retry-After": "5"})
    active_requests.increment()
    try:
        return process(request)
    finally:
        active_requests.decrement()

Fallback chain: Try live service -> Try cache -> Return generic default.
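That fallback chain can be sketched as a small helper. This is a minimal sketch: `live_service` and `cache` are hypothetical clients passed in by the caller, not part of any specific framework.

```python
import logging

logger = logging.getLogger("degradation")

def get_recommendations(product_id, live_service, cache, default=()):
    """Fallback chain: live service -> last-known-good cache -> generic default."""
    try:
        recs = live_service.get(product_id)       # 1. try the live dependency
        cache.set(f"recs:{product_id}", recs)     # refresh cache on success
        return recs
    except Exception:
        logger.warning("recommendation service failed, falling back to cache")
    cached = cache.get(f"recs:{product_id}")      # 2. serve stale-but-useful data
    if cached is not None:
        return cached
    return list(default)                          # 3. degrade to a generic default
```

Each step trades freshness for availability; the caller never sees the dependency failure.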
Pattern Selection Guide
| Problem | Primary Pattern | Complementary |
|---|---|---|
| Transient network errors | Retry + backoff | Idempotency, timeouts |
| Dependency frequently down | Circuit breaker | Retry (inside CB), fallbacks |
| Duplicate side effects | Idempotency keys | DB unique constraints |
| Slow dependency hangs service | Timeouts | Circuit breaker, bulkhead |
| One bad dep takes down all | Bulkhead | Circuit breaker, timeouts |
| Multi-service transaction | Saga | Idempotency (compensations) |
| Reliable event publishing | Outbox pattern | Idempotency (consumers) |
| System overloaded | Graceful degradation | Load shedding, queue leveling |
observability.md
Observability Reference
Observability is not monitoring. Monitoring tells you when something is broken. Observability tells you why. The three pillars - logs, metrics, traces - are useless in isolation. They work together. Invest in correlation between them from day one.
1. Structured Logging
Use JSON Logs - Always
Plain text logs are for humans staring at a terminal. JSON logs are for systems that need to search, filter, and aggregate across millions of entries. Every service should emit structured JSON.
{
"timestamp": "2025-03-14T10:23:45.123Z",
"level": "ERROR",
"service": "payment-service",
"trace_id": "abc123def456",
"request_id": "req-789",
"user_id": "u-42",
"method": "POST",
"path": "/api/v1/charges",
"status": 500,
"duration_ms": 234,
"error": "upstream timeout from billing-gateway",
"message": "Failed to process charge"
}

Log Levels
| Level | When to Use | Example |
|---|---|---|
| `TRACE` | Fine-grained debugging, never in production | Cache key lookup details |
| `DEBUG` | Development diagnostics, off by default in prod | SQL query parameters |
| `INFO` | Normal operations worth recording | Request completed, job started/finished |
| `WARN` | Recoverable issues that need attention | Retry succeeded, fallback used, pool low |
| `ERROR` | Failures that need investigation | Request failed, dependency timeout |
| `FATAL` | Process cannot continue | DB connection lost, config missing |
Opinion: Default to INFO in production. If you are logging at DEBUG in production to diagnose an issue, you have a tracing or metrics gap.
Correlation IDs
Every request entering your system gets a correlation/trace ID. Propagate it through every service call.
Client -> API Gateway (generates trace_id: "abc123")
-> Service A (logs with trace_id: "abc123", span_id: "span-1")
-> Service B (logs with trace_id: "abc123", span_id: "span-2")
-> Service C (logs with trace_id: "abc123", span_id: "span-3")

Propagation checklist:
- Extract `X-Request-ID` or `traceparent` header on ingress
- Generate one if missing (at the edge only)
- Attach to thread-local / async context
- Include in every log line automatically (middleware, not manual)
- Forward in all outbound HTTP/gRPC/message headers
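The checklist above fits in a few lines using Python's `contextvars`. This is a minimal sketch, not any specific framework's API; the function names are illustrative.

```python
import contextvars
import json
import uuid

# Holds the current request's trace ID for the duration of the request,
# even across async task boundaries.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def ingress_middleware(headers):
    """Extract the incoming trace ID, or generate one at the edge."""
    trace_id = headers.get("X-Request-ID") or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id

def log(level, message, **fields):
    """Every log line picks up the trace ID automatically - no manual plumbing."""
    record = {"level": level, "trace_id": trace_id_var.get(),
              "message": message, **fields}
    print(json.dumps(record))

def outbound_headers():
    """Forward the trace ID on every outbound call."""
    return {"X-Request-ID": trace_id_var.get()}
```

Application code never touches the trace ID directly; the middleware sets it once and the logger and HTTP client read it from context.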
What to Log vs What NOT to Log
| Log This | Never Log This |
|---|---|
| Request ID, trace ID | Passwords, tokens, API keys |
| User ID (pseudonymized if needed) | Credit card numbers, SSNs |
| HTTP method, path, status code | Full request/response bodies |
| Duration in milliseconds | PII (email, phone, address) |
| Error messages and codes | Session tokens or auth headers |
| Upstream service and latency | Database connection strings |
| Queue depth, retry count | Internal IP addresses (in public logs) |
Opinion: If you log full request bodies "for debugging," you will eventually leak PII into your log aggregator. Log a content hash or size instead.
Log Aggregation Patterns
- Ship logs via a sidecar or agent (Fluentd, Vector, Filebeat) - not directly from your app
- Use a structured pipeline: App -> Agent -> Buffer (Kafka) -> Aggregator (Elasticsearch/Loki)
- Set retention policies per log level: ERROR=90d, WARN=30d, INFO=14d
- Index on: `trace_id`, `service`, `level`, `status`, `user_id`
2. Metrics
RED Method (for Services)
Use RED for any service that handles requests:
| Signal | What It Measures | Metric Type | Example Metric Name |
|---|---|---|---|
| Rate | Requests per second | Counter | http_requests_total |
| Errors | Failed requests per second | Counter | http_requests_errors_total |
| Duration | Time per request | Histogram | http_request_duration_seconds |
USE Method (for Resources)
Use USE for infrastructure - CPU, memory, disk, connections:
| Signal | What It Measures | Example |
|---|---|---|
| Utilization | % of resource capacity in use | CPU at 72%, memory at 85% |
| Saturation | Work waiting (queue depth) | 14 threads waiting for DB connection |
| Errors | Resource-level error events | Disk I/O errors, network drops |
Metric Types
| Type | Use When | Example |
|---|---|---|
| Counter | Value only goes up (resets on restart) | Total requests, total errors |
| Gauge | Value goes up and down | Current connections, queue depth |
| Histogram | You need distribution (p50, p95, p99) | Request latency, response size |
| Summary | Like histogram but calculated client-side | Avoid - prefer histograms for flexibility |
Opinion: Default to histograms for latency. If you use averages for latency, you are hiding your tail - p99 problems are invisible in averages.
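A quick illustration of why, using nothing beyond the standard library: a handful of slow outliers barely moves the mean, while the p99 exposes them immediately.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 95 requests at 20ms plus 5 outliers at 2000ms:
latencies_ms = [20] * 95 + [2000] * 5
mean = sum(latencies_ms) / len(latencies_ms)  # 119.0 - "looks fine"
p50 = percentile(latencies_ms, 50)            # 20   - typical request is fast
p99 = percentile(latencies_ms, 99)            # 2000 - the tail the mean hides
```

One in twenty requests takes two full seconds, and the average never shows it.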
Naming Conventions
<namespace>_<subsystem>_<name>_<unit>
payment_service_http_request_duration_seconds # Good
payment_service_orders_processed_total # Good
requestTime # Bad: no namespace, no unit
PaymentService.http # Bad: dots and camelCase

Rules:
- Use snake_case
- Include the unit as a suffix (`_seconds`, `_bytes`, `_total`)
- Use `_total` suffix for counters
- Prefix with service/namespace
- Use base units (seconds not milliseconds, bytes not megabytes)
Cardinality Traps
High-cardinality labels will kill your metrics backend. Every unique label combination creates a new time series.
| Label | Cardinality | Safe? |
|---|---|---|
| `method` | ~5 | Yes |
| `status_code` | ~5 (grouped) | Yes |
| `endpoint` | ~50 | Usually |
| `customer_tier` | ~4 | Yes |
| `user_id` | Millions | NO |
| `request_id` | Infinite | NO |
| `email` | Millions | NO |
Rule of thumb: If a label has more than ~100 unique values, do not put it on a metric. Use logs or traces for high-cardinality data.
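One way to enforce the rule in code is to normalize labels before they reach the metrics client. A sketch, assuming a plain dict of labels; `normalize_route` is a naive illustration - real web frameworks usually expose the route template directly, which is the better source.

```python
def normalize_route(path):
    """Collapse ID segments: /users/42/orders/7 -> /users/{id}/orders/{id}."""
    return "/".join("{id}" if seg.isdigit() else seg for seg in path.split("/"))

def metric_labels(method, status_code, path):
    """Bound label cardinality before attaching labels to a metric."""
    return {
        # Unknown verbs collapse into one bucket instead of new series.
        "method": method if method in {"GET", "POST", "PUT", "PATCH", "DELETE"} else "OTHER",
        # 200/201/204 all become "2xx" - ~5 values instead of dozens.
        "status_class": f"{status_code // 100}xx",
        # Route template, never the raw path with IDs embedded in it.
        "endpoint": normalize_route(path),
    }
```

High-cardinality detail (the actual user ID, the exact path) belongs on the log line or trace span, keyed by `trace_id`.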
3. Distributed Tracing
Core Concepts
Trace (trace_id: "abc123")
+-- Span A: API Gateway (span_id: "s1", 245ms)
| +-- Span B: Auth Service (span_id: "s2", parent: "s1", 12ms)
| +-- Span C: Order Service (span_id: "s3", parent: "s1", 220ms)
| +-- Span D: DB Query (span_id: "s4", parent: "s3", 45ms)
| +-- Span E: Cache Lookup (span_id: "s5", parent: "s3", 3ms)

- Trace: The full journey of a request across all services
- Span: A single unit of work within a trace (one service call, one DB query)
- Context propagation: Passing trace_id + span_id between services via headers (`traceparent`)
OpenTelemetry Basics
OpenTelemetry (OTel) is the standard. Use it.
tracer = trace.get_tracer("payment-service")  # from opentelemetry import trace

def process_order(order):
    span = tracer.start_span("process-order")
    span.set_attribute("order.id", order.id)
    try:
        result = charge_payment(order)
        span.set_status(Status(StatusCode.OK))
        return result
    except Exception as error:
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(error)
        raise
    finally:
        span.end()

Prefer auto-instrumentation for HTTP/gRPC/DB clients. Add manual spans for business logic.
Sampling Strategies
You cannot trace 100% of production traffic. Sampling is required.
| Strategy | How It Works | Trade-off |
|---|---|---|
| Head-based | Decide at ingress whether to sample | Simple; misses interesting traces |
| Tail-based | Collect all spans, decide after trace completes | Catches errors and slow traces; costly |
| Rate-limited | Sample N traces per second | Predictable cost; biased sample |
Opinion: Start with head-based sampling at 5-10%. Add tail-based sampling for errors and high-latency traces. Tail-based is harder to operate but catches the traces you actually need.
When Tracing Matters Most
- Multi-service request chains (3+ hops)
- Async workflows (message queues, event-driven flows)
- Diagnosing latency - "where is the time spent?"
- Debugging fan-out/fan-in patterns
- Verifying retry and fallback behavior
Tracing is less useful for single-service, synchronous request-response. Logs and metrics are enough there.
4. SLOs and SLIs
Defining SLIs
An SLI (Service Level Indicator) is a measurement of your service's behavior that users care about.
| SLI Type | Definition | Measurement |
|---|---|---|
| Availability | % of requests that succeed | successful_requests / total_requests |
| Latency | % of requests faster than threshold | requests < 300ms / total_requests |
| Throughput | Requests processed per time window | requests_per_second |
| Correctness | % of requests returning correct results | correct_responses / total_responses |
Setting Realistic SLOs
An SLO (Service Level Objective) is a target for your SLI, measured over a time window.
SLO: 99.9% of requests succeed within 300ms, 30-day rolling window
Error budget: At 100 rps -> 259M requests/month -> 0.1% = 259,200 failures allowed

| SLO Target | Downtime per Month | Realistic For |
|---|---|---|
| 99% | 7.3 hours | Internal tools, batch systems |
| 99.9% | 43.8 minutes | Most production APIs |
| 99.95% | 21.9 minutes | Customer-facing critical paths |
| 99.99% | 4.4 minutes | Payment processing, auth services |
| 99.999% | 26.3 seconds | Almost nobody needs this |
Opinion: Start with 99.9%. If you have not operated a service at 99.9% successfully, do not set a 99.99% target - you will just create noise.
Error Budgets and Burn Rate
- Error budget = 1 - SLO target. For 99.9% SLO, budget is 0.1%
- Burn rate = how fast you consume budget. Rate of 1 = expected pace
Burn rate = error_rate_observed / error_rate_allowed
Example: 0.5% observed / 0.1% allowed = 5x burn -> 30-day budget gone in 6 days

SLO-Based vs Threshold-Based Alerting
| Approach | Alert When | Problem |
|---|---|---|
| Threshold | Error rate > 1% for 5 minutes | Too many false alarms or missed issues |
| SLO-based | Burn rate threatens to exhaust budget | Smarter - accounts for sustained impact |
SLO-based alerting (recommended):
- Fast burn (14.4x): Alert in 2 min, page immediately - major incident
- Slow burn (3x): Alert in 1 hour, create ticket - degradation trend
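The burn-rate rules above can be expressed directly. A sketch under the usual convention: 14.4x is the fast-burn threshold (spending ~2% of a 30-day budget in one hour), and requiring both a short and a long window to exceed it keeps a brief blip from paging anyone.

```python
def burn_rate(observed_error_rate, slo_target):
    """Burn rate = observed error rate / allowed error rate (1 - SLO)."""
    return observed_error_rate / (1.0 - slo_target)

def should_page(short_window_rate, long_window_rate,
                slo_target=0.999, threshold=14.4):
    """Fast-burn page: BOTH windows must exceed the threshold, so a
    transient spike alone never pages, but a sustained burn does."""
    return (burn_rate(short_window_rate, slo_target) >= threshold and
            burn_rate(long_window_rate, slo_target) >= threshold)

# The example from above: 0.5% observed against a 99.9% SLO is a 5x burn.
# burn_rate(0.005, 0.999) -> ~5.0
```

The slow-burn ticket alert is the same check with a lower threshold (e.g. 3x) over longer windows.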
Example SLO for a Web Service
service: order-api
slos:
- name: availability
sli: successful_responses / total_responses # (non-5xx)
target: 99.9%
window: 30d
- name: latency
sli: responses_under_300ms / total_responses
target: 99.0%
window: 30d
- name: latency-critical
sli: responses_under_1000ms / total_responses
target: 99.9%
window: 30d

5. Alerting
Alert Fatigue - Causes and Prevention
| Cause | Fix |
|---|---|
| Alerts on symptoms, not impact | Alert on SLO burn rate, not CPU spikes |
| No ownership | Every alert has a team owner |
| Duplicate alerts | Deduplicate and group related alerts |
| Flapping alerts | Add hysteresis (alert on sustained condition) |
| Alerts nobody acts on | Delete them. If nobody investigates, it is noise |
Opinion: If your on-call gets paged more than twice a week outside business hours, your alerts are broken. Fix the alerts, not the humans.
Actionable Alerts Checklist
Every alert MUST have:
- What - clear description of the symptom
- Impact - who/what is affected and how severely
- Runbook link - step-by-step investigation/mitigation guide
- Owner - which team owns this alert
- Severity - page vs ticket vs notification
- Dashboard link - relevant dashboard for context
- Expected behavior - what "normal" looks like for comparison
If an alert does not have a runbook, it is not ready for production.
Severity Levels
| Severity | Response | Example | SLA |
|---|---|---|---|
| P1 | Page immediately | Service down, data loss, security breach | 15 min ack |
| P2 | Page during hours | Degraded performance, partial outage | 1 hour ack |
| P3 | Ticket, next day | Elevated error rate, slow burn on budget | Next biz day |
| P4 | Ticket, backlog | Non-critical warning, capacity planning | 1 week |
Escalation Patterns
T+0 min -> Primary on-call paged
T+15 min -> No ack? Secondary on-call paged
T+30 min -> No ack? Engineering manager paged
T+60 min -> No resolution? Incident commander engaged

On-Call Best Practices
- Rotate weekly, never longer. Handoff meetings at each rotation
- Track on-call burden: pages per shift, time-to-ack, hours spent
- Blameless postmortems for every P1 and repeated P2
- Budget 30% of sprint capacity for on-call follow-up work
6. Dashboards
The Four Golden Signals Dashboard
Every service gets a dashboard with these four panels:
| Signal | Metric | Visualization |
|---|---|---|
| Latency | p50, p95, p99 request duration | Time series graph |
| Traffic | Requests per second | Time series graph |
| Errors | Error rate (5xx / total) | Time series graph |
| Saturation | CPU, memory, connection pool usage | Gauge + graph |
Service Dashboard Template
Row 1: [SLO Status] [Error Budget Remaining] [Apdex Score]
Row 2: [Request Rate] [Error Rate by Type] [Latency p50/p95/p99]
Row 3: [Upstream Latency by Service] [DB Query Duration] [Cache Hit Rate]
Row 4: [CPU/Memory] [Connection Pools] [Queue Depth]
Row 5: [Orders/min] [Revenue Impact] [Active Users]

Dashboard Anti-Patterns
| Anti-Pattern | Why It Hurts | Fix |
|---|---|---|
| 30+ panels on one dashboard | Information overload, nothing stands out | 8-12 panels max, link to details |
| Vanity metrics (total users) | Feels good, reveals nothing about health | Show rate-of-change, not totals |
| No time range context | Spikes look alarming without baseline | Add week-over-week overlay |
| Dashboard per person | Inconsistent views, duplicated work | One canonical dashboard per service |
| No annotations | Deploys and incidents invisible on graphs | Add deploy markers and incident bands |
Quick Reference: Connecting the Three Pillars
"Orders are slow" ->
1. Dashboard: p99 latency spiked at 14:32 (METRICS)
2. Filter by endpoint: POST /orders is slow (METRICS)
3. Find traces for slow /orders requests (TRACES)
4. Trace shows 2s in inventory-service (TRACES)
5. Search logs for inventory-service with trace_id (LOGS)
6. Log: "connection pool exhausted, waited 1.8s" (LOGS)
-> Root cause: DB pool too small after traffic increase

Metrics tell you something is wrong. Traces tell you where. Logs tell you why.
performance.md
Performance Reference
Practical guide for diagnosing and fixing backend performance problems. Assumes you have basic observability in place. If not, set that up first - you cannot optimize what you cannot measure.
1. Performance Diagnosis Workflow
Follow this order. Do not skip to profiling code before ruling out infrastructure.
1. CHECK METRICS (dashboards, APM)
|
2. IDENTIFY BOTTLENECK CATEGORY
+-- CPU bound? -> high CPU %, low I/O wait
+-- Memory bound? -> high RSS, swap usage, OOM kills
+-- I/O bound? -> high iowait %, disk queue depth
+-- Network bound? -> high latency to dependencies, packet loss
|
3. DRILL DOWN into the specific category (see sections below)
|
4. REPRODUCE under controlled load -> FIX -> VERIFY under same load

Flame graphs answer "where is CPU time spent?" Read bottom-up. Wide bars at the top = leaf functions consuming time. Generate with perf (Linux), async-profiler (JVM), or py-spy (Python).
APM checklist:
- RED metrics visible (Rate, Errors, Duration) per endpoint
- Slow transaction traces enabled (threshold: 2x your p95)
- Database query tracking with EXPLAIN capture
- External call latency tracked per dependency
Load testing phases:
| Phase | What to test | Tool examples |
|---|---|---|
| Smoke | Works at 1-5 users? | curl, httpie |
| Load | Expected traffic (1x baseline) | k6, Locust, wrk |
| Stress | 2-5x traffic, find breaking point | k6, Gatling |
| Soak | Sustained hours, find leaks | k6 long duration |
Always load test against production-like dataset sizes.
2. Query Optimization
Reading EXPLAIN plans
Key fields (PostgreSQL; similar concepts in MySQL):
| EXPLAIN field | Good sign | Bad sign |
|---|---|---|
| `Seq Scan` | On tiny tables (<1000 rows) | On large tables - missing index |
| `Index Scan` | Selective index | Index returning >10% of rows |
| `Nested Loop` | Inner side uses index | Inner side is Seq Scan |
| `Hash Join` | Fits in memory | Spilling to disk |
| Rows (est vs actual) | Close match | Off by 10x+ - run ANALYZE |
Always use EXPLAIN (ANALYZE, BUFFERS) for actual execution data.
Common anti-patterns
| Anti-pattern | Fix |
|---|---|
| `SELECT *` | Select only needed columns |
| N+1 queries | JOIN, batch fetch, or dataloader |
| Missing index on WHERE/JOIN/ORDER BY | Add targeted index |
| Implicit type cast (`WHERE varchar_col = 123`) | Match types explicitly |
| `LIKE '%term%'` | Full-text search or trigram index |
| Function on indexed column (`WHERE YEAR(col) = 2024`) | Use range predicate |
| `DISTINCT` hiding a bad JOIN | Fix the JOIN |
Query rewriting techniques
-- SLOW: correlated subquery runs per row
SELECT o.id FROM orders o
WHERE o.total > (SELECT AVG(total) FROM orders WHERE user_id = o.user_id);
-- FAST: window function, single pass
SELECT id FROM (
SELECT id, total, AVG(total) OVER (PARTITION BY user_id) AS avg_total
FROM orders
) sub WHERE total > avg_total;

Batch operations: Replace N individual inserts with multi-row INSERT INTO t (col) VALUES ('a'), ('b'), ('c'). Use prepared statements for any repeated query.
3. Connection Pooling
Why it matters
Opening a connection is expensive: TCP + TLS handshake + auth + server memory. Without pooling: 100 concurrent requests = 100 connections opened/closed. With pooling: 100 concurrent requests = 10-20 reused connections.
Pool sizing formula
connections = (core_count * 2) + spindle_count

For a 4-core server with SSDs: (4 * 2) + 0 = 8 connections.
Most teams set pool size too high. 50-100 per app instance is almost always wrong. More connections = more lock contention and memory on the DB. Start small, measure.
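Both rules as code. The formulas come from the text above; the `per_instance_pool` helper is an added assumption to make the cross-instance point concrete - the sum of every app instance's pool must stay under the database's `max_connections`.

```python
def pool_size(core_count, spindle_count=0):
    """Starting point: connections = (core_count * 2) + spindle_count."""
    return core_count * 2 + spindle_count

def per_instance_pool(db_max_connections, app_instances, reserved=5):
    """Divide the DB's max_connections across app instances, keeping a
    small reserve for admin sessions and migrations."""
    return max(1, (db_max_connections - reserved) // app_instances)

# 4-core DB server, SSDs only:              pool_size(4) == 8
# max_connections=100 shared by 10 apps:    per_instance_pool(100, 10) == 9
```

Treat both numbers as starting points to measure against, not targets.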
Connection leak detection
Symptoms: pool exhaustion after hours, connection count grows monotonically, works after restart then degrades.
pool.on('acquire', (conn) => {
log.debug('acquired', conn.id, stack_trace())
setTimeout(() => {
if (conn.still_acquired)
log.warn('possible leak', conn.id, conn.acquired_stack_trace)
}, 30_000)
})
pool.on('release', (conn) => log.debug('released', conn.id))

PgBouncer modes
| Mode | Behavior | Default choice? |
|---|---|---|
| Transaction pooling | Returned after each transaction | Yes - use this |
| Session pooling | Held for entire session | Only if you need temp tables/prepared stmts |
| Statement pooling | Returned after each statement | Rarely useful |
Pool exhaustion fixes
| Symptom | Fix |
|---|---|
| "pool exhausted" errors | Check for leaks first, then increase size |
| Queries queueing | Add query timeout, move slow queries to replica |
| Idle-in-transaction | Acquire connection only when executing queries |
| Burst errors after deploy | Add connection warmup, set min pool size |
4. Async Patterns
When to use what
| Pattern | Best for | Avoid when |
|---|---|---|
| Event loop (Node, asyncio) | I/O-bound: HTTP, DB, file reads | CPU work blocks the loop |
| Thread pool (Java, Go) | CPU-bound: image processing, parsing | Threads >> cores |
| Hybrid (worker threads) | Mixed workloads | Over-engineering simple I/O services |
The rule: Waiting on external things? Async. Crunching data on CPU? Threads.
Promise/future patterns
// Fan-out (parallel I/O) - total time = max, not sum
results = await Promise.all([fetch_user(id), fetch_orders(id), fetch_prefs(id)])
// Settle all, handle partial failures
results = await Promise.allSettled([task1, task2, task3])
succeeded = results.filter(r => r.status === 'fulfilled')

Background job queues
Use when: work >500ms, user doesn't wait, must survive restarts, needs retry logic.
Queue checklist:
- Jobs are idempotent (safe to retry)
- Max retry count with exponential backoff
- Dead letter queue for permanently failed jobs
- Payload contains IDs, not full objects
- Per-job-type processing timeout
Worker pool sizing
// I/O-bound: core_count * (1 + wait_time / compute_time)
// CPU-bound: core_count (or core_count - 1 for headroom)
// Mixed: start at core_count * 2, adjust by measured utilization

5. Caching for Performance
Hot path identification
- Sort endpoints by request volume
- Check if response is cacheable (same input = same output)
- Measure per-request cost (DB queries, external calls)
- Cache where `volume * cost` is highest
Cache hit ratio targets
| Cache type | Target | Below target means |
|---|---|---|
| Hot path (profiles, config) | >95% | Bad key design or TTL too short |
| Search/feeds | >80% | High cardinality, try partial caching |
| Aggregations | >70% | Data changes too fast |
| Sessions | >99% | Store misconfigured |
A cache with <50% hit ratio is added latency and cost. Remove it.
Cache-aside with database fallback
function get_user(user_id):
cached = cache.get("user:" + user_id)
if cached != null: return cached
user = db.query("SELECT ... WHERE id = ?", user_id)
if user == null:
cache.set("user:" + user_id, NULL_SENTINEL, ttl=60) // prevent stampede
return null
cache.set("user:" + user_id, user, ttl=3600)
return user
function update_user(user_id, data):
db.update(...)
cache.delete("user:" + user_id) // DELETE, don't SET - avoids race conditions

Always invalidate by deleting, not updating. Delete + lazy reload avoids stale-write race conditions.
Materialized views as caching
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day, SUM(amount), COUNT(*)
FROM orders GROUP BY 1;
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue; -- no read lock

6. Memory and Resource Management
Memory leak detection
| Symptom | Common cause |
|---|---|
| RSS grows over hours | Unbounded caches, event listener accumulation |
| GC pauses increasing | Objects promoted to old gen |
| OOM kills | Limit too low or actual leak |
Process: Take heap snapshot after warmup. Run load 30min. Take second snapshot. Diff objects - growing counts = likely leak source.
GC tuning basics
| Concern | Lever |
|---|---|
| Frequent minor GCs | Increase young gen size |
| Long major GC pauses | Use concurrent collector (G1GC, ZGC) |
| High GC overhead (>10% CPU) | Reduce allocation rate in code first |
GC tuning is a last resort. Reduce allocation rate by fixing the code first.
Resource limits checklist
- File descriptors - Set to 65535+. Default 1024 is too low. (`ulimit -n`)
- DB connections - Match `max_connections` to total pool size across all instances
- Memory limits - Leave 10-20% headroom above peak for GC and OS
- TIME_WAIT sockets - Tune `tcp_tw_reuse` for high outbound connection rates
7. Latency Budgets
Budget allocation example (200ms total)
API Gateway: 5ms -> Auth: 15ms -> Main Service: 80ms -> Downstream: 50ms
Main Service: |-> DB: 30ms |-> Cache: 5ms |-> Compute: 20ms
Cross-cutting: Network overhead: 40ms, Serialization: 10ms

- Reserve 20% for network overhead and variance
- No single dependency gets >40% of total budget
- Set timeouts equal to budget allocation at each hop
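Setting per-hop timeouts from the remaining budget is easiest with a deadline helper. A sketch using monotonic time; the downstream client calls in the comments are hypothetical signatures, shown only to illustrate where the remaining budget gets passed.

```python
import time

def with_deadline(total_budget_s):
    """Return a callable giving the remaining budget in seconds.

    Pass it as the timeout on every downstream call, so no hop can
    spend budget that an earlier hop already consumed."""
    deadline = time.monotonic() + total_budget_s
    def remaining():
        return max(0.0, deadline - time.monotonic())
    return remaining

# remaining = with_deadline(0.200)             # 200ms total budget
# rows = db.query(sql, timeout=remaining())    # hypothetical client signatures
# resp = http_client.get(url, timeout=remaining())
```

In a distributed system, forward the remaining budget in a header (gRPC does this natively with deadlines) so downstream services inherit it.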
p50 vs p99
| Percentile | What it tells you |
|---|---|
| p50 | Typical experience - half see this or better |
| p95 | Bad-day experience - 1 in 20 requests |
| p99 | Worst realistic experience - often your biggest users |
Optimize for p99 first. A service with p50=100ms, p99=150ms is better than p50=30ms, p99=3000ms. Consistency beats average speed. At 100 requests per session, most users (roughly two in three) hit a p99-or-worse request at least once.
Tail latency amplification
When fanning out to N parallel calls, overall latency = slowest response:
| Fan-out | Impact | Mitigation |
|---|---|---|
| 1-3 services | Manageable | Standard retry + timeout |
| 5-10 services | p99 becomes common | Hedged requests (send to 2, take first) |
| 10-50 services | Tail dominates | Deadline propagation, partial results |
| 50+ services | Assume partial failure | Best-effort response, fill in async |
Key mitigations:
- Hedged requests - After p50 latency, send duplicate to another replica
- Deadline propagation - Pass remaining budget downstream
- Partial results - Return incomplete data with completeness indicator
- Request coalescing - Deduplicate identical in-flight downstream calls
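A hedged-request sketch using a thread pool. Assumptions: `call_replica(replica)` is a blocking call you supply, and the first response wins. Note the `with` block's shutdown waits for the straggler to finish; a production version would cancel or detach it instead.

```python
import concurrent.futures

def hedged_request(call_replica, replicas, hedge_after_s):
    """Send to one replica; if no response after ~p50 latency,
    duplicate the request to a second replica and take the first answer."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(call_replica, replicas[0])
        try:
            # Fast path: primary answers within the hedge delay, no extra load.
            return first.result(timeout=hedge_after_s)
        except concurrent.futures.TimeoutError:
            # Primary is slow: hedge to a second replica.
            second = pool.submit(call_replica, replicas[1])
            done, _ = concurrent.futures.wait(
                [first, second],
                return_when=concurrent.futures.FIRST_COMPLETED,
            )
            return done.pop().result()
```

Setting the hedge delay near p50 means roughly half of requests never send the duplicate, bounding the extra load.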
scalable-systems.md
Scalable Systems Reference
Opinionated defaults for mid-level backend engineers. Follow the defaults unless you have a measured reason to deviate.
1. Scaling Decision Tree
START: Is your service stateless?
|
+-- NO --> Make it stateless first. Move state to a database/cache/object store.
| Do NOT scale a stateful service horizontally without understanding the cost.
|
+-- YES --> Are you CPU or memory bound?
+-- CPU bound, single-threaded bottleneck --> Scale UP (vertical) first.
+-- Memory bound, dataset fits in one box --> Scale UP (vertical) first.
+-- Already on the largest practical instance --> Scale OUT (horizontal).
+-- Need fault tolerance across zones/regions --> Scale OUT (horizontal).
+-- Traffic is spiky and you need elasticity --> Scale OUT (horizontal).

Vertical vs Horizontal - Decision Table
| Signal | Scale UP | Scale OUT |
|---|---|---|
| Single-threaded bottleneck | YES | No |
| Need >99.99% availability | No | YES |
| Spiky/unpredictable traffic | No | YES |
| Simple ops team (< 3 engineers) | YES | No |
| Already on largest instance class | No | YES |
| Stateful workload hard to partition | YES | No |
Default: Start vertical, go horizontal when forced.
Session Affinity Pitfalls
- Session affinity (sticky sessions) makes horizontal scaling a lie - you get uneven load.
- If one node goes down, all its sticky sessions are lost.
- Fix: externalize session state to Redis/database. Make every request routable to any node.
- Only use session affinity as a temporary migration step, never as a permanent architecture.
2. Caching Patterns
Pattern Selection
| Pattern | Read Heavy | Write Heavy | Consistency Need | Complexity |
|---|---|---|---|---|
| Cache-aside | YES | No | Eventual OK | Low |
| Read-through | YES | No | Eventual OK | Medium |
| Write-through | No | YES | Strong needed | Medium |
| Write-behind | No | YES | Eventual OK | High |
Default: Cache-aside. Simplest pattern, works for 80% of use cases.
Cache-Aside (Lazy Loading)
function get(key):
value = cache.get(key)
if value is null:
value = database.get(key)
cache.set(key, value, ttl=300)
return value
function update(key, new_value):
database.update(key, new_value)
cache.delete(key) // DELETE, not SET. Avoids race conditions.

Rule: On write, DELETE the cache entry. Do not try to update it. Updating creates race conditions between concurrent writes.
Read-Through
Same as cache-aside but the cache library handles the DB fetch. Use when your cache layer supports it natively. The application only talks to the cache.
Write-Through
function update(key, new_value):
cache.set(key, new_value) // Cache layer writes to DB synchronously before returning

Use when you need strong read-after-write consistency AND you read recently written data.
Write-Behind (Write-Back)
function update(key, new_value):
cache.set(key, new_value)
queue_async_write(key, new_value) // Persisted to DB asynchronously

Use only when you can tolerate data loss on cache failure. Good for analytics counters, non-critical metrics.
TTL Strategy
| Data Type | TTL | Reasoning |
|---|---|---|
| User profile | 5-15 min | Changes infrequently, stale OK briefly |
| Product catalog | 1-5 min | Moderate change rate |
| Session data | 30 min | Matches session timeout |
| Config/feature flags | 30-60 sec | Need fast propagation |
| Computed aggregations | 1-15 min | Expensive to recompute |
Default: 5 minutes. Adjust based on how much staleness your users tolerate.
Cache Invalidation Approaches
- TTL expiry (default) - simplest, handles most cases
- Explicit delete on write - use with cache-aside
- Event-driven invalidation - publish invalidation events on write; subscribers delete keys
- Version keys - append a version number to the cache key; bump version on write
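The version-key approach in a few lines, sketched against a generic get/set cache client:

```python
def versioned_key(cache, entity, entity_id):
    """Build the real cache key from a per-entity version counter."""
    version = cache.get(f"ver:{entity}:{entity_id}") or 0
    return f"{entity}:{entity_id}:v{version}"

def invalidate(cache, entity, entity_id):
    """Bump the version. Old keys are never read again and age out via TTL."""
    current = cache.get(f"ver:{entity}:{entity_id}") or 0
    cache.set(f"ver:{entity}:{entity_id}", current + 1)
```

The win over explicit deletes: one version bump invalidates every derived key that embeds it (`user:42:v3:profile`, `user:42:v3:feed`, ...) without enumerating them.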
Redis vs Memcached
| Factor | Redis | Memcached |
|---|---|---|
| Data structures | Rich | Key-value |
| Persistence | Yes | No |
| Replication | Yes | No |
| Pub/Sub | Yes | No |
| Memory efficiency | Good | Better |
| Multi-threaded | No (mostly) | Yes |
| Max value size | 512 MB | 1 MB |
Default: Redis. Unless you only need a simple, volatile key-value cache at massive scale with no persistence needs.
Multi-Layer Caching
Request -> L1 (in-process, e.g., Caffeine/Guava) -> L2 (distributed, e.g., Redis) -> Database

- L1: In-process. Sub-millisecond. Small capacity (100 MB - 1 GB). Short TTL (30-60 sec).
- L2: Distributed. Single-digit milliseconds. Large capacity. Longer TTL.
Use L1 when you have a hot key problem or read latency matters at p99. Skip L1 when data changes frequently and consistency matters, or your service has many instances (each L1 diverges).
Cache Stampede Prevention
When a popular key expires, hundreds of requests hit the database simultaneously.
Solution 1 - Locking (default):
function get_with_lock(key):
value = cache.get(key)
if value is null:
if acquire_lock(key, timeout=5s):
value = database.get(key)
cache.set(key, value, ttl=300)
release_lock(key)
else:
sleep(50ms)
return get_with_lock(key) // Retry, lock holder populates cache
return value

Solution 2 - Probabilistic early expiry:
function get_with_early_recompute(key):
value, expiry = cache.get_with_expiry(key)
ttl_remaining = expiry - now()
if value is not null AND random() < exp(-ttl_remaining / beta):
value = database.get(key)
cache.set(key, value, ttl=300)
return value

Use locking for most cases. Use probabilistic early expiry for extremely hot keys where lock contention itself becomes a bottleneck.
3. Message Queues and Event-Driven Architecture
When to Use a Queue vs Direct Call
| Scenario | Direct Call | Queue |
|---|---|---|
| Need synchronous response | YES | No |
| Caller does not need to know the result | No | YES |
| Downstream is unreliable or slow | No | YES |
| Need to fan out to multiple consumers | No | YES |
| Processing can be deferred | No | YES |
| Request-response within 100ms SLA | YES | No |
| Need retry with backoff | No | YES |
Default: Direct call for queries, queue for commands/events that don't need an immediate response.
Pub/Sub vs Point-to-Point
- Point-to-point (queue): One message, one consumer. Use for task distribution.
- Pub/Sub (topic): One message, many consumers. Use for event notification (e.g., order placed - notify shipping AND billing AND analytics).
Delivery Semantics
| Semantic | Guarantee | Use When |
|---|---|---|
| At-most-once | May lose messages | Metrics, logs (acceptable loss) |
| At-least-once | May duplicate messages | Most business logic (DEFAULT) |
| Exactly-once | No loss, no duplicates | Financial transactions |
Default: At-least-once with idempotent consumers. Exactly-once is expensive and often an illusion - make your consumers idempotent instead.
function handle_message(msg):
if already_processed(msg.idempotency_key):
ack(msg)
return
process(msg)
mark_processed(msg.idempotency_key)
ack(msg)

Dead Letter Queues (DLQ)
- Always configure a DLQ. Set max retry count to 3-5.
- Monitor DLQ depth. Alert when it grows.
- Build tooling to replay DLQ messages after fixing the bug.
- Never silently drop messages.
Backpressure Handling Checklist
- Set max queue depth. Reject or shed load when full.
- Consumer scales with queue depth (autoscaling).
- Producer has circuit breaker if queue is unavailable.
- Monitor consumer lag. Alert when lag exceeds N minutes.
- Have a plan for falling hours behind (skip, batch, prioritize).
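The first checklist item - a bounded queue that sheds load instead of growing without limit - is a few lines with Python's standard library; the `maxsize` value is a placeholder you would derive from your latency budget:

```python
import queue

def submit(q, item):
    """Enqueue without blocking; shed load when the queue is full."""
    try:
        q.put_nowait(item)
        return True          # accepted
    except queue.Full:
        return False         # caller responds 503 / tells client to back off

work = queue.Queue(maxsize=1000)   # max queue depth: pick from your SLA, not infinity
```

Rejecting at the front door keeps latency bounded for the requests you do accept; an unbounded queue just converts overload into multi-minute lag.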
Kafka vs RabbitMQ vs SQS
| Factor | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Throughput | Very high (1M+/sec) | High (50K+/sec) | High (managed) |
| Ordering | Per-partition | Per-queue | Best-effort (FIFO*) |
| Message replay | Yes (retention) | No | No |
| Consumer groups | Native | Manual | No |
| Ops complexity | High | Medium | None (managed) |
| Exactly-once | Yes (with config) | No | FIFO dedup window |
| Best for | Event streaming, event sourcing | Task queues, RPC patterns | Simple async, serverless, AWS |
Defaults:
- AWS shop, simple async work -> SQS
- Need message replay or event streaming -> Kafka
- Complex routing, priority queues, RPC -> RabbitMQ
4. Load Balancing
Algorithm Decision Table
| Algorithm | Best For | Avoid When |
|---|---|---|
| Round-robin | Homogeneous backends, stateless | Backends have different capacity |
| Weighted round-robin | Heterogeneous backends | Load varies per request |
| Least connections | Varying request duration | Accurate connection counts are unavailable (e.g., multiplexed connections) |
| Consistent hashing | Caching layers, sticky needs | Backends frequently added/removed |
| Random | Simple, surprisingly effective | Need deterministic routing |
Default: Round-robin for stateless services. Least connections when request latency varies significantly.
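Consistent hashing is the one algorithm in the table worth seeing in code. A minimal sketch of a ring with virtual nodes (MD5 is used purely for uniform key spread, not in a security context):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring,
        # which evens out the key distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First virtual node clockwise from the key's hash; wrap at the end.
        i = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self.ring[i][1]
```

The property that matters for caching layers: removing a node only remaps the keys that were on it, so the rest of your cache stays warm.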
L4 vs L7 Load Balancing
| Feature | L4 (Transport) | L7 (Application) |
|---|---|---|
| Inspects | TCP/UDP headers | HTTP headers, URL, body |
| Performance | Faster (less parsing) | Slower (full parsing) |
| Routing | IP + port only | Path, header, cookie |
| TLS termination | Pass-through or term | Always terminates |
| Cost | Lower | Higher |
| Use case | TCP services, DBs | HTTP APIs, web apps |
Default: L7 for HTTP services (you almost always want path-based routing, header inspection, and TLS termination).
Health Check Checklist
- Liveness check: is the process alive? (simple HTTP 200)
- Readiness check: can it serve traffic? (DB connected, caches warm)
- Health endpoint checks downstream dependencies with timeouts
- Interval: 10 seconds. Threshold: 3 consecutive failures before removing.
- Health check is lightweight - no expensive DB queries.
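The liveness/readiness split from the checklist, as a sketch - the check callables here are hypothetical placeholders, and each is expected to enforce its own short timeout:

```python
def liveness():
    """Process is up - nothing else. Always cheap, always fast."""
    return 200

def readiness(checks):
    """Ready only if every downstream dependency check passes.

    `checks` maps name -> zero-arg callable returning True/False.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False   # a failing dependency must not crash the probe
    return (200 if all(results.values()) else 503), results
```

Keeping liveness dumb matters: if liveness also checks the database, a database blip makes the orchestrator restart healthy processes, turning a partial outage into a full one.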
Connection Draining
When removing a backend from the pool:
- Stop sending new requests to the node.
- Let in-flight requests complete (drain timeout: 30 seconds).
- If requests still in-flight after timeout, forcibly close.
- Shut down the node.
Never skip connection draining during deployments. It causes user-visible errors.
5. Microservices Patterns
Service Boundary Checklist
- The service owns its data (no shared database with another service).
- The service can be deployed independently.
- The service maps to a bounded context (one business domain).
- The team can make changes without coordinating with other teams.
- It is not so small that every operation requires cross-service calls.
If two services are always deployed together or always change together, merge them.
Sync vs Async Communication
| Factor | Sync (HTTP/gRPC) | Async (Events/Queues) |
|---|---|---|
| Need immediate response | YES | No |
| Temporal coupling acceptable | YES | No |
| Fan-out to many services | No | YES |
| Resilience to downstream fail | Low | High |
| Debugging/tracing difficulty | Low | High |
Default: Sync for queries (reads). Async for commands/events (writes that trigger side effects).
API Gateway Pattern
Place an API gateway in front of your microservices when:
- Clients need a single entry point (mobile, SPA).
- You need cross-cutting concerns: auth, rate limiting, request logging.
- You want to aggregate responses from multiple services.
Do NOT build a custom gateway. Use an existing one (Kong, Envoy, AWS API Gateway).
Service Mesh
Consider a service mesh (Istio, Linkerd) when you have 15+ services and need standardized observability, mTLS, and traffic management. Below that scale, skip it - the operational overhead is not worth it.
Database Per Service
Non-negotiable. Each service owns its data store. Other services access data through the service's API, never by querying the database directly.
If you need data from another service:
- Call their API (sync).
- Consume their events and maintain a local read model (async, preferred for high-read scenarios).
- Never share a database. It creates hidden coupling that makes independent deployment impossible.
6. CAP Theorem Applied
What CAP Actually Means
- Consistency (C): Every read returns the most recent write.
- Availability (A): Every request receives a response (not an error).
- Partition Tolerance (P): System continues operating despite network partitions between nodes.
Key insight: Network partitions will happen. You do not get to opt out of P. Your real choice is between C and A during a partition.
Why "CA" Does Not Exist
A "CA" system assumes the network never partitions. In a distributed system, this is fantasy. A single-node PostgreSQL is technically "CA" but it is not distributed - it is just a database. The moment you add a second node, you must handle partitions.
CP vs AP - Decision Table
| System Type | During Partition | Example Systems | Use When |
|---|---|---|---|
| CP | Rejects some requests to preserve consistency | ZooKeeper, etcd, HBase, MongoDB* (with majority write concern) | Correctness > availability. Financial data, leader election, distributed locks, inventory counts |
| AP | Serves possibly stale data to stay available | Cassandra, DynamoDB, CouchDB, DNS | Availability > consistency. Shopping carts, social feeds, user preferences, analytics |
*MongoDB is configurable - it can behave as CP or AP depending on read/write concern settings.
Real-World Examples
CP - Bank account balance: User withdraws during a partition. CP rejects the request rather than risk an overdraft. Temporary unavailability beats losing money.
AP - Social media feed: During a partition, a 30-second stale feed beats an error page. Users notice downtime but not slight staleness.
Practical Checklist
- Identify: Is this data correctness-critical or availability-critical?
- Correctness-critical (money, inventory, auth) - choose CP.
- Availability-critical (feeds, recommendations, caches) - choose AP.
- Most systems are a mix - different data stores for different needs.
- Do not over-index on CAP. Partitions are rare. Focus on latency vs consistency tradeoffs for the common case.
Quick Reference - Opinionated Defaults
| Decision | Default |
|---|---|
| Scaling strategy | Vertical first, horizontal when forced |
| Caching pattern | Cache-aside with TTL |
| Cache technology | Redis |
| Cache TTL | 5 minutes |
| Cache invalidation | Delete on write + TTL expiry |
| Stampede prevention | Locking |
| Message queue (AWS) | SQS |
| Message queue (streaming) | Kafka |
| Delivery semantics | At-least-once + idempotent consumers |
| Load balancer algorithm | Round-robin (stateless), least-conn (varied) |
| Load balancer layer | L7 for HTTP services |
| Service communication | Sync for reads, async for events |
| Database sharing | Never. Database per service. |
| CAP for money | CP |
| CAP for feeds | AP |
schema-design.md
Database Schema Design Reference
1. Normalization Levels
| Normal Form | Rule | Violation Example |
|---|---|---|
| 1NF | Every column holds atomic (single) values; no repeating groups | tags = "api,auth,billing" in one column |
| 2NF | 1NF + every non-key column depends on the entire primary key | Composite PK (order_id, product_id) with customer_name depending only on order_id |
| 3NF | 2NF + no non-key column depends on another non-key column | zip_code -> city stored alongside other customer fields |
-- 1NF violation: repeating group
CREATE TABLE orders (id SERIAL PRIMARY KEY, items TEXT); -- "widget:3,gadget:1"
-- Fix: separate order_items table with one row per item
-- 2NF violation: partial key dependency
CREATE TABLE order_details (
order_id INT, product_id INT, customer_name TEXT, quantity INT,
PRIMARY KEY (order_id, product_id)
-- customer_name depends only on order_id; move it to orders table
);
-- 3NF violation: transitive dependency
CREATE TABLE customers (
id SERIAL PRIMARY KEY, name TEXT, zip_code TEXT,
city TEXT -- city depends on zip_code, not customer; extract to zip_codes table
);
When to Stop Normalizing
Default: normalize to 3NF. Deviate when you have measured evidence.
| Scenario | Recommendation |
|---|---|
| OLTP with mixed reads/writes | 3NF - normalize fully |
| Read-heavy analytics / dashboards | Denormalize into materialized views or reporting tables |
| Hot-path query joining 5+ tables | Denormalized read model alongside the normalized source |
| Write-heavy ingestion (logs, events) | Append-only flat tables; normalize downstream |
Rule of thumb: Normalize the source of truth. Denormalize the read path. Never denormalize your only copy.
2. Indexing Strategies
Index Types
| Index Type | Best For | Avoid When |
|---|---|---|
| B-tree (default) | Range queries, equality, sorting, BETWEEN | Full-text search, array containment |
| Hash | Exact equality lookups only | Range queries, sorting |
| GIN | Array containment (@>), full-text search, JSONB | Simple scalar equality; high-write tables |
| GiST | Geometric/spatial data, range types, nearest-neighbor | Simple equality or range on scalars |
Default choice: B-tree. Switch only when the query pattern demands it.
Composite Index Column Ordering
- Leftmost prefix rule - an index on (a, b, c) supports (a), (a, b), (a, b, c), but NOT (b, c) alone.
- Equality columns first, then range/sort columns.
-- Query: WHERE tenant_id = ? AND created_at > ? ORDER BY created_at
CREATE INDEX idx_tenant_created ON events (tenant_id, created_at); -- correct
-- NOT: (created_at, tenant_id) -- breaks leftmost prefix for tenant-scoped queries
Covering and Partial Indexes
-- Covering: DB never touches the heap (index-only scan)
CREATE INDEX idx_users_email ON users (email) INCLUDE (name);
-- Partial: index only the rows you actually query
CREATE INDEX idx_orders_pending ON orders (created_at) WHERE status = 'pending';
When NOT to Index
| Scenario | Why |
|---|---|
| Column with < 5 distinct values on a large table | Low selectivity; full scan is often faster |
| Write-heavy table with rarely queried columns | Every index slows INSERT/UPDATE/DELETE |
| Small tables (< 10k rows) | Sequential scan is fast enough |
| Columns only in SELECT, never in WHERE/JOIN/ORDER BY | Index won't be used |
Reading EXPLAIN Output
Seq Scan -- full table scan; missing index or low selectivity
Index Scan -- good; using index, fetching from heap
Index Only Scan -- best; all data from the index (covering index)
Bitmap Index Scan -- combines multiple indexes; okay for moderate selectivity
Nested Loop -- fine for small outer sets; watch for large outer sets
Hash Join -- good for large equi-joins
Sort -- check for "external merge Disk" (spilling to disk = bad)
EXPLAIN checklist:
- No unexpected Seq Scan on large tables
- rows estimate within 10x of actual rows (with EXPLAIN ANALYZE)
- No Sort Method: external merge Disk
- Join order makes sense (smaller table drives the join)
3. Migration Safety Patterns
The Expand-Contract Pattern
Use for any rename, type change, or restructure with zero downtime.
| Step | Action | Example |
|---|---|---|
| 1. Expand | Add new structure alongside old | ALTER TABLE users ADD COLUMN handle TEXT; |
| 2. Dual-write | Write to both columns; read new with fallback to old | Application code change |
| 3. Backfill | Populate new column for existing rows (in batches) | UPDATE users SET handle = username WHERE handle IS NULL LIMIT 5000; |
| 4. Cutover | All code reads/writes only new column | Application code change |
| 5. Contract | Drop old column in subsequent deploy | ALTER TABLE users DROP COLUMN username; |
Online DDL Safety
| Operation | Safe Online? | Notes |
|---|---|---|
| ADD COLUMN (nullable, no default) | Yes | Metadata-only in most DBs |
| ADD COLUMN with default | Depends | PostgreSQL 11+ safe; older versions rewrite table |
| DROP COLUMN | Yes (logically) | Mark unused first |
| RENAME COLUMN | No | Breaks running application code instantly |
| ALTER COLUMN TYPE | No | May rewrite table; use expand-contract |
| ADD INDEX | Use CONCURRENTLY | Avoids write locks |
| ADD NOT NULL | Risky | Full table scan for validation; add as check constraint first |
Zero-Downtime Migration Checklist
- Migration only adds - no renames, drops, or type changes
- New columns are nullable or have a safe default
- Indexes created with CONCURRENTLY (or equivalent)
- No full-table locks held for more than a few seconds
- Application code backward-compatible with old and new schema
- Backfills run in batches (1k-10k rows), not a single UPDATE
- Rollback plan documented
- Tested against production-sized dataset
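The "batches, not a single UPDATE" rule above can be made concrete. A sketch using SQLite purely as a stand-in for your production database - the `users`/`handle` columns match the expand-contract example, and the batch size and throttle are tunables, not gospel:

```python
import sqlite3
import time

def backfill(conn, batch=5000, throttle=0.0):
    """Copy username into handle in keyed batches (expand-contract step 3)."""
    cursor = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM users WHERE handle IS NULL AND id > ? "
            "ORDER BY id LIMIT ?",
            (cursor, batch),
        ).fetchall()
        if not rows:
            break
        ids = [r[0] for r in rows]
        placeholders = ",".join("?" * len(ids))
        conn.execute(
            f"UPDATE users SET handle = username WHERE id IN ({placeholders})",
            ids,
        )
        conn.commit()          # short transactions: locks held briefly
        cursor = ids[-1]       # keyset pagination, not OFFSET
        time.sleep(throttle)   # throttle to limit replication lag
```

Keying on `id` rather than re-scanning `WHERE handle IS NULL` from the start keeps each batch cheap even on large tables.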
Backfill Pattern
-- Pseudocode: batched backfill
cursor = 0
LOOP:
rows = SELECT id FROM table WHERE new_col IS NULL AND id > cursor
ORDER BY id LIMIT 5000
IF rows IS EMPTY: BREAK
UPDATE table SET new_col = derive(old_col) WHERE id IN (rows)
cursor = MAX(rows.id)
SLEEP 100ms -- throttle to reduce replication lag
4. Schema Evolution
Nullable Columns vs New Tables
| Approach | Use When |
|---|---|
| Nullable column | Data belongs to same entity; most rows will eventually have it; simple scalar |
| New table (1:1 FK) | Data is sparse (< 20% of rows); complex sub-object; different access patterns |
JSON Columns
| JSON is okay when... | JSON is tech debt when... |
|---|---|
| Schema-less by nature (user prefs, feature flags) | You query individual fields in WHERE clauses |
| Rarely queried, mostly stored/retrieved whole | You need referential integrity on nested data |
| Evolves faster than your deploy cycle | Multiple tables reference the same nested structure |
| Event payloads, audit log metadata | You write data->>'field' in every query |
If you index a JSON field more than twice, extract it into a real column.
Soft Deletes vs Hard Deletes
| Factor | Soft Delete (deleted_at) | Hard Delete (DELETE) |
|---|---|---|
| Data recovery | Easy - clear the flag | Requires backups |
| Query complexity | Every query needs WHERE deleted_at IS NULL | Queries are simple |
| Storage | Grows forever | Reclaims space |
| GDPR compliance | Does NOT satisfy "right to erasure" | Satisfies erasure |
| Performance over time | Table bloat slows queries | Table stays lean |
Recommended default: Hard delete with an audit log table for recovery needs.
CREATE TABLE audit_log (
id BIGSERIAL PRIMARY KEY,
table_name TEXT NOT NULL,
record_id BIGINT NOT NULL,
action TEXT NOT NULL, -- 'INSERT', 'UPDATE', 'DELETE'
old_data JSONB,
new_data JSONB,
performed_by TEXT,
performed_at TIMESTAMPTZ DEFAULT now()
);
If you must soft delete, use a partial index: CREATE INDEX idx_active ON users (email) WHERE deleted_at IS NULL;
Temporal Tables
When you need full history of every change, maintain a _history table:
CREATE TABLE products_history (
history_id BIGSERIAL PRIMARY KEY,
id INT NOT NULL, -- FK to products
name TEXT NOT NULL,
price_cents INT NOT NULL,
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ, -- NULL = current version
changed_by TEXT
);
-- Populate via trigger or application hook on every UPDATE/DELETE
5. Data Types
Primary Keys: UUID vs Auto-Increment
| Factor | Auto-Increment (BIGSERIAL) | UUID (v4 or v7) |
|---|---|---|
| Size | 8 bytes | 16 bytes |
| Index performance | Excellent (sequential, no page splits) | v4: poor (random). v7: good (time-sorted) |
| Multi-source merge | Conflicts guaranteed | No conflicts |
| URL guessability | Trivially enumerable | Not guessable |
| Client-side generation | Not possible | Possible (offline/distributed) |
Recommended default: BIGSERIAL PK + separate public UUID column.
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
public_id UUID NOT NULL DEFAULT gen_random_uuid() UNIQUE,
email TEXT NOT NULL
);
-- Internal joins use 'id' (fast). APIs expose 'public_id' (safe).
For distributed/multi-region systems, use UUIDv7 (time-sorted) as the PK directly.
Timestamps
Always use TIMESTAMPTZ. No exceptions. Store UTC, convert in the application layer.
created_at TIMESTAMPTZ NOT NULL DEFAULT now() -- good
created_at TIMESTAMP NOT NULL DEFAULT now() -- bad: loses timezone context
Money
Store as integer cents. Never use FLOAT or DOUBLE.
| Approach | Pros | Cons |
|---|---|---|
| INT (cents) | Fast, no rounding, easy math | Must convert for display |
| NUMERIC(p,s) | Exact decimals, multi-currency friendly | Slower than integer math |
| FLOAT/DOUBLE | Never use for money | 0.1 + 0.2 = 0.30000000000000004 |
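The float failure in the table takes one line to reproduce, and integer cents sidestep it entirely:

```python
# IEEE 754 floats accumulate binary rounding error
assert 0.1 + 0.2 != 0.3
assert 0.1 + 0.2 == 0.30000000000000004

# Integer cents: exact arithmetic, convert only at the display boundary
price_cents = 1999
qty = 3
total_cents = price_cents * qty
display = f"${total_cents // 100}.{total_cents % 100:02d}"
assert display == "$59.97"
```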
Enums: Column vs Lookup Table
| Approach | Use When | Avoid When |
|---|---|---|
| Check constraint | < 5 values, almost never changes | Values change quarterly |
| Enum type (PG CREATE TYPE) | Moderate set, rarely changes | Need to remove values (PG enums can't drop) |
| Lookup table | Values change often, need metadata, referenced across tables | Only 2-3 static values |
Recommended default: Check constraint for tiny static sets; lookup table for everything else.
-- Check constraint
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
status TEXT NOT NULL CHECK (status IN ('pending', 'confirmed', 'shipped', 'delivered'))
);
-- Lookup table
CREATE TABLE order_statuses (
id SMALLINT PRIMARY KEY,
name TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
status_id SMALLINT NOT NULL REFERENCES order_statuses(id)
);
Quick Decision Cheat Sheet
| Question | Default | Deviate When |
|---|---|---|
| How far to normalize? | 3NF | Read-heavy reporting needs denormalized views |
| Which index type? | B-tree | Arrays/JSONB (GIN), spatial (GiST), pure equality (hash) |
| Composite index order? | Equality first, then range/sort | Benchmark proves otherwise |
| Column rename strategy? | Expand-contract | Maintenance window is acceptable |
| Soft or hard delete? | Hard delete + audit log | Undo feature required in product |
| UUID or auto-increment? | BIGSERIAL + public UUID | Distributed system (use UUIDv7) |
| Timestamp type? | TIMESTAMPTZ, always | Never deviate |
| Money storage? | Integer cents | Multi-currency precision (use NUMERIC) |
| Enum or lookup table? | Check constraint < 5; lookup table otherwise | PG enum if values never removed |
| JSON column? | Only for schema-less data | Extract fields you query or index |
security.md
Security Reference
Opinionated security guidance for backend engineers. When in doubt, pick the more restrictive option. Security mistakes are silent until they are catastrophic.
1. Authentication Patterns
Comparison
| Method | Best for | Token location | Revocable? | Stateless? |
|---|---|---|---|---|
| Session-based | Traditional web apps | httpOnly cookie (server stores session) | Yes (delete from store) | No |
| JWT | SPAs, mobile, microservices | httpOnly cookie (preferred) | Not natively (use deny-list) | Yes |
| API keys | Service-to-service, third-party | Authorization header | Yes (rotate key) | Yes |
Recommendation: Use JWT in httpOnly cookies for user-facing apps. Use API keys + HMAC signatures for service-to-service. Never store tokens in localStorage or sessionStorage - XSS can steal them.
OAuth 2.0 / OIDC Flows
| Flow | Use when |
|---|---|
| Authorization Code + PKCE | User-facing web/mobile apps (always use PKCE, even for confidential clients) |
| Client Credentials | Service-to-service with no user context |
| Device Code | CLI tools, smart TVs, IoT |
Never use Implicit flow - it is deprecated. Never use Resource Owner Password flow unless migrating a legacy system and you control both client and server.
Token Refresh Pattern
1. Access token: short-lived (5-15 minutes)
2. Refresh token: longer-lived (7-30 days), stored httpOnly cookie, rotated on use
3. On 401 -> call /token/refresh with refresh token
4. Server issues new access + refresh token pair, invalidates old refresh token
5. If refresh token is reused (already rotated) -> revoke entire session (token theft)
MFA Considerations
- Prefer TOTP (app-based) over SMS (SIM-swapping risk)
- Support WebAuthn/passkeys as the strongest option
- Store MFA secrets encrypted at rest, never in plain text
- Implement MFA as a separate authentication step - do not issue full session until MFA completes
- Provide recovery codes (one-time use, hashed like passwords)
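App-based TOTP (the recommended option above) is small enough to sketch from the standard library - RFC 6238 with SHA-1 and 6 digits, the defaults most authenticator apps use. This is illustrative; in production, use a vetted library rather than hand-rolling:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, t=None, step=30, digits=6):
    """RFC 6238 TOTP: HMAC over the current 30-second counter, truncated."""
    key = base64.b32decode(secret_b32)
    counter = int((t if t is not None else time.time()) // step)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                     # RFC 4226 dynamic truncation
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)
```

Verification should also accept the previous window to tolerate clock skew, and must rate-limit attempts - a 6-digit code survives only a handful of guesses.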
2. Authorization
Model Comparison
| Model | How it works | Best for | Complexity |
|---|---|---|---|
| RBAC | User -> Role -> Permissions | Simple apps, internal tools | Low |
| ABAC | Policy evaluated against user/resource/environment attributes | Fine-grained rules, compliance-heavy | Medium |
| ReBAC | Permissions derived from relationships (user owns resource) | Social apps, multi-tenant, Google Zanzibar-style | High |
Recommendation: Start with RBAC. Move to ReBAC when you need relationship-based checks (e.g., "user can edit this document because they are in the owning org"). Use ABAC when you need policy rules that combine time, location, and resource attributes.
Enforcement Rules
- Default deny - if no rule explicitly grants access, deny
- Enforce at middleware layer - every route/resolver must pass through auth check
- Never trust client-side auth checks alone - they are UX hints, not security
- Check permissions on the resource, not just the route - a valid user hitting /api/orders/123 must own order 123 or have admin role
Row-Level Security Pattern
// Middleware approach
function authorize(user, resource_id):
resource = db.find(resource_id)
if resource is null:
return 404 // don't leak existence
if user.role == "admin":
return allow
if resource.owner_id != user.id:
return 403
return allow
// Database approach (Postgres RLS)
// ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
// CREATE POLICY user_orders ON orders
// USING (owner_id = current_setting('app.user_id')::int);
Permission Checking Patterns
// Centralized permission check - preferred
function checkPermission(user, action, resource):
permissions = getPermissionsForRole(user.role)
if action not in permissions:
throw Forbidden
if resource.requiresOwnership and resource.owner_id != user.id:
throw Forbidden
// Usage in every handler
function handleDeleteOrder(request):
checkPermission(request.user, "orders:delete", order)
// ... proceed
3. OWASP Top 10 - Practical Prevention
| # | Vulnerability | Prevention | Code pattern |
|---|---|---|---|
| A01 | Broken Access Control | Default deny, server-side checks on every request, deny by HTTP method | checkPermission() middleware on all routes |
| A02 | Cryptographic Failures | TLS everywhere, AES-256 at rest, Argon2 for passwords | Never log secrets, PII, or tokens |
| A03 | Injection | Parameterized queries, ORM with bind params, never string-concat SQL | db.query("SELECT * FROM users WHERE id = $1", [id]) |
| A04 | Insecure Design | Threat modeling, abuse case stories, rate limiting business logic | Limit failed login attempts per account |
| A05 | Security Misconfiguration | Harden defaults, disable debug in prod, remove unused endpoints | Automated config scanning in CI |
| A06 | Vulnerable Components | Dependency scanning (Dependabot, Snyk), pin versions, update regularly | npm audit / pip audit in CI pipeline |
| A07 | Auth Failures | MFA, rate limit login, lock after N failures, credential stuffing protection | Argon2 hashing, no default credentials |
| A08 | Data Integrity Failures | Verify signatures, use integrity checks on CI/CD, sign artifacts | Verify dependency checksums |
| A09 | Logging Failures | Log auth events, access denied, input validation failures | Structured logs with user ID, IP, action |
| A10 | SSRF | Allowlist outbound URLs, block internal IPs, validate/sanitize URLs | Reject 127.0.0.1, 169.254.x.x, 10.x.x.x in user input |
Injection Prevention Checklist
- All SQL uses parameterized queries or ORM with bind parameters
- No string concatenation or template literals in any query
- NoSQL queries use typed parameters (no raw $where or $regex from user input)
- OS command execution is avoided; if unavoidable, use allowlist of commands
- LDAP queries use proper escaping
Rate Limiting Login
// Per-account + per-IP limiting
function handleLogin(ip, email, password):
if rateLimiter.isBlocked(key="login:" + email, max=5, window=15min):
return 429 "Too many attempts, try again later"
if rateLimiter.isBlocked(key="login-ip:" + ip, max=20, window=15min):
return 429
result = authenticate(email, password)
if result == failure:
rateLimiter.increment("login:" + email)
rateLimiter.increment("login-ip:" + ip)
return 401 // use generic message: "Invalid email or password"
rateLimiter.reset("login:" + email)
return 200 with session
4. Secrets Management
Environment by Maturity
| Environment | Approach | Acceptable? |
|---|---|---|
| Local dev | .env file (in .gitignore) | Yes |
| CI/CD | Pipeline secrets (GitHub Actions secrets, GitLab CI vars) | Yes |
| Staging/Prod | Secrets manager (Vault, AWS Secrets Manager, GCP Secret Manager) | Required |
| Anywhere | Hardcoded in source code | Never |
Recommendations
- Use HashiCorp Vault for multi-cloud or on-prem. Use AWS Secrets Manager if all-in on AWS. Use GCP Secret Manager if all-in on GCP.
- Rotate secrets regularly - automate rotation with the secrets manager's built-in rotation (e.g., AWS Secrets Manager supports Lambda-based rotation)
- Least privilege - each service gets only the secrets it needs
- Audit access - log every secret read
Secret Hygiene Checklist
- .env is in .gitignore
- No secrets in Dockerfiles, docker-compose files, or Kubernetes manifests
- No secrets in application logs (mask or redact)
- Secret scanning enabled in CI (e.g., gitleaks, trufflehog, GitHub secret scanning)
- Pre-commit hook blocks commits containing high-entropy strings or known patterns
- Secrets are fetched at runtime, not baked into container images
- All secrets have a documented rotation schedule
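The "high-entropy strings" pre-commit heuristic can be approximated with Shannon entropy over each candidate token. A sketch only - real scanners like gitleaks combine entropy with known-credential patterns, and you will need an allowlist for false positives:

```python
import math

def shannon_entropy(s):
    """Bits of entropy per character, estimated from character frequencies."""
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

def looks_like_secret(token, min_len=20, threshold=4.0):
    """Flag long, high-entropy tokens (API keys, random credentials)."""
    return len(token) >= min_len and shannon_entropy(token) > threshold
```

English prose sits around 3 bits/char; random base64 keys sit near 6, so a threshold of ~4 separates them reasonably well.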
If a Secret Leaks
1. Rotate the compromised secret immediately
2. Audit access logs for unauthorized use
3. Determine blast radius (what systems/data were accessible?)
4. Notify affected parties per incident response policy
5. Post-mortem: how did it leak? Fix the process.
5. Encryption
Encryption at Rest and in Transit
| Context | Algorithm | Notes |
|---|---|---|
| Data in transit | TLS 1.3 (minimum TLS 1.2) | Terminate at load balancer, re-encrypt to backend if needed |
| Data at rest (general) | AES-256-GCM | Use cloud provider's KMS for key management |
| Password hashing | Argon2id | Preferred. Fallback: bcrypt (cost 12+). Never MD5, SHA-1, SHA-256 |
| API request signing | HMAC-SHA256 | Sign method + path + timestamp + body hash |
| File integrity | SHA-256 | For checksums and verification only, never for passwords |
Password Hashing Decision
Use Argon2id (winner of Password Hashing Competition)
- memory: 64MB minimum (higher is better)
- iterations: 3 minimum
- parallelism: 1 (or match CPU cores)
If Argon2id unavailable:
Use bcrypt with cost factor 12+
If bcrypt unavailable:
Use scrypt with N=2^15, r=8, p=1
Never use:
- MD5, SHA-1, SHA-256 for passwords (fast hashes = fast brute force)
- Unsalted hashes of any kind
- Custom/homegrown hashing schemes
HMAC for API Signatures
// Signing (client side)
function signRequest(method, path, body, secret, timestamp):
payload = method + "\n" + path + "\n" + timestamp + "\n" + sha256(body)
signature = hmac_sha256(secret, payload)
return base64(signature)
// Verification (server side)
function verifyRequest(request, secret):
timestamp = request.headers["X-Timestamp"]
if abs(now() - timestamp) > 5 minutes:
return reject // replay attack prevention
expected = signRequest(request.method, request.path, request.body, secret, timestamp)
if not constantTimeCompare(expected, request.headers["X-Signature"]):
return reject
return accept
Key Management Rules
- Never store encryption keys alongside encrypted data
- Use cloud KMS (AWS KMS, GCP Cloud KMS, Azure Key Vault) for key management
- Implement key rotation - encrypt new data with new key, re-encrypt old data on access
- Use envelope encryption: KMS encrypts a data key, data key encrypts actual data
- Log all key usage for audit
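The HMAC signing pseudocode above maps almost one-to-one onto Python's hmac and hashlib modules; note the constant-time compare, which prevents timing attacks on signature verification:

```python
import base64
import hashlib
import hmac
import time

def sign_request(method, path, body, secret, timestamp):
    """HMAC-SHA256 over method + path + timestamp + body hash."""
    payload = "\n".join([method, path, str(timestamp),
                         hashlib.sha256(body).hexdigest()])
    mac = hmac.new(secret, payload.encode(), hashlib.sha256).digest()
    return base64.b64encode(mac).decode()

def verify_request(method, path, body, secret, timestamp, signature, now=None):
    if abs((now if now is not None else time.time()) - timestamp) > 300:
        return False                                    # replay-attack window
    expected = sign_request(method, path, body, secret, timestamp)
    return hmac.compare_digest(expected, signature)     # constant-time compare
```

A plain `==` on signatures leaks how many leading bytes matched through response timing; `hmac.compare_digest` exists precisely to avoid that.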
6. Security Headers and Transport
Required Headers
| Header | Value | Purpose |
|---|---|---|
| Strict-Transport-Security | max-age=63072000; includeSubDomains; preload | Force HTTPS for 2 years |
| Content-Security-Policy | default-src 'self'; script-src 'self' | Prevent XSS via inline scripts |
| X-Content-Type-Options | nosniff | Prevent MIME type sniffing |
| X-Frame-Options | DENY | Prevent clickjacking |
| Referrer-Policy | strict-origin-when-cross-origin | Limit referrer leakage |
| Permissions-Policy | camera=(), microphone=(), geolocation=() | Disable unused browser features |
CORS Configuration
// Be explicit. Never use origin: "*" with credentials.
cors_config = {
allowed_origins: ["https://app.example.com"], // explicit list, no wildcards
allowed_methods: ["GET", "POST", "PUT", "DELETE"],
allowed_headers: ["Authorization", "Content-Type"],
allow_credentials: true,
max_age: 86400 // preflight cache: 24 hours
}
CORS rules:
- Never allow * origin with credentials: true
- Allowlist specific origins, do not reflect the Origin header blindly
- Keep max_age high to reduce preflight requests
Rate Limiting
| Strategy | Use case | Implementation |
|---|---|---|
| Token bucket | User-facing APIs | Allows bursts, refills at steady rate |
| Fixed window | Internal services | Simple, predictable |
| Sliding window | High-accuracy rate limiting | Smooths boundary spikes |
- Always return 429 Too Many Requests with a Retry-After header
- Rate limit by authenticated user ID, fall back to IP for unauthenticated requests
- Set request body size limits (e.g., 1MB default, higher for file upload endpoints)
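A token bucket - the table's default for user-facing APIs - fits in a few lines. A sketch with in-process state; in a real deployment, per-user buckets would live in Redis so all instances share them:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens, self.last = capacity, clock()

    def allow(self):
        now = self.clock()
        # Lazy refill: add tokens for the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # respond 429 with Retry-After
```

The injectable `clock` is just for testability; `capacity` sets the burst size and `rate` the sustained throughput.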
Input Sanitization
- Validate at system boundaries (API gateway, controller layer)
- Use allowlists for expected formats (regex for email, UUID pattern for IDs)
- Reject unexpected fields (strict schema validation)
- Encode output contextually (HTML-encode for HTML, JSON-encode for JSON)
- Never trust the Content-Type header alone - validate actual content
7. Security Checklist for New Services
Use this checklist before deploying any new backend service.
Authentication and Authorization
- Authentication is required on all non-public endpoints
- Authorization checks enforce default deny
- Permissions checked per-request at middleware level, not just at the edge
- Password hashing uses Argon2id (or bcrypt with cost 12+)
- Session/token expiration is configured (access: 15min, refresh: 30 days max)
- Failed login attempts are rate-limited per account and per IP
- MFA is available for privileged accounts
Transport and Encryption
- TLS 1.2+ required on all endpoints (TLS 1.3 preferred)
- HSTS header set with long max-age
- Sensitive data encrypted at rest using AES-256-GCM
- Database connections use TLS
- Internal service-to-service calls use mTLS or signed requests
Secrets
- No secrets in source code, Dockerfiles, or CI config files
- Secrets loaded from a secrets manager at runtime
- .env in .gitignore
- Rotation schedule documented for all secrets
Dependencies
- Dependency vulnerability scanning in CI (Dependabot, Snyk, or npm audit)
- No dependencies with known critical CVEs
- Dependencies pinned to specific versions (lockfile committed)
- Base container images are minimal and regularly updated
Logging and Monitoring
- Authentication events logged (login, logout, failure, MFA)
- Authorization failures logged with user ID, IP, and attempted resource
- No secrets, tokens, or passwords in logs
- PII in logs is masked or excluded
- Alerts configured for anomalous auth patterns (spike in 401s/403s)
Input Validation
- All SQL queries use parameterized statements
- Request body size limits enforced
- Input validated against strict schemas at API boundary
- File uploads validated for type, size, and scanned for malware
- CORS configured with explicit origin allowlist
Security Headers
- HSTS, CSP, X-Content-Type-Options, X-Frame-Options all set
- CORS does not use wildcard origin with credentials
- API versioning in place to avoid breaking changes silently
Quick Decision Reference
"Which auth pattern should I use?"
- User-facing web app -> OAuth 2.0 Authorization Code + PKCE, JWT in httpOnly cookie
- Mobile app -> Same as web, with secure storage for refresh token
- Service-to-service -> Client Credentials flow or API key + HMAC
- Third-party integration -> API key with scoped permissions
"Which hashing algorithm?"
- Passwords -> Argon2id, always
- Integrity checks -> SHA-256
- API signing -> HMAC-SHA256
- Never -> MD5, SHA-1 for anything security-related
"Where do I put secrets?"
- Dev -> .env file (gitignored)
- CI -> Pipeline secret variables
- Prod -> Secrets manager (Vault / AWS / GCP), fetched at runtime
Frequently Asked Questions
What is backend-engineering?
Use this skill when designing backend systems, databases, APIs, or services. Triggers on schema design, database migrations, indexing strategies, distributed systems architecture, microservices, caching, message queues, observability setup, logging, metrics, tracing, SLO/SLI definition, performance optimization, query tuning, security hardening, authentication, authorization, API design (REST, GraphQL, gRPC), rate limiting, pagination, and failure handling patterns. Acts as a senior backend engineering advisor for mid-level engineers leveling up.
How do I install backend-engineering?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill backend-engineering in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support backend-engineering?
backend-engineering works with claude-code, gemini-cli, openai-codex, mcp. Install it once and use it across any supported AI coding agent.