backend-engineering
Use this skill when designing backend systems, databases, APIs, or services. Triggers on schema design, database migrations, indexing strategies, distributed systems architecture, microservices, caching, message queues, observability setup, logging, metrics, tracing, SLO/SLI definition, performance optimization, query tuning, security hardening, authentication, authorization, API design (REST, GraphQL, gRPC), rate limiting, pagination, and failure handling patterns. Acts as a senior backend engineering advisor for mid-level engineers leveling up.
backend-engineering
backend-engineering is a production-ready AI agent skill for claude-code, gemini-cli, openai-codex, and mcp. It assists with designing backend systems, databases, APIs, and services.
Quick Facts
| Field | Value |
|---|---|
| Category | engineering |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex, mcp |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill backend-engineering
- The backend-engineering skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
A senior backend engineer's decision-making framework for building production systems. This skill covers the six pillars of backend engineering - schema design, scalable systems, observability, performance, security, and API design - with an emphasis on when to use each pattern, not just how. Designed for mid-level engineers (3-5 years) who know the basics and need opinionated guidance on trade-offs.
Tags
backend databases api-design distributed-systems security observability
Platforms
- claude-code
- gemini-cli
- openai-codex
- mcp
Frequently Asked Questions
What is backend-engineering?
Use this skill when designing backend systems, databases, APIs, or services. Triggers on schema design, database migrations, indexing strategies, distributed systems architecture, microservices, caching, message queues, observability setup, logging, metrics, tracing, SLO/SLI definition, performance optimization, query tuning, security hardening, authentication, authorization, API design (REST, GraphQL, gRPC), rate limiting, pagination, and failure handling patterns. Acts as a senior backend engineering advisor for mid-level engineers leveling up.
How do I install backend-engineering?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill backend-engineering in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support backend-engineering?
This skill works with claude-code, gemini-cli, openai-codex, mcp. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Backend Engineering
A senior backend engineer's decision-making framework for building production systems. This skill covers the six pillars of backend engineering - schema design, scalable systems, observability, performance, security, and API design - with an emphasis on when to use each pattern, not just how. Designed for mid-level engineers (3-5 years) who know the basics and need opinionated guidance on trade-offs.
When to use this skill
Trigger this skill when the user:
- Designs a database schema or plans a migration
- Chooses between monolith vs microservices or evaluates scaling strategies
- Sets up logging, metrics, tracing, or alerting
- Diagnoses a performance issue (slow queries, high latency, memory pressure)
- Implements authentication, authorization, or secrets management
- Designs a REST, GraphQL, or gRPC API
- Needs retry, circuit breaker, or idempotency patterns
- Plans data consistency across services (sagas, outbox, eventual consistency)
Do NOT trigger this skill for:
- Frontend-only concerns (CSS, React components, browser APIs)
- DevOps/infra provisioning (use a Terraform/Docker/K8s skill instead)
Key principles
Design for failure, not just success - Every network call can fail. Every disk can fill. Every dependency can go down. The question is not "will it fail" but "how does it degrade?" Design graceful degradation paths before writing the happy path.
Observe before you optimize - Never guess where the bottleneck is. Instrument first, measure second, optimize third. A 10ms query called 1000 times matters more than a 500ms query called once.
Simple until proven otherwise - Start with a monolith, a single database, and synchronous calls. Add complexity (microservices, queues, caches) only when you have evidence the simple approach fails. Every architectural boundary is a new failure mode.
Secure by default, not by afterthought - Auth, input validation, and encryption are not features to add later. They are constraints to build within from day one. Use established libraries. Never roll your own crypto.
APIs are contracts, not implementation details - Once published, an API is a promise. Design from the consumer's perspective inward. Version explicitly. Break nothing silently.
Core concepts
Backend engineering is the discipline of building reliable, performant, and secure server-side systems. The six pillars form a hierarchy:
Schema design is the foundation - get the data model wrong and everything built on top inherits that debt. Scalable systems define how components communicate and grow. Observability gives you eyes into what's actually happening in production. Performance is the art of making it fast after you've made it correct. Security is the set of constraints that keep the system trustworthy. API design is the surface area through which consumers interact with all of the above.
These pillars are not independent. A bad schema creates performance problems. Poor observability makes security incidents invisible. A poorly designed API forces clients into patterns that break your scaling strategy. Think of them as a connected system, not a checklist.
Common tasks
Design a database schema
Start from access patterns, not entity relationships. Ask: "What queries will this serve?" before drawing a single table.
Decision framework:
- Read-heavy, predictable queries -> Normalize (3NF), add targeted indexes
- Write-heavy, high throughput -> Consider denormalization, append-only tables
- Complex relationships with traversals -> Consider a graph model
- Unstructured/evolving data -> Document store (but think twice)
Indexing rule of thumb: Index columns that appear in WHERE, JOIN, and ORDER BY.
A composite index on (a, b, c) serves queries on (a), (a, b), and (a, b, c)
but NOT (b, c). See references/schema-design.md for detailed indexing strategies.
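The leftmost-prefix rule can be observed directly with a query planner. A minimal sketch using SQLite (table and index names are made up, and other engines word their EXPLAIN output differently, but the principle is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE INDEX idx_abc ON t (a, b, c)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(i % 10, i % 7, i) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether the planner seeks (SEARCH) or scans.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

print(plan("SELECT * FROM t WHERE a = 1"))           # seeks via idx_abc (SEARCH)
print(plan("SELECT * FROM t WHERE b = 1 AND c = 2")) # no leading a -> full scan
```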
Always plan migration rollbacks. A deploy that adds a column is safe. A deploy that drops a column is a one-way door. Use expand-contract migrations for breaking changes.
Choose a scaling strategy
Is a single server sufficient?
YES -> Stay there. Optimize vertically first.
NO -> Is the bottleneck compute or data?
COMPUTE -> Horizontal scale with stateless services + load balancer
DATA -> Is it read-heavy or write-heavy?
READ -> Add read replicas, then caching layer
WRITE -> Partition/shard the database

Only introduce microservices when you have: (a) independent deployment needs, (b) different scaling profiles per component, or (c) team boundaries that demand it.
Never split a monolith along technical layers (API service, data service). Split along business domains (orders, payments, inventory).
Set up observability
Implement the three pillars with correlation:
| Pillar | What it answers | Tool examples |
|---|---|---|
| Logs | What happened? | Structured JSON logs with correlation IDs |
| Metrics | How is the system performing? | RED metrics (Rate, Errors, Duration) |
| Traces | Where did time go? | Distributed traces across service boundaries |
Define SLOs before writing alerts. An SLO like "99.9% of requests complete in <200ms" gives you an error budget. Alert when the burn rate threatens the budget, not on every spike.
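The burn-rate arithmetic is simple enough to sketch (illustrative numbers; real alerting systems like Prometheus compute this over multiple windows):

```python
# Burn rate = observed error rate / error rate the SLO allows.
# 1.0 means the budget lasts exactly the SLO window; 14.4 means a
# 30-day budget is exhausted in roughly two days - a common page threshold.
def burn_rate(error_rate: float, slo_target: float) -> float:
    allowed = 1.0 - slo_target  # e.g. 99.9% SLO allows a 0.1% error rate
    return error_rate / allowed

print(burn_rate(0.001, 0.999))   # ~1.0  - burning exactly on plan
print(burn_rate(0.0144, 0.999))  # ~14.4 - budget gone in ~2 days, alert
```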
Diagnose a performance issue
Follow this checklist in order:
- Check metrics - is it CPU, memory, I/O, or network?
- Check slow query logs - are there N+1 patterns or full table scans?
- Check connection pools - are connections exhausted or leaking?
- Check external dependencies - is a downstream service slow?
- Profile the code - only after ruling out infrastructure causes
The fix for "the database is slow" is almost never "add more database." It's usually: add an index, fix an N+1, or cache a hot read path.
Secure a service
Minimum security checklist for any backend service:
- Authentication: Use OAuth 2.0 / OIDC for user-facing, API keys + HMAC for service-to-service. Never store plain-text passwords (bcrypt/argon2 minimum).
- Authorization: Implement at the middleware level. Default deny. Check permissions on every request, not just at the edge.
- Input validation: Validate at system boundaries. Use allowlists, not blocklists. Parameterize all SQL queries.
- Secrets: Use a secrets manager (Vault, AWS Secrets Manager). Never commit secrets to git. Rotate regularly.
- Transport: TLS everywhere. No exceptions.
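A minimal salted-hash sketch of the password rule above. The checklist calls for bcrypt/argon2; `hashlib.pbkdf2_hmac` is used here only because it needs no third-party package - swap in bcrypt or argon2-cffi in production:

```python
import hashlib
import hmac
import os

def hash_password(password: str, *, iterations: int = 600_000) -> bytes:
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt + digest  # store the salt alongside the hash

def verify_password(password: str, stored: bytes, *, iterations: int = 600_000) -> bool:
    salt, digest = stored[:16], stored[16:]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, digest)  # constant-time compare

stored = hash_password("hunter2")
print(verify_password("hunter2", stored))  # True
print(verify_password("wrong", stored))    # False
```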
Design an API
REST decision table:
| Need | Pattern |
|---|---|
| Simple CRUD | REST with standard HTTP verbs |
| Complex queries with flexible fields | GraphQL |
| High-performance internal service calls | gRPC |
| Real-time bidirectional | WebSockets |
| Event notification to external consumers | Webhooks |
Pagination: Use cursor-based for large/changing datasets, offset-based only for
small/static datasets. Always include a next_cursor field.
Versioning: URL path versioning (/v1/) for public APIs, header versioning for
internal. Never break existing consumers silently.
Rate limiting: Token bucket for user-facing, fixed window for internal. Always
return Retry-After headers with 429 responses.
Handle partial failures
When services depend on other services, failures cascade. Use these patterns:
- Retry with exponential backoff + jitter - for transient failures (network blips, 503s). Cap at 3-5 retries.
- Circuit breaker - stop calling a failing dependency. States: closed (normal) -> open (failing, fast-fail) -> half-open (testing recovery).
- Idempotency keys - make retries safe. Every mutating operation should accept an idempotency key so duplicate requests produce the same result.
- Timeouts - always set them. A missing timeout is an unbounded resource leak.
Plan data consistency
For distributed data across services:
- Strong consistency needed? -> Single database, ACID transactions
- Can tolerate eventual consistency? -> Event-driven with outbox pattern
- Multi-step business process? -> Saga pattern (prefer choreography over orchestration for simple flows, orchestration for complex ones)
The outbox pattern: write the event to a local "outbox" table in the same transaction as the data change. A separate process publishes outbox events to the message broker. This guarantees at-least-once delivery without 2PC.
Anti-patterns / common mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Premature microservices | Creates distributed monolith, adds network failure modes | Start monolith, extract services when domain boundaries are proven |
| Missing indexes on query columns | Full table scans under load, cascading timeouts | Profile queries with EXPLAIN, add indexes for WHERE/JOIN/ORDER BY |
| Logging everything, alerting on nothing | Alert fatigue, real incidents get buried | Structured logs with levels, SLO-based alerting on burn rate |
| N+1 queries in loops | Linear query growth per record, kills DB under load | Batch fetches, eager loading, or dataloader pattern |
| Rolling your own auth/crypto | Subtle security bugs that go unnoticed for months | Use battle-tested libraries (bcrypt, passport, OIDC providers) |
| Designing APIs from the database out | Leaks internal structure, painful to evolve | Design from consumer needs inward, then map to storage |
| Destructive migrations without rollback | One-way door that can cause downtime | Expand-contract pattern, backward-compatible migrations |
| Caching without invalidation strategy | Stale data, cache-database drift, inconsistency | Define TTL, invalidation triggers, and cache-aside pattern upfront |
Gotchas
Expand-contract is the only safe way to remove a column - Deploying code that removes a column before the column is dropped from the database causes immediate errors. Deploying a migration that drops a column while old code still reads it causes the same. The only safe path: deploy new code that ignores the old column, then deploy the migration that drops it, then optionally clean up the code.
Connection pool exhaustion looks like a slow database - When all connections in the pool are in use, new queries queue up indefinitely. Profiling shows slow queries; the real problem is too many concurrent requests or a connection leak. Check pool metrics (active, idle, waiting) before blaming the database.
Outbox pattern requires an idempotent consumer - The outbox pattern guarantees at-least-once delivery. If your message consumer isn't idempotent, it will process the same event twice after a crash and a restart. Every consumer must be able to handle duplicate messages safely.
N+1 queries in ORM code are invisible until production load - Fetching a list of 50 orders and then calling .customer on each in a loop generates 51 queries. In development with 5 rows it's imperceptible; under production load it causes cascading timeouts. Always check query counts in integration tests and use eager loading for related data.
Circuit breakers need a half-open timeout - A circuit that opens on failure and never closes traps a service in permanent degraded mode even after the downstream dependency recovers. Always configure a half-open probe interval so the breaker tests recovery and transitions back to closed state automatically.
References
For detailed patterns and implementation guidance on specific domains, read the
relevant file from the references/ folder:
- references/schema-design.md - normalization, indexing strategies, migration patterns
- references/scalable-systems.md - distributed patterns, caching, queues, load balancing
- references/observability.md - logging, metrics, tracing, SLOs, alerting setup
- references/performance.md - profiling, query optimization, connection pooling, async
- references/security.md - auth flows, encryption, OWASP top 10, secrets management
- references/api-design.md - REST/GraphQL/gRPC conventions, versioning, pagination
- references/failure-patterns.md - circuit breakers, retries, idempotency, sagas
Only load a references file if the current task requires it - they are long and will consume context.
References
api-design.md
API Design Reference
Opinionated defaults for mid-level backend engineers. When in doubt, pick the simpler option.
1. REST Conventions
Resource Naming
| Do | Don't |
|---|---|
| GET /users | GET /getUsers |
| GET /users/42 | GET /user/42 |
| POST /users | POST /createUser |
| GET /users/42/orders | GET /getUserOrders?userId=42 |
Rules: Plural nouns always. No verbs in URLs. Lowercase, hyphen-separated (/order-items). Max two levels of nesting - flatten beyond that.
HTTP Method Semantics
| Method | Semantics | Idempotent | Safe | Body |
|---|---|---|---|---|
| GET | Read | Yes | Yes | No |
| POST | Create / trigger action | No | No | Yes |
| PUT | Full replace | Yes | No | Yes |
| PATCH | Partial update | Yes* | No | Yes |
| DELETE | Remove | Yes | No | No |

*PATCH is idempotent only when the patch document sets absolute values ("set status to shipped"), not relative ones ("increment quantity").
Default: Use PATCH for updates, not PUT. Clients rarely have the full resource.
Status Codes
Success: 200 GET/PATCH ok | 201 POST created (+ Location header) | 202 async accepted | 204 DELETE ok, no body
Client errors:
| Code | Use | Common Mistake |
|---|---|---|
| 400 | Malformed JSON, missing required field | - |
| 401 | No auth / expired token ("I don't know who you are") | Using 403 |
| 403 | Authenticated but no permission ("I know you, and no") | Using 401 |
| 404 | Resource doesn't exist | - |
| 409 | Duplicate creation, version conflict | - |
| 410 | Permanently deleted (was here, now gone) | Using 404 |
| 422 | Valid JSON but fails business rules | Using 400 |
| 429 | Rate limited (always include Retry-After) | Forgetting header |
400 vs 422: 400 = structural (bad JSON). 422 = business logic (email taken).
Nested vs Flat
Default: Flat with filters. Nest only when child can't exist without parent.
GET /users/42/orders # OK - order belongs to user
GET /orders?user_id=42 # Better for most cases

HATEOAS
Skip it. Exception: public APIs for third parties where discoverability matters.
2. GraphQL
When It Makes Sense
| Good Fit | Poor Fit |
|---|---|
| Mobile clients with varied data needs | Simple CRUD |
| BFF pattern, multiple frontend teams | Server-to-server (use gRPC) |
| Reducing over-fetching on slow networks | File uploads, HTTP caching needed |
Schema Design
Use Relay connection spec for lists. Non-nullable by default. Prefix mutations: createUser, updateUser. Return mutated object from mutations.
type User {
id: ID!
email: String!
orders(first: Int, after: String): OrderConnection!
}
type OrderConnection { edges: [OrderEdge!]! pageInfo: PageInfo! }
type OrderEdge { cursor: String! node: Order! }
type PageInfo { hasNextPage: Boolean! endCursor: String }

N+1 and DataLoader
# BAD - called once per user = N+1
resolve User.orders(user):
return db.query("SELECT * FROM orders WHERE user_id = ?", user.id)
# GOOD - DataLoader batches all keys into one query
resolve User.orders(user):
return orderLoader.load(user.id)
batch_load_orders(user_ids):
return db.query("SELECT * FROM orders WHERE user_id IN (?)", user_ids)

Query Complexity Limits
Assign cost per field. Reject queries exceeding max complexity (1000) or max depth (10). Use persisted queries (APQ) in production - prevents arbitrary execution, smaller payloads, auditable.
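A toy depth check over a nested selection tree (field names and the dict shape are hypothetical; real servers use the validation rules shipped with their GraphQL library, e.g. graphql-core, rather than hand-rolled walkers):

```python
# Recursively find the deepest selection path in a toy query AST.
def max_depth(selection: dict, depth: int = 1) -> int:
    children = selection.get("selections", [])
    if not children:
        return depth
    return max(max_depth(child, depth + 1) for child in children)

query = {"field": "user", "selections": [
    {"field": "orders", "selections": [
        {"field": "items", "selections": [{"field": "sku"}]}]}]}

MAX_DEPTH = 10
d = max_depth(query)   # user -> orders -> items -> sku
print(d)               # 4
assert d <= MAX_DEPTH, "reject query before execution"
```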
3. gRPC
When gRPC Shines
| Strength | Why |
|---|---|
| Internal service-to-service | Strict protobuf contracts, fast serialization |
| High throughput / low latency | Binary protocol, HTTP/2 multiplexing |
| Polyglot environments | Auto-generated clients for any language |
| Streaming data | First-class bidirectional streaming |
Protobuf Schema Design
syntax = "proto3";
package orders.v1;
service OrderService {
rpc GetOrder(GetOrderRequest) returns (GetOrderResponse);
rpc ListOrders(ListOrdersRequest) returns (ListOrdersResponse);
rpc StreamOrderUpdates(StreamOrderUpdatesRequest) returns (stream OrderUpdate);
}
message GetOrderRequest { string order_id = 1; }
message GetOrderResponse { Order order = 1; }
message Order {
string id = 1; string user_id = 2; OrderStatus status = 3;
repeated OrderItem items = 4; google.protobuf.Timestamp created_at = 5;
}
enum OrderStatus {
ORDER_STATUS_UNSPECIFIED = 0; ORDER_STATUS_PENDING = 1;
ORDER_STATUS_CONFIRMED = 2; ORDER_STATUS_SHIPPED = 3;
}

Rules: Always wrap in request/response messages. UNSPECIFIED = 0 first enum. Package as <service>.v1. Use google.protobuf.Timestamp.
Unary vs Streaming
Default: Unary. Server streaming for live feeds/large results. Client streaming for uploads/batching. Bidirectional for chat/collab. Only stream when you need it.
Error Model
| gRPC Code | HTTP | Use |
|---|---|---|
| INVALID_ARGUMENT | 400 | Bad input |
| NOT_FOUND | 404 | Missing resource |
| ALREADY_EXISTS | 409 | Duplicate |
| PERMISSION_DENIED | 403 | Forbidden |
| UNAUTHENTICATED | 401 | No/bad auth |
| RESOURCE_EXHAUSTED | 429 | Rate limited |
| INTERNAL | 500 | Server bug |
| UNAVAILABLE | 503 | Temporary, retry |
Browser Clients
gRPC doesn't work natively in browsers. Use gRPC-Web proxy (Envoy) or Connect protocol (Buf). Or just use REST/GraphQL for frontend, gRPC internally.
4. Pagination
Cursor vs Offset Decision
| Factor | Cursor-Based | Offset-Based |
|---|---|---|
| Large dataset perf | O(1) seek | O(n) scan |
| Consistency under writes | Stable | Items shift |
| Random page access | No | Yes (?page=5) |
| Default for APIs | Yes | Admin UIs only |
Cursor Implementation
# Cursor = base64(json({"id": last_id, "sort": sort_value}))
GET /orders?limit=20&cursor=eyJpZCI6NDIsInNvcnQiOiIyMDI0LTAxLTE1In0=
# Server:
decode cursor -> {id: 42, sort: "2024-01-15"}
SELECT * FROM orders WHERE (created_at, id) < ('2024-01-15', 42)
ORDER BY created_at DESC, id DESC LIMIT 21; -- +1 to check hasNextPage

Always include primary key in cursor alongside sort key to handle ties.
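The opaque-cursor encoding from the pseudocode above can be sketched in a few lines (the `id`/`sort` field names are the ones assumed above; adapt to your schema):

```python
import base64
import json

def encode_cursor(last_id: int, sort_value: str) -> str:
    # Serialize the position, then base64 it so clients treat it as opaque.
    raw = json.dumps({"id": last_id, "sort": sort_value}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))

c = encode_cursor(42, "2024-01-15")
print(decode_cursor(c))  # {'id': 42, 'sort': '2024-01-15'}
```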
Page Size & Response
Enforce server-side max: limit = min(request.limit || 20, 100)
{ "data": [...], "pagination": { "next_cursor": "...", "has_next_page": true, "page_size": 20 } }

5. Versioning
Strategy
| Strategy | Pros | Cons | Use When |
|---|---|---|---|
| URL path /v1/ | Obvious, easy routing/caching | URL pollution | Default |
| Header Accept-Version | Clean URLs | Hidden, hard to debug | Internal w/ sophisticated clients |
| Query param | Easy browser testing | Caching issues | Almost never |
Breaking vs Non-Breaking
Safe: Add response field, add optional param, add endpoint, widen validation.
Breaking (new version): Remove/rename field, change type, change URL, tighten validation.
Sunset Process
HTTP/1.1 200 OK
Deprecation: true
Sunset: Sat, 01 Mar 2025 00:00:00 GMT
Link: <https://api.example.com/v2/users>; rel="successor-version"

External APIs: 6+ months notice. Internal: 1-3 months. Never support more than 2 active versions.
Multiple Versions
Route at controller layer, share service/domain logic. V1 and V2 controllers call same service, apply different serializers.
6. Rate Limiting
Algorithms
| Algorithm | Behavior | Pros | Cons |
|---|---|---|---|
| Token Bucket | Fills at steady rate, request takes token | Allows bursts | Slightly complex |
| Sliding Window | Rolling time window count | Accurate | Memory cost |
| Fixed Window | Count per interval | Simple | 2x spike at boundaries |
Default: Token bucket. Most gateways (Kong, Envoy) use it.
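A minimal in-memory token-bucket sketch (single process, illustrative only; production limiters live at the gateway or in a shared store like Redis). The injectable clock is an assumption added here to make the behavior deterministic:

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Frozen clock: a 3-token burst is allowed, then requests are rejected.
t = [0.0]
bucket = TokenBucket(capacity=3, refill_per_sec=1, clock=lambda: t[0])
print([bucket.allow() for _ in range(5)])  # [True, True, True, False, False]
t[0] += 2.0                                # two seconds later: two tokens refilled
print(bucket.allow())                      # True
```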
Headers and 429 Response
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1710432000
# On 429:
Retry-After: 30 # ALWAYS include this
Content-Type: application/problem+json
{"type":"https://api.example.com/errors/rate-limited","title":"Rate limit exceeded","status":429}

Scopes and Tiers
| Scope | Use Case |
|---|---|
| Per API key / user | Default, fairest |
| Per IP | Unauthenticated endpoints (login, signup) |
| Per endpoint | Protect expensive operations |
| Global | Infrastructure protection |
Tier limits in API key/user record. Enforce at gateway, not application code. Example: free=100/hr, pro=5000/hr, enterprise=50000/hr.
7. API Design Checklist
Before Shipping Any Endpoint
- Resource names are plural nouns, lowercase, hyphen-separated
- Consistent field naming (snake_case or camelCase - pick one, stick with it)
- Dates in ISO 8601 UTC (2024-01-15T09:30:00Z)
- Errors follow RFC 7807 format with machine-readable code per field
- 5xx errors never expose internals (no stack traces, SQL, IPs)
- All list endpoints paginated with enforced max page size
- Filtering: ?status=shipped&created_after=2024-01-01
- Sorting: ?sort=-created_at,+total (prefix for direction)
- Field selection: ?fields=id,name,email
- Bulk endpoints cap at 100 items, return per-item status (207)
- POST endpoints accept Idempotency-Key header (UUIDv4, stored 24h)
- Auth on all endpoints except health checks
- Authorization at resource level, not just endpoint level
- Rate limiting configured per scope
- Request body size limits enforced
- No secrets in URLs - use headers or body
- OpenAPI spec auto-generated from code
- Every endpoint has request/response examples documented
RFC 7807 Error Format
{"type":"https://api.example.com/errors/validation-error","title":"Validation Error",
"status":422,"detail":"Email is invalid.","errors":[{"field":"email","code":"invalid_format"}]}

Quick Decision Defaults
| Decision | Default |
|---|---|
| Frontend API style | REST (GraphQL if multiple clients with varied needs) |
| Service-to-service | gRPC |
| Pagination | Cursor-based |
| Versioning | URL path /v1/ |
| Rate limiting | Token bucket |
| Error format | RFC 7807 |
| Field naming | snake_case for JSON |
| Timestamps | ISO 8601, always UTC |
| IDs | UUIDs (not auto-increment in public APIs) |
| Auth | Bearer token in Authorization header |
failure-patterns.md
Failure Handling Patterns
Distributed systems fail in partial, unpredictable ways. These patterns help services degrade gracefully instead of cascading failures.
1. Retry Patterns
| Strategy | Delay Formula | Best For |
|---|---|---|
| Simple retry | Fixed delay (e.g., 1s) | Quick transient blips |
| Exponential backoff | base * 2^attempt | Sustained outages |
| Backoff + jitter | random(0, base * 2^attempt) | Production default |
Retry (transient): 503, 429 (with backoff), connection reset, timeout, 502.
Do NOT retry (permanent): 400, 401/403, 404, 409, 422.
Retry budgets: Cap retries at 10-20% of total requests to prevent retry amplification.
import random, time

class RetryableError(Exception):
    """Marker for errors worth retrying (e.g. 503, timeout, connection reset)."""

def retry_with_backoff(op, max_retries=3, base_ms=100, max_ms=10000):
    for attempt in range(max_retries + 1):
        try:
            return op()
        except RetryableError:
            if attempt == max_retries:
                raise
            delay_ms = min(base_ms * (2 ** attempt), max_ms)
            time.sleep(random.uniform(0, delay_ms) / 1000)  # Full jitter

2. Circuit Breaker
Stops calling a dependency known to be failing. Three states:
CLOSED --[failure threshold]--> OPEN --[reset timeout]--> HALF-OPEN
HALF-OPEN --[trial succeeds]--> CLOSED | --[trial fails]--> OPEN

| Parameter | Typical Default |
|---|---|
| Failure threshold | 5 failures in 60s or 50% failure rate |
| Reset timeout | 30s |
| Half-open trial count | 3 |
When to use: External APIs, databases under load, any remote dependency with expected failures. NOT for local in-process calls.
class CircuitBreaker:
    def __init__(self, threshold=5, reset_ms=30000):
        self.threshold, self.reset_ms = threshold, reset_ms
        self.state, self.failures, self.last_fail = CLOSED, 0, None
def call(self, op):
if self.state == OPEN:
if now() - self.last_fail > self.reset_ms:
self.state = HALF_OPEN
else:
raise CircuitOpenError("Failing fast")
try:
result = op()
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
if self.state == HALF_OPEN: self.state = CLOSED
self.failures = 0
def _on_failure(self):
self.failures += 1
self.last_fail = now()
if self.failures >= self.threshold: self.state = OPEN

Circuit breaker + retry are complementary: Retries handle transient blips; circuit breaker handles sustained outages. Retries run inside the circuit breaker: Request -> CB -> Retry -> Downstream.
3. Idempotency
Retries without idempotency cause duplicate side effects (e.g., double charges).
| Type | Description | Example |
|---|---|---|
| Natural | Inherently idempotent | SET balance = 100, DELETE WHERE id = X |
| Artificial | Made idempotent via keys | CHARGE $50 with key txn-abc-123 |
Prefer natural idempotency (absolute values over deltas). Use idempotency keys when the operation is inherently non-idempotent.
def process_payment(request):
key = request.headers["Idempotency-Key"] # Client-generated UUID
existing = db.query("SELECT result FROM idempotency_store WHERE key = ?", key)
if existing:
return existing.result # Return cached response
result = charge_payment(request.amount)
db.execute("INSERT INTO idempotency_store (key, result, created_at) VALUES (?, ?, now())", key, result)
return result

Database-level: Use INSERT ... ON CONFLICT DO NOTHING with idempotency key as primary/unique key.
TTL: Expire idempotency keys after 24-72 hours to prevent unbounded storage growth.
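The database-level variant can be demonstrated with SQLite's dialect (table shape follows the example above; Postgres spells the same thing `INSERT ... ON CONFLICT DO NOTHING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idempotency_store (key TEXT PRIMARY KEY, result TEXT)")

def record_once(key: str, result: str) -> bool:
    # The unique key makes the duplicate insert a silent no-op.
    cur = conn.execute(
        "INSERT OR IGNORE INTO idempotency_store (key, result) VALUES (?, ?)",
        (key, result))
    return cur.rowcount == 1  # True only the first time this key is seen

print(record_once("txn-abc-123", "charged"))  # True
print(record_once("txn-abc-123", "charged"))  # False - duplicate ignored
```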
4. Timeouts
| Type | Controls | Typical Default |
|---|---|---|
| Connect timeout | TCP connection establishment | 1-3s |
| Read timeout | Waiting for response data | 5-30s |
| Total timeout | End-to-end including retries | 30-60s |
Without timeouts: Slow dependency blocks threads, pool exhausts, your service stops responding, callers cascade fail. A single missing timeout can take down an entire service mesh.
Deadline Propagation
In chains (A -> B -> C), propagate a shrinking deadline so downstream knows time remaining:
def handle_request(request):
deadline = request.header("X-Deadline") or (now() + DEFAULT_TIMEOUT)
remaining = deadline - now()
if remaining <= 0:
return error(504, "Deadline exceeded")
downstream_timeout = remaining - LOCAL_BUFFER_MS
return call_service_b(timeout=downstream_timeout, headers={"X-Deadline": deadline})

| Scenario | Suggested Timeout |
|---|---|
| Internal service RPC | 3-5s |
| Database query | 5-10s |
| External third-party API | 10-30s |
| User-facing API total | 10-30s |
Always set timeouts explicitly. Never rely on library defaults.
5. Bulkhead Pattern
Isolates failures so one degraded dependency does not exhaust shared resources.
# BAD: Shared pool - slow payment service blocks all 100 threads
executor = ThreadPoolExecutor(max_workers=100)
# GOOD: Isolated pools per dependency
payment_pool = ThreadPoolExecutor(max_workers=30)
email_pool = ThreadPoolExecutor(max_workers=20)
inventory_pool = ThreadPoolExecutor(max_workers=30)

Pool sizing: target_throughput * avg_latency_seconds + buffer (e.g., 100 req/s at 200ms = 30 threads).
Also apply to DB connection pools: separate pools for critical (order processing) vs non-critical (analytics) paths.
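The sizing rule above is Little's law plus headroom; a quick sketch with the document's numbers (the buffer of 10 is an assumed default):

```python
# pool = throughput * latency (concurrent requests at steady state) + headroom
def pool_size(throughput_rps: float, avg_latency_s: float, buffer: int = 10) -> int:
    in_flight = throughput_rps * avg_latency_s
    return int(in_flight) + buffer

print(pool_size(100, 0.200))  # 100 req/s * 200ms = 20 in flight, +10 buffer = 30
```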
6. Saga Pattern
Manages distributed transactions across services where ACID is not possible.
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coordination | Services react to events | Central coordinator directs |
| Best for | Simple sagas (2-4 steps) | Complex sagas (5+ steps) |
| Visibility | Hard to trace | Easy to see full flow |
Use when: Multi-service business processes, cross-database operations. NOT for single-database operations.
class OrderSaga:
steps = [
SagaStep(action=reserve_inventory, compensate=release_inventory),
SagaStep(action=charge_payment, compensate=refund_payment),
SagaStep(action=schedule_shipping, compensate=cancel_shipping),
]
def execute(self, order):
completed = []
for step in self.steps:
try:
step.action(order)
completed.append(step)
except Exception as e:
for s in reversed(completed): # Compensate in reverse
try:
s.compensate(order)
except Exception:
alert_ops_team(order, s) # Manual intervention needed
raise SagaFailed(e)

Key rules: Compensations must be idempotent. Persist saga state so it survives coordinator crashes.
7. Outbox Pattern
Solves the dual-write problem: updating a database AND publishing an event atomically.
def create_order(order):
with db.transaction(): # Same transaction guarantees consistency
db.execute("INSERT INTO orders (id, status) VALUES (?, ?)", order.id, 'created')
db.execute("INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
uuid4(), 'OrderCreated', serialize(order))

Publishing approaches:
| Approach | How | Trade-off |
|---|---|---|
| Polling publisher | Query outbox for unpublished rows | Simple; adds DB load |
| CDC (e.g., Debezium) | Read DB transaction log | Near real-time; more infra |
def publish_outbox(): # Polling publisher
events = db.query("SELECT * FROM outbox WHERE published_at IS NULL ORDER BY created_at LIMIT 100")
for event in events:
message_broker.publish(event.event_type, event.payload)
db.execute("UPDATE outbox SET published_at = now() WHERE id = ?", event.id)

Provides at-least-once delivery - consumers must be idempotent.
8. Graceful Degradation
Shed non-critical work under pressure to preserve core functionality.
| Pressure Level | Response |
|---|---|
| Normal | Full functionality |
| Elevated | Disable recommendations, analytics |
| High | Return cached/fallback responses |
| Critical | Read-only mode, reject new work |
def get_product(product_id):
    product = product_service.get(product_id)  # Always execute
    if feature_flags.enabled("recommendations"):
        product.recs = recommendation_service.get(product_id)
    else:
        product.recs = []  # Fallback
    return product

# Load shedding
def handle(request):
    if active_requests.get() >= MAX_CONCURRENT:
        return Response(503, headers={"Retry-After": "5"})
    active_requests.increment()
    try:
        return process(request)
    finally:
        active_requests.decrement()

Fallback chain: Try live service -> Try cache -> Return generic default.
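That fallback chain can be sketched as a small helper. This is a minimal sketch: `live_service` and `cache` are hypothetical clients passed in by the caller, not part of any specific framework.

```python
import logging

logger = logging.getLogger("degradation")

def get_recommendations(product_id, live_service, cache, default=()):
    """Fallback chain: live service -> last-known-good cache -> generic default."""
    try:
        recs = live_service.get(product_id)       # 1. try the live dependency
        cache.set(f"recs:{product_id}", recs)     # refresh cache on success
        return recs
    except Exception:
        logger.warning("recommendation service failed, falling back to cache")
    cached = cache.get(f"recs:{product_id}")      # 2. serve stale-but-useful data
    if cached is not None:
        return cached
    return list(default)                          # 3. degrade to a generic default
```

Each step trades freshness for availability; the caller never sees the dependency failure.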
Pattern Selection Guide
| Problem | Primary Pattern | Complementary |
|---|---|---|
| Transient network errors | Retry + backoff | Idempotency, timeouts |
| Dependency frequently down | Circuit breaker | Retry (inside CB), fallbacks |
| Duplicate side effects | Idempotency keys | DB unique constraints |
| Slow dependency hangs service | Timeouts | Circuit breaker, bulkhead |
| One bad dep takes down all | Bulkhead | Circuit breaker, timeouts |
| Multi-service transaction | Saga | Idempotency (compensations) |
| Reliable event publishing | Outbox pattern | Idempotency (consumers) |
| System overloaded | Graceful degradation | Load shedding, queue leveling |
observability.md
Observability Reference
Observability is not monitoring. Monitoring tells you when something is broken. Observability tells you why. The three pillars - logs, metrics, traces - are useless in isolation. They work together. Invest in correlation between them from day one.
1. Structured Logging
Use JSON Logs - Always
Plain text logs are for humans staring at a terminal. JSON logs are for systems that need to search, filter, and aggregate across millions of entries. Every service should emit structured JSON.
{
"timestamp": "2025-03-14T10:23:45.123Z",
"level": "ERROR",
"service": "payment-service",
"trace_id": "abc123def456",
"request_id": "req-789",
"user_id": "u-42",
"method": "POST",
"path": "/api/v1/charges",
"status": 500,
"duration_ms": 234,
"error": "upstream timeout from billing-gateway",
"message": "Failed to process charge"
}

Log Levels
| Level | When to Use | Example |
|---|---|---|
| `TRACE` | Fine-grained debugging, never in production | Cache key lookup details |
| `DEBUG` | Development diagnostics, off by default in prod | SQL query parameters |
| `INFO` | Normal operations worth recording | Request completed, job started/finished |
| `WARN` | Recoverable issues that need attention | Retry succeeded, fallback used, pool low |
| `ERROR` | Failures that need investigation | Request failed, dependency timeout |
| `FATAL` | Process cannot continue | DB connection lost, config missing |
Opinion: Default to INFO in production. If you are logging at DEBUG in production to diagnose an issue, you have a tracing or metrics gap.
Correlation IDs
Every request entering your system gets a correlation/trace ID. Propagate it through every service call.
Client -> API Gateway (generates trace_id: "abc123")
-> Service A (logs with trace_id: "abc123", span_id: "span-1")
-> Service B (logs with trace_id: "abc123", span_id: "span-2")
-> Service C (logs with trace_id: "abc123", span_id: "span-3")

Propagation checklist:
- Extract `X-Request-ID` or `traceparent` header on ingress
- Generate one if missing (at the edge only)
- Attach to thread-local / async context
- Include in every log line automatically (middleware, not manual)
- Forward in all outbound HTTP/gRPC/message headers
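The checklist above fits in a few lines using Python's `contextvars`. This is a minimal sketch, not any specific framework's API; the function names are illustrative.

```python
import contextvars
import json
import uuid

# Holds the current request's trace ID for the duration of the request,
# even across async task boundaries.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def ingress_middleware(headers):
    """Extract the incoming trace ID, or generate one at the edge."""
    trace_id = headers.get("X-Request-ID") or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id

def log(level, message, **fields):
    """Every log line picks up the trace ID automatically - no manual plumbing."""
    record = {"level": level, "trace_id": trace_id_var.get(),
              "message": message, **fields}
    print(json.dumps(record))

def outbound_headers():
    """Forward the trace ID on every outbound call."""
    return {"X-Request-ID": trace_id_var.get()}
```

Application code never touches the trace ID directly; the middleware sets it once and the logger and HTTP client read it from context.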
What to Log vs What NOT to Log
| Log This | Never Log This |
|---|---|
| Request ID, trace ID | Passwords, tokens, API keys |
| User ID (pseudonymized if needed) | Credit card numbers, SSNs |
| HTTP method, path, status code | Full request/response bodies |
| Duration in milliseconds | PII (email, phone, address) |
| Error messages and codes | Session tokens or auth headers |
| Upstream service and latency | Database connection strings |
| Queue depth, retry count | Internal IP addresses (in public logs) |
Opinion: If you log full request bodies "for debugging," you will eventually leak PII into your log aggregator. Log a content hash or size instead.
Log Aggregation Patterns
- Ship logs via a sidecar or agent (Fluentd, Vector, Filebeat) - not directly from your app
- Use a structured pipeline: App -> Agent -> Buffer (Kafka) -> Aggregator (Elasticsearch/Loki)
- Set retention policies per log level: ERROR=90d, WARN=30d, INFO=14d
- Index on: `trace_id`, `service`, `level`, `status`, `user_id`
2. Metrics
RED Method (for Services)
Use RED for any service that handles requests:
| Signal | What It Measures | Metric Type | Example Metric Name |
|---|---|---|---|
| Rate | Requests per second | Counter | http_requests_total |
| Errors | Failed requests per second | Counter | http_requests_errors_total |
| Duration | Time per request | Histogram | http_request_duration_seconds |
USE Method (for Resources)
Use USE for infrastructure - CPU, memory, disk, connections:
| Signal | What It Measures | Example |
|---|---|---|
| Utilization | % of resource capacity in use | CPU at 72%, memory at 85% |
| Saturation | Work waiting (queue depth) | 14 threads waiting for DB connection |
| Errors | Resource-level error events | Disk I/O errors, network drops |
Metric Types
| Type | Use When | Example |
|---|---|---|
| Counter | Value only goes up (resets on restart) | Total requests, total errors |
| Gauge | Value goes up and down | Current connections, queue depth |
| Histogram | You need distribution (p50, p95, p99) | Request latency, response size |
| Summary | Like histogram but calculated client-side | Avoid - prefer histograms for flexibility |
Opinion: Default to histograms for latency. If you use averages for latency, you are hiding your tail - p99 problems are invisible in averages.
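A quick illustration of why, using nothing beyond the standard library: a handful of slow outliers barely moves the mean, while the p99 exposes them immediately.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 95 requests at 20ms plus 5 outliers at 2000ms:
latencies_ms = [20] * 95 + [2000] * 5
mean = sum(latencies_ms) / len(latencies_ms)  # 119.0 - "looks fine"
p50 = percentile(latencies_ms, 50)            # 20   - typical request is fast
p99 = percentile(latencies_ms, 99)            # 2000 - the tail the mean hides
```

One in twenty requests takes two full seconds, and the average never shows it.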
Naming Conventions
<namespace>_<subsystem>_<name>_<unit>
payment_service_http_request_duration_seconds # Good
payment_service_orders_processed_total # Good
requestTime # Bad: no namespace, no unit
PaymentService.http # Bad: dots and camelCase

Rules:
- Use snake_case
- Include the unit as a suffix (`_seconds`, `_bytes`, `_total`)
- Use `_total` suffix for counters
- Prefix with service/namespace
- Use base units (seconds not milliseconds, bytes not megabytes)
Cardinality Traps
High-cardinality labels will kill your metrics backend. Every unique label combination creates a new time series.
| Label | Cardinality | Safe? |
|---|---|---|
| `method` | ~5 | Yes |
| `status_code` | ~5 (grouped) | Yes |
| `endpoint` | ~50 | Usually |
| `customer_tier` | ~4 | Yes |
| `user_id` | Millions | NO |
| `request_id` | Infinite | NO |
| `email` | Millions | NO |
Rule of thumb: If a label has more than ~100 unique values, do not put it on a metric. Use logs or traces for high-cardinality data.
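One way to enforce the rule in code is to normalize labels before they reach the metrics client. A sketch, assuming a plain dict of labels; `normalize_route` is a naive illustration - real web frameworks usually expose the route template directly, which is the better source.

```python
def normalize_route(path):
    """Collapse ID segments: /users/42/orders/7 -> /users/{id}/orders/{id}."""
    return "/".join("{id}" if seg.isdigit() else seg for seg in path.split("/"))

def metric_labels(method, status_code, path):
    """Bound label cardinality before attaching labels to a metric."""
    return {
        # Unknown verbs collapse into one bucket instead of new series.
        "method": method if method in {"GET", "POST", "PUT", "PATCH", "DELETE"} else "OTHER",
        # 200/201/204 all become "2xx" - ~5 values instead of dozens.
        "status_class": f"{status_code // 100}xx",
        # Route template, never the raw path with IDs embedded in it.
        "endpoint": normalize_route(path),
    }
```

High-cardinality detail (the actual user ID, the exact path) belongs on the log line or trace span, keyed by `trace_id`.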
3. Distributed Tracing
Core Concepts
Trace (trace_id: "abc123")
+-- Span A: API Gateway (span_id: "s1", 245ms)
| +-- Span B: Auth Service (span_id: "s2", parent: "s1", 12ms)
| +-- Span C: Order Service (span_id: "s3", parent: "s1", 220ms)
| +-- Span D: DB Query (span_id: "s4", parent: "s3", 45ms)
| +-- Span E: Cache Lookup (span_id: "s5", parent: "s3", 3ms)

- Trace: The full journey of a request across all services
- Span: A single unit of work within a trace (one service call, one DB query)
- Context propagation: Passing trace_id + span_id between services via headers (`traceparent`)
OpenTelemetry Basics
OpenTelemetry (OTel) is the standard. Use it.
tracer = trace.get_tracer("payment-service")  # from opentelemetry import trace

def process_order(order):
    span = tracer.start_span("process-order")
    span.set_attribute("order.id", order.id)
    try:
        result = charge_payment(order)
        span.set_status(Status(StatusCode.OK))
        return result
    except Exception as error:
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(error)
        raise
    finally:
        span.end()

Prefer auto-instrumentation for HTTP/gRPC/DB clients. Add manual spans for business logic.
Sampling Strategies
You cannot trace 100% of production traffic. Sampling is required.
| Strategy | How It Works | Trade-off |
|---|---|---|
| Head-based | Decide at ingress whether to sample | Simple; misses interesting traces |
| Tail-based | Collect all spans, decide after trace completes | Catches errors and slow traces; costly |
| Rate-limited | Sample N traces per second | Predictable cost; biased sample |
Opinion: Start with head-based sampling at 5-10%. Add tail-based sampling for errors and high-latency traces. Tail-based is harder to operate but catches the traces you actually need.
When Tracing Matters Most
- Multi-service request chains (3+ hops)
- Async workflows (message queues, event-driven flows)
- Diagnosing latency - "where is the time spent?"
- Debugging fan-out/fan-in patterns
- Verifying retry and fallback behavior
Tracing is less useful for single-service, synchronous request-response. Logs and metrics are enough there.
4. SLOs and SLIs
Defining SLIs
An SLI (Service Level Indicator) is a measurement of your service's behavior that users care about.
| SLI Type | Definition | Measurement |
|---|---|---|
| Availability | % of requests that succeed | successful_requests / total_requests |
| Latency | % of requests faster than threshold | requests < 300ms / total_requests |
| Throughput | Requests processed per time window | requests_per_second |
| Correctness | % of requests returning correct results | correct_responses / total_responses |
Setting Realistic SLOs
An SLO (Service Level Objective) is a target for your SLI, measured over a time window.
SLO: 99.9% of requests succeed within 300ms, 30-day rolling window
Error budget: At 100 rps -> 259M requests/month -> 0.1% = 259,200 failures allowed

| SLO Target | Downtime per Month | Realistic For |
|---|---|---|
| 99% | 7.3 hours | Internal tools, batch systems |
| 99.9% | 43.8 minutes | Most production APIs |
| 99.95% | 21.9 minutes | Customer-facing critical paths |
| 99.99% | 4.4 minutes | Payment processing, auth services |
| 99.999% | 26.3 seconds | Almost nobody needs this |
Opinion: Start with 99.9%. If you have not operated a service at 99.9% successfully, do not set a 99.99% target - you will just create noise.
Error Budgets and Burn Rate
- Error budget = 1 - SLO target. For 99.9% SLO, budget is 0.1%
- Burn rate = how fast you consume budget. Rate of 1 = expected pace
Burn rate = error_rate_observed / error_rate_allowed
Example: 0.5% observed / 0.1% allowed = 5x burn -> 30-day budget gone in 6 days

SLO-Based vs Threshold-Based Alerting
| Approach | Alert When | Problem |
|---|---|---|
| Threshold | Error rate > 1% for 5 minutes | Too many false alarms or missed issues |
| SLO-based | Burn rate threatens to exhaust budget | Smarter - accounts for sustained impact |
SLO-based alerting (recommended):
- Fast burn (14.4x): Alert in 2 min, page immediately - major incident
- Slow burn (3x): Alert in 1 hour, create ticket - degradation trend
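The burn-rate rules above can be expressed directly. A sketch under the usual convention: 14.4x is the fast-burn threshold (spending ~2% of a 30-day budget in one hour), and requiring both a short and a long window to exceed it keeps a brief blip from paging anyone.

```python
def burn_rate(observed_error_rate, slo_target):
    """Burn rate = observed error rate / allowed error rate (1 - SLO)."""
    return observed_error_rate / (1.0 - slo_target)

def should_page(short_window_rate, long_window_rate,
                slo_target=0.999, threshold=14.4):
    """Fast-burn page: BOTH windows must exceed the threshold, so a
    transient spike alone never pages, but a sustained burn does."""
    return (burn_rate(short_window_rate, slo_target) >= threshold and
            burn_rate(long_window_rate, slo_target) >= threshold)

# The example from above: 0.5% observed against a 99.9% SLO is a 5x burn.
# burn_rate(0.005, 0.999) -> ~5.0
```

The slow-burn ticket alert is the same check with a lower threshold (e.g. 3x) over longer windows.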
Example SLO for a Web Service
service: order-api
slos:
- name: availability
sli: successful_responses / total_responses # (non-5xx)
target: 99.9%
window: 30d
- name: latency
sli: responses_under_300ms / total_responses
target: 99.0%
window: 30d
- name: latency-critical
sli: responses_under_1000ms / total_responses
target: 99.9%
window: 30d

5. Alerting
Alert Fatigue - Causes and Prevention
| Cause | Fix |
|---|---|
| Alerts on symptoms, not impact | Alert on SLO burn rate, not CPU spikes |
| No ownership | Every alert has a team owner |
| Duplicate alerts | Deduplicate and group related alerts |
| Flapping alerts | Add hysteresis (alert on sustained condition) |
| Alerts nobody acts on | Delete them. If nobody investigates, it is noise |
Opinion: If your on-call gets paged more than twice a week outside business hours, your alerts are broken. Fix the alerts, not the humans.
Actionable Alerts Checklist
Every alert MUST have:
- What - clear description of the symptom
- Impact - who/what is affected and how severely
- Runbook link - step-by-step investigation/mitigation guide
- Owner - which team owns this alert
- Severity - page vs ticket vs notification
- Dashboard link - relevant dashboard for context
- Expected behavior - what "normal" looks like for comparison
If an alert does not have a runbook, it is not ready for production.
Severity Levels
| Severity | Response | Example | SLA |
|---|---|---|---|
| P1 | Page immediately | Service down, data loss, security breach | 15 min ack |
| P2 | Page during hours | Degraded performance, partial outage | 1 hour ack |
| P3 | Ticket, next day | Elevated error rate, slow burn on budget | Next biz day |
| P4 | Ticket, backlog | Non-critical warning, capacity planning | 1 week |
Escalation Patterns
T+0 min -> Primary on-call paged
T+15 min -> No ack? Secondary on-call paged
T+30 min -> No ack? Engineering manager paged
T+60 min -> No resolution? Incident commander engaged

On-Call Best Practices
- Rotate weekly, never longer. Handoff meetings at each rotation
- Track on-call burden: pages per shift, time-to-ack, hours spent
- Blameless postmortems for every P1 and repeated P2
- Budget 30% of sprint capacity for on-call follow-up work
6. Dashboards
The Four Golden Signals Dashboard
Every service gets a dashboard with these four panels:
| Signal | Metric | Visualization |
|---|---|---|
| Latency | p50, p95, p99 request duration | Time series graph |
| Traffic | Requests per second | Time series graph |
| Errors | Error rate (5xx / total) | Time series graph |
| Saturation | CPU, memory, connection pool usage | Gauge + graph |
Service Dashboard Template
Row 1: [SLO Status] [Error Budget Remaining] [Apdex Score]
Row 2: [Request Rate] [Error Rate by Type] [Latency p50/p95/p99]
Row 3: [Upstream Latency by Service] [DB Query Duration] [Cache Hit Rate]
Row 4: [CPU/Memory] [Connection Pools] [Queue Depth]
Row 5: [Orders/min] [Revenue Impact] [Active Users]

Dashboard Anti-Patterns
| Anti-Pattern | Why It Hurts | Fix |
|---|---|---|
| 30+ panels on one dashboard | Information overload, nothing stands out | 8-12 panels max, link to details |
| Vanity metrics (total users) | Feels good, reveals nothing about health | Show rate-of-change, not totals |
| No time range context | Spikes look alarming without baseline | Add week-over-week overlay |
| Dashboard per person | Inconsistent views, duplicated work | One canonical dashboard per service |
| No annotations | Deploys and incidents invisible on graphs | Add deploy markers and incident bands |
Quick Reference: Connecting the Three Pillars
"Orders are slow" ->
1. Dashboard: p99 latency spiked at 14:32 (METRICS)
2. Filter by endpoint: POST /orders is slow (METRICS)
3. Find traces for slow /orders requests (TRACES)
4. Trace shows 2s in inventory-service (TRACES)
5. Search logs for inventory-service with trace_id (LOGS)
6. Log: "connection pool exhausted, waited 1.8s" (LOGS)
-> Root cause: DB pool too small after traffic increase

Metrics tell you something is wrong. Traces tell you where. Logs tell you why.
performance.md
Performance Reference
Practical guide for diagnosing and fixing backend performance problems. Assumes you have basic observability in place. If not, set that up first - you cannot optimize what you cannot measure.
1. Performance Diagnosis Workflow
Follow this order. Do not skip to profiling code before ruling out infrastructure.
1. CHECK METRICS (dashboards, APM)
|
2. IDENTIFY BOTTLENECK CATEGORY
+-- CPU bound? -> high CPU %, low I/O wait
+-- Memory bound? -> high RSS, swap usage, OOM kills
+-- I/O bound? -> high iowait %, disk queue depth
+-- Network bound? -> high latency to dependencies, packet loss
|
3. DRILL DOWN into the specific category (see sections below)
|
4. REPRODUCE under controlled load -> FIX -> VERIFY under same load

Flame graphs answer "where is CPU time spent?" Read bottom-up. Wide bars at the top = leaf functions consuming time. Generate with perf (Linux), async-profiler (JVM), or py-spy (Python).
APM checklist:
- RED metrics visible (Rate, Errors, Duration) per endpoint
- Slow transaction traces enabled (threshold: 2x your p95)
- Database query tracking with EXPLAIN capture
- External call latency tracked per dependency
Load testing phases:
| Phase | What to test | Tool examples |
|---|---|---|
| Smoke | Works at 1-5 users? | curl, httpie |
| Load | Expected traffic (1x baseline) | k6, Locust, wrk |
| Stress | 2-5x traffic, find breaking point | k6, Gatling |
| Soak | Sustained hours, find leaks | k6 long duration |
Always load test against production-like dataset sizes.
2. Query Optimization
Reading EXPLAIN plans
Key fields (PostgreSQL; similar concepts in MySQL):
| EXPLAIN field | Good sign | Bad sign |
|---|---|---|
| `Seq Scan` | On tiny tables (<1000 rows) | On large tables - missing index |
| `Index Scan` | Selective index | Index returning >10% of rows |
| `Nested Loop` | Inner side uses index | Inner side is Seq Scan |
| `Hash Join` | Fits in memory | Spilling to disk |
| Rows (est vs actual) | Close match | Off by 10x+ - run ANALYZE |
Always use EXPLAIN (ANALYZE, BUFFERS) for actual execution data.
Common anti-patterns
| Anti-pattern | Fix |
|---|---|
| `SELECT *` | Select only needed columns |
| N+1 queries | JOIN, batch fetch, or dataloader |
| Missing index on WHERE/JOIN/ORDER BY | Add targeted index |
| Implicit type cast (`WHERE varchar_col = 123`) | Match types explicitly |
| `LIKE '%term%'` | Full-text search or trigram index |
| Function on indexed column (`WHERE YEAR(col) = 2024`) | Use range predicate |
| `DISTINCT` hiding a bad JOIN | Fix the JOIN |
Query rewriting techniques
-- SLOW: correlated subquery runs per row
SELECT o.id FROM orders o
WHERE o.total > (SELECT AVG(total) FROM orders WHERE user_id = o.user_id);
-- FAST: window function, single pass
SELECT id FROM (
SELECT id, total, AVG(total) OVER (PARTITION BY user_id) AS avg_total
FROM orders
) sub WHERE total > avg_total;

Batch operations: Replace N individual inserts with multi-row INSERT INTO t (col) VALUES ('a'), ('b'), ('c'). Use prepared statements for any repeated query.
3. Connection Pooling
Why it matters
Opening a connection is expensive: TCP + TLS handshake + auth + server memory. Without pooling: 100 concurrent requests = 100 connections opened/closed. With pooling: 100 concurrent requests = 10-20 reused connections.
Pool sizing formula
connections = (core_count * 2) + spindle_count

For a 4-core server with SSDs: (4 * 2) + 0 = 8 connections.
Most teams set pool size too high. 50-100 per app instance is almost always wrong. More connections = more lock contention and memory on the DB. Start small, measure.
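Both rules as code. The formulas come from the text above; the `per_instance_pool` helper is an added assumption to make the cross-instance point concrete - the sum of every app instance's pool must stay under the database's `max_connections`.

```python
def pool_size(core_count, spindle_count=0):
    """Starting point: connections = (core_count * 2) + spindle_count."""
    return core_count * 2 + spindle_count

def per_instance_pool(db_max_connections, app_instances, reserved=5):
    """Divide the DB's max_connections across app instances, keeping a
    small reserve for admin sessions and migrations."""
    return max(1, (db_max_connections - reserved) // app_instances)

# 4-core DB server, SSDs only:              pool_size(4) == 8
# max_connections=100 shared by 10 apps:    per_instance_pool(100, 10) == 9
```

Treat both numbers as starting points to measure against, not targets.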
Connection leak detection
Symptoms: pool exhaustion after hours, connection count grows monotonically, works after restart then degrades.
pool.on('acquire', (conn) => {
log.debug('acquired', conn.id, stack_trace())
setTimeout(() => {
if (conn.still_acquired)
log.warn('possible leak', conn.id, conn.acquired_stack_trace)
}, 30_000)
})
pool.on('release', (conn) => log.debug('released', conn.id))

PgBouncer modes
| Mode | Behavior | Default choice? |
|---|---|---|
| Transaction pooling | Returned after each transaction | Yes - use this |
| Session pooling | Held for entire session | Only if you need temp tables/prepared stmts |
| Statement pooling | Returned after each statement | Rarely useful |
Pool exhaustion fixes
| Symptom | Fix |
|---|---|
| "pool exhausted" errors | Check for leaks first, then increase size |
| Queries queueing | Add query timeout, move slow queries to replica |
| Idle-in-transaction | Acquire connection only when executing queries |
| Burst errors after deploy | Add connection warmup, set min pool size |
4. Async Patterns
When to use what
| Pattern | Best for | Avoid when |
|---|---|---|
| Event loop (Node, asyncio) | I/O-bound: HTTP, DB, file reads | CPU work blocks the loop |
| Thread pool (Java, Go) | CPU-bound: image processing, parsing | Threads >> cores |
| Hybrid (worker threads) | Mixed workloads | Over-engineering simple I/O services |
The rule: Waiting on external things? Async. Crunching data on CPU? Threads.
Promise/future patterns
// Fan-out (parallel I/O) - total time = max, not sum
results = await Promise.all([fetch_user(id), fetch_orders(id), fetch_prefs(id)])
// Settle all, handle partial failures
results = await Promise.allSettled([task1, task2, task3])
succeeded = results.filter(r => r.status === 'fulfilled')

Background job queues
Use when: work >500ms, user doesn't wait, must survive restarts, needs retry logic.
Queue checklist:
- Jobs are idempotent (safe to retry)
- Max retry count with exponential backoff
- Dead letter queue for permanently failed jobs
- Payload contains IDs, not full objects
- Per-job-type processing timeout
Worker pool sizing
// I/O-bound: core_count * (1 + wait_time / compute_time)
// CPU-bound: core_count (or core_count - 1 for headroom)
// Mixed: start at core_count * 2, adjust by measured utilization

5. Caching for Performance
Hot path identification
- Sort endpoints by request volume
- Check if response is cacheable (same input = same output)
- Measure per-request cost (DB queries, external calls)
- Cache where `volume * cost` is highest
Cache hit ratio targets
| Cache type | Target | Below target means |
|---|---|---|
| Hot path (profiles, config) | >95% | Bad key design or TTL too short |
| Search/feeds | >80% | High cardinality, try partial caching |
| Aggregations | >70% | Data changes too fast |
| Sessions | >99% | Store misconfigured |
A cache with <50% hit ratio is added latency and cost. Remove it.
Cache-aside with database fallback
function get_user(user_id):
cached = cache.get("user:" + user_id)
if cached != null: return cached
user = db.query("SELECT ... WHERE id = ?", user_id)
if user == null:
cache.set("user:" + user_id, NULL_SENTINEL, ttl=60) // prevent stampede
return null
cache.set("user:" + user_id, user, ttl=3600)
return user
function update_user(user_id, data):
db.update(...)
cache.delete("user:" + user_id) // DELETE, don't SET - avoids race conditions

Always invalidate by deleting, not updating. Delete + lazy reload avoids stale-write race conditions.
Materialized views as caching
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day, SUM(amount), COUNT(*)
FROM orders GROUP BY 1;
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue; -- no read lock

6. Memory and Resource Management
Memory leak detection
| Symptom | Common cause |
|---|---|
| RSS grows over hours | Unbounded caches, event listener accumulation |
| GC pauses increasing | Objects promoted to old gen |
| OOM kills | Limit too low or actual leak |
Process: Take heap snapshot after warmup. Run load 30min. Take second snapshot. Diff objects - growing counts = likely leak source.
GC tuning basics
| Concern | Lever |
|---|---|
| Frequent minor GCs | Increase young gen size |
| Long major GC pauses | Use concurrent collector (G1GC, ZGC) |
| High GC overhead (>10% CPU) | Reduce allocation rate in code first |
GC tuning is a last resort. Reduce allocation rate by fixing the code first.
Resource limits checklist
- File descriptors - Set to 65535+. Default 1024 is too low. (`ulimit -n`)
- DB connections - Match `max_connections` to total pool size across all instances
- Memory limits - Leave 10-20% headroom above peak for GC and OS
- TIME_WAIT sockets - Tune `tcp_tw_reuse` for high outbound connection rates
7. Latency Budgets
Budget allocation example (200ms total)
API Gateway: 5ms -> Auth: 15ms -> Main Service: 80ms -> Downstream: 50ms
Main Service: |-> DB: 30ms |-> Cache: 5ms |-> Compute: 20ms
Cross-cutting: Network overhead: 40ms, Serialization: 10ms

- Reserve 20% for network overhead and variance
- No single dependency gets >40% of total budget
- Set timeouts equal to budget allocation at each hop
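Setting per-hop timeouts from the remaining budget is easiest with a deadline helper. A sketch using monotonic time; the downstream client calls in the comments are hypothetical signatures, shown only to illustrate where the remaining budget gets passed.

```python
import time

def with_deadline(total_budget_s):
    """Return a callable giving the remaining budget in seconds.

    Pass it as the timeout on every downstream call, so no hop can
    spend budget that an earlier hop already consumed."""
    deadline = time.monotonic() + total_budget_s
    def remaining():
        return max(0.0, deadline - time.monotonic())
    return remaining

# remaining = with_deadline(0.200)             # 200ms total budget
# rows = db.query(sql, timeout=remaining())    # hypothetical client signatures
# resp = http_client.get(url, timeout=remaining())
```

In a distributed system, forward the remaining budget in a header (gRPC does this natively with deadlines) so downstream services inherit it.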
p50 vs p99
| Percentile | What it tells you |
|---|---|
| p50 | Typical experience - half see this or better |
| p95 | Bad-day experience - 1 in 20 requests |
| p99 | Worst realistic experience - often your biggest users |
Optimize for p99 first. A service with p50=100ms, p99=150ms is better than p50=30ms, p99=3000ms. Consistency beats average speed. At 100 requests per session, most users (roughly two in three) hit a p99-or-worse request at least once.
Tail latency amplification
When fanning out to N parallel calls, overall latency = slowest response:
| Fan-out | Impact | Mitigation |
|---|---|---|
| 1-3 services | Manageable | Standard retry + timeout |
| 5-10 services | p99 becomes common | Hedged requests (send to 2, take first) |
| 10-50 services | Tail dominates | Deadline propagation, partial results |
| 50+ services | Assume partial failure | Best-effort response, fill in async |
Key mitigations:
- Hedged requests - After p50 latency, send duplicate to another replica
- Deadline propagation - Pass remaining budget downstream
- Partial results - Return incomplete data with completeness indicator
- Request coalescing - Deduplicate identical in-flight downstream calls
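A hedged-request sketch using a thread pool. Assumptions: `call_replica(replica)` is a blocking call you supply, and the first response wins. Note the `with` block's shutdown waits for the straggler to finish; a production version would cancel or detach it instead.

```python
import concurrent.futures

def hedged_request(call_replica, replicas, hedge_after_s):
    """Send to one replica; if no response after ~p50 latency,
    duplicate the request to a second replica and take the first answer."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(call_replica, replicas[0])
        try:
            # Fast path: primary answers within the hedge delay, no extra load.
            return first.result(timeout=hedge_after_s)
        except concurrent.futures.TimeoutError:
            # Primary is slow: hedge to a second replica.
            second = pool.submit(call_replica, replicas[1])
            done, _ = concurrent.futures.wait(
                [first, second],
                return_when=concurrent.futures.FIRST_COMPLETED,
            )
            return done.pop().result()
```

Setting the hedge delay near p50 means roughly half of requests never send the duplicate, bounding the extra load.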
scalable-systems.md
Scalable Systems Reference
Opinionated defaults for mid-level backend engineers. Follow the defaults unless you have a measured reason to deviate.
1. Scaling Decision Tree
START: Is your service stateless?
|
+-- NO --> Make it stateless first. Move state to a database/cache/object store.
| Do NOT scale a stateful service horizontally without understanding the cost.
|
+-- YES --> Are you CPU or memory bound?
+-- CPU bound, single-threaded bottleneck --> Scale UP (vertical) first.
+-- Memory bound, dataset fits in one box --> Scale UP (vertical) first.
+-- Already on the largest practical instance --> Scale OUT (horizontal).
+-- Need fault tolerance across zones/regions --> Scale OUT (horizontal).
+-- Traffic is spiky and you need elasticity --> Scale OUT (horizontal).

Vertical vs Horizontal - Decision Table
| Signal | Scale UP | Scale OUT |
|---|---|---|
| Single-threaded bottleneck | YES | No |
| Need >99.99% availability | No | YES |
| Spiky/unpredictable traffic | No | YES |
| Simple ops team (< 3 engineers) | YES | No |
| Already on largest instance class | No | YES |
| Stateful workload hard to partition | YES | No |
Default: Start vertical, go horizontal when forced.
Session Affinity Pitfalls
- Session affinity (sticky sessions) makes horizontal scaling a lie - you get uneven load.
- If one node goes down, all its sticky sessions are lost.
- Fix: externalize session state to Redis/database. Make every request routable to any node.
- Only use session affinity as a temporary migration step, never as a permanent architecture.
2. Caching Patterns
Pattern Selection
| Pattern | Read Heavy | Write Heavy | Consistency Need | Complexity |
|---|---|---|---|---|
| Cache-aside | YES | No | Eventual OK | Low |
| Read-through | YES | No | Eventual OK | Medium |
| Write-through | No | YES | Strong needed | Medium |
| Write-behind | No | YES | Eventual OK | High |
Default: Cache-aside. Simplest pattern, works for 80% of use cases.
Cache-Aside (Lazy Loading)
function get(key):
value = cache.get(key)
if value is null:
value = database.get(key)
cache.set(key, value, ttl=300)
return value
function update(key, new_value):
database.update(key, new_value)
cache.delete(key) // DELETE, not SET. Avoids race conditions.

Rule: On write, DELETE the cache entry. Do not try to update it. Updating creates race conditions between concurrent writes.
Read-Through
Same as cache-aside but the cache library handles the DB fetch. Use when your cache layer supports it natively. The application only talks to the cache.
Write-Through
function update(key, new_value):
cache.set(key, new_value) // Cache layer writes to DB synchronously before returning

Use when you need strong read-after-write consistency AND you read recently written data.
Write-Behind (Write-Back)
function update(key, new_value):
cache.set(key, new_value)
queue_async_write(key, new_value) // Persisted to DB asynchronously

Use only when you can tolerate data loss on cache failure. Good for analytics counters, non-critical metrics.
TTL Strategy
| Data Type | TTL | Reasoning |
|---|---|---|
| User profile | 5-15 min | Changes infrequently, stale OK briefly |
| Product catalog | 1-5 min | Moderate change rate |
| Session data | 30 min | Matches session timeout |
| Config/feature flags | 30-60 sec | Need fast propagation |
| Computed aggregations | 1-15 min | Expensive to recompute |
Default: 5 minutes. Adjust based on how much staleness your users tolerate.
Cache Invalidation Approaches
- TTL expiry (default) - simplest, handles most cases
- Explicit delete on write - use with cache-aside
- Event-driven invalidation - publish invalidation events on write; subscribers delete keys
- Version keys - append a version number to the cache key; bump version on write
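The version-key approach in a few lines, sketched against a generic get/set cache client:

```python
def versioned_key(cache, entity, entity_id):
    """Build the real cache key from a per-entity version counter."""
    version = cache.get(f"ver:{entity}:{entity_id}") or 0
    return f"{entity}:{entity_id}:v{version}"

def invalidate(cache, entity, entity_id):
    """Bump the version. Old keys are never read again and age out via TTL."""
    current = cache.get(f"ver:{entity}:{entity_id}") or 0
    cache.set(f"ver:{entity}:{entity_id}", current + 1)
```

The win over explicit deletes: one version bump invalidates every derived key that embeds it (`user:42:v3:profile`, `user:42:v3:feed`, ...) without enumerating them.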
Redis vs Memcached
| Factor | Redis | Memcached |
|---|---|---|
| Data structures | Rich | Key-value |
| Persistence | Yes | No |
| Replication | Yes | No |
| Pub/Sub | Yes | No |
| Memory efficiency | Good | Better |
| Multi-threaded | No (mostly) | Yes |
| Max value size | 512 MB | 1 MB |
Default: Redis. Unless you only need a simple, volatile key-value cache at massive scale with no persistence needs.
Multi-Layer Caching
Request -> L1 (in-process, e.g., Caffeine/Guava) -> L2 (distributed, e.g., Redis) -> Database

- L1: In-process. Sub-millisecond. Small capacity (100 MB - 1 GB). Short TTL (30-60 sec).
- L2: Distributed. Single-digit milliseconds. Large capacity. Longer TTL.
Use L1 when you have a hot key problem or read latency matters at p99. Skip L1 when data changes frequently and consistency matters, or your service has many instances (each L1 diverges).
Cache Stampede Prevention
When a popular key expires, hundreds of requests hit the database simultaneously.
Solution 1 - Locking (default):
function get_with_lock(key):
value = cache.get(key)
if value is null:
if acquire_lock(key, timeout=5s):
value = database.get(key)
cache.set(key, value, ttl=300)
release_lock(key)
else:
sleep(50ms)
return get_with_lock(key) // Retry, lock holder populates cache
return value

Solution 2 - Probabilistic early expiry:
function get_with_early_recompute(key):
value, expiry = cache.get_with_expiry(key)
ttl_remaining = expiry - now()
if value is not null AND random() < exp(-ttl_remaining / beta):
value = database.get(key)
cache.set(key, value, ttl=300)
return value

Use locking for most cases. Use probabilistic early expiry for extremely hot keys where lock contention itself becomes a bottleneck.
3. Message Queues and Event-Driven Architecture
When to Use a Queue vs Direct Call
| Scenario | Direct Call | Queue |
|---|---|---|
| Need synchronous response | YES | No |
| Caller does not need to know the result | No | YES |
| Downstream is unreliable or slow | No | YES |
| Need to fan out to multiple consumers | No | YES |
| Processing can be deferred | No | YES |
| Request-response within 100ms SLA | YES | No |
| Need retry with backoff | No | YES |
Default: Direct call for queries, queue for commands/events that don't need an immediate response.
Pub/Sub vs Point-to-Point
- Point-to-point (queue): One message, one consumer. Use for task distribution.
- Pub/Sub (topic): One message, many consumers. Use for event notification (e.g., order placed - notify shipping AND billing AND analytics).
Delivery Semantics
| Semantic | Guarantee | Use When |
|---|---|---|
| At-most-once | May lose messages | Metrics, logs (acceptable loss) |
| At-least-once | May duplicate messages | Most business logic (DEFAULT) |
| Exactly-once | No loss, no duplicates | Financial transactions |
Default: At-least-once with idempotent consumers. Exactly-once is expensive and often an illusion - make your consumers idempotent instead.
function handle_message(msg):
if already_processed(msg.idempotency_key):
ack(msg)
return
process(msg)
mark_processed(msg.idempotency_key)
ack(msg)

Dead Letter Queues (DLQ)
- Always configure a DLQ. Set max retry count to 3-5.
- Monitor DLQ depth. Alert when it grows.
- Build tooling to replay DLQ messages after fixing the bug.
- Never silently drop messages.
Backpressure Handling Checklist
- Set max queue depth. Reject or shed load when full.
- Consumer scales with queue depth (autoscaling).
- Producer has circuit breaker if queue is unavailable.
- Monitor consumer lag. Alert when lag exceeds N minutes.
- Have a plan for falling hours behind (skip, batch, prioritize).
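The first checklist item - a bounded queue that sheds load instead of growing without limit - is a few lines with Python's standard library; the `maxsize` value is a placeholder you would derive from your latency budget:

```python
import queue

def submit(q, item):
    """Enqueue without blocking; shed load when the queue is full."""
    try:
        q.put_nowait(item)
        return True          # accepted
    except queue.Full:
        return False         # caller responds 503 / tells client to back off

work = queue.Queue(maxsize=1000)   # max queue depth: pick from your SLA, not infinity
```

Rejecting at the front door keeps latency bounded for the requests you do accept; an unbounded queue just converts overload into multi-minute lag.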
Kafka vs RabbitMQ vs SQS
| Factor | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Throughput | Very high (1M+/sec) | High (50K+/sec) | High (managed) |
| Ordering | Per-partition | Per-queue | Best-effort (FIFO*) |
| Message replay | Yes (retention) | No | No |
| Consumer groups | Native | Manual | No |
| Ops complexity | High | Medium | None (managed) |
| Exactly-once | Yes (with config) | No | FIFO dedup window |
| Best for | Event streaming, event sourcing | Task queues, RPC patterns | Simple async, serverless, AWS |
Defaults:
- AWS shop, simple async work -> SQS
- Need message replay or event streaming -> Kafka
- Complex routing, priority queues, RPC -> RabbitMQ
4. Load Balancing
Algorithm Decision Table
| Algorithm | Best For | Avoid When |
|---|---|---|
| Round-robin | Homogeneous backends, stateless | Backends have different capacity |
| Weighted round-robin | Heterogeneous backends | Load varies per request |
| Least connections | Varying request duration | Accurate connection counts are unavailable (e.g., multiplexed connections) |
| Consistent hashing | Caching layers, sticky needs | Backends frequently added/removed |
| Random | Simple, surprisingly effective | Need deterministic routing |
Default: Round-robin for stateless services. Least connections when request latency varies significantly.
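Consistent hashing is the one algorithm in the table worth seeing in code. A minimal sketch of a ring with virtual nodes (MD5 is used purely for uniform key spread, not in a security context):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring,
        # which evens out the key distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First virtual node clockwise from the key's hash; wrap at the end.
        i = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self.ring[i][1]
```

The property that matters for caching layers: removing a node only remaps the keys that were on it, so the rest of your cache stays warm.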
L4 vs L7 Load Balancing
| Feature | L4 (Transport) | L7 (Application) |
|---|---|---|
| Inspects | TCP/UDP headers | HTTP headers, URL, body |
| Performance | Faster (less parsing) | Slower (full parsing) |
| Routing | IP + port only | Path, header, cookie |
| TLS termination | Pass-through or term | Always terminates |
| Cost | Lower | Higher |
| Use case | TCP services, DBs | HTTP APIs, web apps |
Default: L7 for HTTP services (you almost always want path-based routing, header inspection, and TLS termination).
Health Check Checklist
- Liveness check: is the process alive? (simple HTTP 200)
- Readiness check: can it serve traffic? (DB connected, caches warm)
- Health endpoint checks downstream dependencies with timeouts
- Interval: 10 seconds. Threshold: 3 consecutive failures before removing.
- Health check is lightweight - no expensive DB queries.
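The liveness/readiness split from the checklist, as a sketch - the check callables here are hypothetical placeholders, and each is expected to enforce its own short timeout:

```python
def liveness():
    """Process is up - nothing else. Always cheap, always fast."""
    return 200

def readiness(checks):
    """Ready only if every downstream dependency check passes.

    `checks` maps name -> zero-arg callable returning True/False.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False   # a failing dependency must not crash the probe
    return (200 if all(results.values()) else 503), results
```

Keeping liveness dumb matters: if liveness also checks the database, a database blip makes the orchestrator restart healthy processes, turning a partial outage into a full one.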
Connection Draining
When removing a backend from the pool:
- Stop sending new requests to the node.
- Let in-flight requests complete (drain timeout: 30 seconds).
- If requests still in-flight after timeout, forcibly close.
- Shut down the node.
Never skip connection draining during deployments. It causes user-visible errors.
5. Microservices Patterns
Service Boundary Checklist
- The service owns its data (no shared database with another service).
- The service can be deployed independently.
- The service maps to a bounded context (one business domain).
- The team can make changes without coordinating with other teams.
- It is not so small that every operation requires cross-service calls.
If two services are always deployed together or always change together, merge them.
Sync vs Async Communication
| Factor | Sync (HTTP/gRPC) | Async (Events/Queues) |
|---|---|---|
| Need immediate response | YES | No |
| Temporal coupling acceptable | YES | No |
| Fan-out to many services | No | YES |
| Resilience to downstream fail | Low | High |
| Debugging/tracing difficulty | Low | High |
Default: Sync for queries (reads). Async for commands/events (writes that trigger side effects).
API Gateway Pattern
Place an API gateway in front of your microservices when:
- Clients need a single entry point (mobile, SPA).
- You need cross-cutting concerns: auth, rate limiting, request logging.
- You want to aggregate responses from multiple services.
Do NOT build a custom gateway. Use an existing one (Kong, Envoy, AWS API Gateway).
Service Mesh
Consider a service mesh (Istio, Linkerd) when you have 15+ services and need standardized observability, mTLS, and traffic management. Below that scale, skip it - the operational overhead is not worth it.
Database Per Service
Non-negotiable. Each service owns its data store. Other services access data through the service's API, never by querying the database directly.
If you need data from another service:
- Call their API (sync).
- Consume their events and maintain a local read model (async, preferred for high-read scenarios).
- Never share a database. It creates hidden coupling that makes independent deployment impossible.
6. CAP Theorem Applied
What CAP Actually Means
- Consistency (C): Every read returns the most recent write.
- Availability (A): Every request receives a response (not an error).
- Partition Tolerance (P): System continues operating despite network partitions between nodes.
Key insight: Network partitions will happen. You do not get to opt out of P. Your real choice is between C and A during a partition.
Why "CA" Does Not Exist
A "CA" system assumes the network never partitions. In a distributed system, this is fantasy. A single-node PostgreSQL is technically "CA" but it is not distributed - it is just a database. The moment you add a second node, you must handle partitions.
CP vs AP - Decision Table
| System Type | During Partition | Example Systems | Use When |
|---|---|---|---|
| CP | Rejects some requests to preserve consistency | ZooKeeper, etcd, HBase, MongoDB* (with majority write concern) | Correctness > availability. Financial data, leader election, distributed locks, inventory counts |
| AP | Serves possibly stale data to stay available | Cassandra, DynamoDB, CouchDB, DNS | Availability > consistency. Shopping carts, social feeds, user preferences, analytics |
*MongoDB is configurable - it can behave as CP or AP depending on read/write concern settings.
Real-World Examples
CP - Bank account balance: User withdraws during a partition. CP rejects the request rather than risk an overdraft. Temporary unavailability beats losing money.
AP - Social media feed: During a partition, a 30-second stale feed beats an error page. Users notice downtime but not slight staleness.
Practical Checklist
- Identify: Is this data correctness-critical or availability-critical?
- Correctness-critical (money, inventory, auth) - choose CP.
- Availability-critical (feeds, recommendations, caches) - choose AP.
- Most systems are a mix - different data stores for different needs.
- Do not over-index on CAP. Partitions are rare. Focus on latency vs consistency tradeoffs for the common case.
Quick Reference - Opinionated Defaults
| Decision | Default |
|---|---|
| Scaling strategy | Vertical first, horizontal when forced |
| Caching pattern | Cache-aside with TTL |
| Cache technology | Redis |
| Cache TTL | 5 minutes |
| Cache invalidation | Delete on write + TTL expiry |
| Stampede prevention | Locking |
| Message queue (AWS) | SQS |
| Message queue (streaming) | Kafka |
| Delivery semantics | At-least-once + idempotent consumers |
| Load balancer algorithm | Round-robin (stateless), least-conn (varied) |
| Load balancer layer | L7 for HTTP services |
| Service communication | Sync for reads, async for events |
| Database sharing | Never. Database per service. |
| CAP for money | CP |
| CAP for feeds | AP |
schema-design.md
Database Schema Design Reference
1. Normalization Levels
| Normal Form | Rule | Violation Example |
|---|---|---|
| 1NF | Every column holds atomic (single) values; no repeating groups | tags = "api,auth,billing" in one column |
| 2NF | 1NF + every non-key column depends on the entire primary key | Composite PK (order_id, product_id) with customer_name depending only on order_id |
| 3NF | 2NF + no non-key column depends on another non-key column | zip_code -> city stored alongside other customer fields |
-- 1NF violation: repeating group
CREATE TABLE orders (id SERIAL PRIMARY KEY, items TEXT); -- "widget:3,gadget:1"
-- Fix: separate order_items table with one row per item
-- 2NF violation: partial key dependency
CREATE TABLE order_details (
order_id INT, product_id INT, customer_name TEXT, quantity INT,
PRIMARY KEY (order_id, product_id)
-- customer_name depends only on order_id; move it to orders table
);
-- 3NF violation: transitive dependency
CREATE TABLE customers (
id SERIAL PRIMARY KEY, name TEXT, zip_code TEXT,
city TEXT -- city depends on zip_code, not customer; extract to zip_codes table
);
When to Stop Normalizing
Default: normalize to 3NF. Deviate when you have measured evidence.
| Scenario | Recommendation |
|---|---|
| OLTP with mixed reads/writes | 3NF - normalize fully |
| Read-heavy analytics / dashboards | Denormalize into materialized views or reporting tables |
| Hot-path query joining 5+ tables | Denormalized read model alongside the normalized source |
| Write-heavy ingestion (logs, events) | Append-only flat tables; normalize downstream |
Rule of thumb: Normalize the source of truth. Denormalize the read path. Never denormalize your only copy.
2. Indexing Strategies
Index Types
| Index Type | Best For | Avoid When |
|---|---|---|
| B-tree (default) | Range queries, equality, sorting, BETWEEN | Full-text search, array containment |
| Hash | Exact equality lookups only | Range queries, sorting |
| GIN | Array containment (@>), full-text search, JSONB | Simple scalar equality; high-write tables |
| GiST | Geometric/spatial data, range types, nearest-neighbor | Simple equality or range on scalars |
Default choice: B-tree. Switch only when the query pattern demands it.
Composite Index Column Ordering
- Leftmost prefix rule - an index on (a, b, c) supports (a), (a, b), (a, b, c), but NOT (b, c) alone.
- Equality columns first, then range/sort columns.
-- Query: WHERE tenant_id = ? AND created_at > ? ORDER BY created_at
CREATE INDEX idx_tenant_created ON events (tenant_id, created_at); -- correct
-- NOT: (created_at, tenant_id) -- breaks leftmost prefix for tenant-scoped queries
Covering and Partial Indexes
-- Covering: DB never touches the heap (index-only scan)
CREATE INDEX idx_users_email ON users (email) INCLUDE (name);
-- Partial: index only the rows you actually query
CREATE INDEX idx_orders_pending ON orders (created_at) WHERE status = 'pending';
When NOT to Index
| Scenario | Why |
|---|---|
| Column with < 5 distinct values on a large table | Low selectivity; full scan is often faster |
| Write-heavy table with rarely queried columns | Every index slows INSERT/UPDATE/DELETE |
| Small tables (< 10k rows) | Sequential scan is fast enough |
| Columns only in SELECT, never in WHERE/JOIN/ORDER BY | Index won't be used |
Reading EXPLAIN Output
Seq Scan -- full table scan; missing index or low selectivity
Index Scan -- good; using index, fetching from heap
Index Only Scan -- best; all data from the index (covering index)
Bitmap Index Scan -- combines multiple indexes; okay for moderate selectivity
Nested Loop -- fine for small outer sets; watch for large outer sets
Hash Join -- good for large equi-joins
Sort -- check for "external merge Disk" (spilling to disk = bad)
EXPLAIN checklist:
- No unexpected Seq Scan on large tables
- rows estimate within 10x of actual rows (with EXPLAIN ANALYZE)
- No Sort Method: external merge Disk
- Join order makes sense (smaller table drives the join)
3. Migration Safety Patterns
The Expand-Contract Pattern
Use for any rename, type change, or restructure with zero downtime.
| Step | Action | Example |
|---|---|---|
| 1. Expand | Add new structure alongside old | ALTER TABLE users ADD COLUMN handle TEXT; |
| 2. Dual-write | Write to both columns; read new with fallback to old | Application code change |
| 3. Backfill | Populate new column for existing rows (in batches) | UPDATE users SET handle = username WHERE handle IS NULL LIMIT 5000; |
| 4. Cutover | All code reads/writes only new column | Application code change |
| 5. Contract | Drop old column in subsequent deploy | ALTER TABLE users DROP COLUMN username; |
Online DDL Safety
| Operation | Safe Online? | Notes |
|---|---|---|
| ADD COLUMN (nullable, no default) | Yes | Metadata-only in most DBs |
| ADD COLUMN with default | Depends | PostgreSQL 11+ safe; older versions rewrite table |
| DROP COLUMN | Yes (logically) | Mark unused first |
| RENAME COLUMN | No | Breaks running application code instantly |
| ALTER COLUMN TYPE | No | May rewrite table; use expand-contract |
| ADD INDEX | Use CONCURRENTLY | Avoids write locks |
| ADD NOT NULL | Risky | Full table scan for validation; add as check constraint first |
Zero-Downtime Migration Checklist
- Migration only adds - no renames, drops, or type changes
- New columns are nullable or have a safe default
- Indexes created with CONCURRENTLY (or equivalent)
- No full-table locks held for more than a few seconds
- Application code backward-compatible with old and new schema
- Backfills run in batches (1k-10k rows), not a single UPDATE
- Rollback plan documented
- Tested against production-sized dataset
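The "batches, not a single UPDATE" rule above can be made concrete. A sketch using SQLite purely as a stand-in for your production database - the `users`/`handle` columns match the expand-contract example, and the batch size and throttle are tunables, not gospel:

```python
import sqlite3
import time

def backfill(conn, batch=5000, throttle=0.0):
    """Copy username into handle in keyed batches (expand-contract step 3)."""
    cursor = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM users WHERE handle IS NULL AND id > ? "
            "ORDER BY id LIMIT ?",
            (cursor, batch),
        ).fetchall()
        if not rows:
            break
        ids = [r[0] for r in rows]
        placeholders = ",".join("?" * len(ids))
        conn.execute(
            f"UPDATE users SET handle = username WHERE id IN ({placeholders})",
            ids,
        )
        conn.commit()          # short transactions: locks held briefly
        cursor = ids[-1]       # keyset pagination, not OFFSET
        time.sleep(throttle)   # throttle to limit replication lag
```

Keying on `id` rather than re-scanning `WHERE handle IS NULL` from the start keeps each batch cheap even on large tables.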
Backfill Pattern
-- Pseudocode: batched backfill
cursor = 0
LOOP:
rows = SELECT id FROM table WHERE new_col IS NULL AND id > cursor
ORDER BY id LIMIT 5000
IF rows IS EMPTY: BREAK
UPDATE table SET new_col = derive(old_col) WHERE id IN (rows)
cursor = MAX(rows.id)
SLEEP 100ms -- throttle to reduce replication lag
4. Schema Evolution
Nullable Columns vs New Tables
| Approach | Use When |
|---|---|
| Nullable column | Data belongs to same entity; most rows will eventually have it; simple scalar |
| New table (1:1 FK) | Data is sparse (< 20% of rows); complex sub-object; different access patterns |
JSON Columns
| JSON is okay when... | JSON is tech debt when... |
|---|---|
| Schema-less by nature (user prefs, feature flags) | You query individual fields in WHERE clauses |
| Rarely queried, mostly stored/retrieved whole | You need referential integrity on nested data |
| Evolves faster than your deploy cycle | Multiple tables reference the same nested structure |
| Event payloads, audit log metadata | You write data->>'field' in every query |
If you index a JSON field more than twice, extract it into a real column.
Soft Deletes vs Hard Deletes
| Factor | Soft Delete (deleted_at) | Hard Delete (DELETE) |
|---|---|---|
| Data recovery | Easy - clear the flag | Requires backups |
| Query complexity | Every query needs WHERE deleted_at IS NULL | Queries are simple |
| Storage | Grows forever | Reclaims space |
| GDPR compliance | Does NOT satisfy "right to erasure" | Satisfies erasure |
| Performance over time | Table bloat slows queries | Table stays lean |
Recommended default: Hard delete with an audit log table for recovery needs.
CREATE TABLE audit_log (
id BIGSERIAL PRIMARY KEY,
table_name TEXT NOT NULL,
record_id BIGINT NOT NULL,
action TEXT NOT NULL, -- 'INSERT', 'UPDATE', 'DELETE'
old_data JSONB,
new_data JSONB,
performed_by TEXT,
performed_at TIMESTAMPTZ DEFAULT now()
);
If you must soft delete, use a partial index: CREATE INDEX idx_active ON users (email) WHERE deleted_at IS NULL;
Temporal Tables
When you need full history of every change, maintain a _history table:
CREATE TABLE products_history (
history_id BIGSERIAL PRIMARY KEY,
id INT NOT NULL, -- FK to products
name TEXT NOT NULL,
price_cents INT NOT NULL,
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ, -- NULL = current version
changed_by TEXT
);
-- Populate via trigger or application hook on every UPDATE/DELETE
5. Data Types
Primary Keys: UUID vs Auto-Increment
| Factor | Auto-Increment (BIGSERIAL) | UUID (v4 or v7) |
|---|---|---|
| Size | 8 bytes | 16 bytes |
| Index performance | Excellent (sequential, no page splits) | v4: poor (random). v7: good (time-sorted) |
| Multi-source merge | Conflicts guaranteed | No conflicts |
| URL guessability | Trivially enumerable | Not guessable |
| Client-side generation | Not possible | Possible (offline/distributed) |
Recommended default: BIGSERIAL PK + separate public UUID column.
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
public_id UUID NOT NULL DEFAULT gen_random_uuid() UNIQUE,
email TEXT NOT NULL
);
-- Internal joins use 'id' (fast). APIs expose 'public_id' (safe).
For distributed/multi-region systems, use UUIDv7 (time-sorted) as the PK directly.
Timestamps
Always use TIMESTAMPTZ. No exceptions. Store UTC, convert in the application layer.
created_at TIMESTAMPTZ NOT NULL DEFAULT now() -- good
created_at TIMESTAMP NOT NULL DEFAULT now() -- bad: loses timezone context
Money
Store as integer cents. Never use FLOAT or DOUBLE.
| Approach | Pros | Cons |
|---|---|---|
| INT (cents) | Fast, no rounding, easy math | Must convert for display |
| NUMERIC(p,s) | Exact decimals, multi-currency friendly | Slower than integer math |
| FLOAT/DOUBLE | Never use for money | 0.1 + 0.2 = 0.30000000000000004 |
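The float failure in the table takes one line to reproduce, and integer cents sidestep it entirely:

```python
# IEEE 754 floats accumulate binary rounding error
assert 0.1 + 0.2 != 0.3
assert 0.1 + 0.2 == 0.30000000000000004

# Integer cents: exact arithmetic, convert only at the display boundary
price_cents = 1999
qty = 3
total_cents = price_cents * qty
display = f"${total_cents // 100}.{total_cents % 100:02d}"
assert display == "$59.97"
```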
Enums: Column vs Lookup Table
| Approach | Use When | Avoid When |
|---|---|---|
| Check constraint | < 5 values, almost never changes | Values change quarterly |
| Enum type (PG CREATE TYPE) | Moderate set, rarely changes | Need to remove values (PG enums can't drop) |
| Lookup table | Values change often, need metadata, referenced across tables | Only 2-3 static values |
Recommended default: Check constraint for tiny static sets; lookup table for everything else.
-- Check constraint
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
status TEXT NOT NULL CHECK (status IN ('pending', 'confirmed', 'shipped', 'delivered'))
);
-- Lookup table
CREATE TABLE order_statuses (
id SMALLINT PRIMARY KEY,
name TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
status_id SMALLINT NOT NULL REFERENCES order_statuses(id)
);
Quick Decision Cheat Sheet
| Question | Default | Deviate When |
|---|---|---|
| How far to normalize? | 3NF | Read-heavy reporting needs denormalized views |
| Which index type? | B-tree | Arrays/JSONB (GIN), spatial (GiST), pure equality (hash) |
| Composite index order? | Equality first, then range/sort | Benchmark proves otherwise |
| Column rename strategy? | Expand-contract | Maintenance window is acceptable |
| Soft or hard delete? | Hard delete + audit log | Undo feature required in product |
| UUID or auto-increment? | BIGSERIAL + public UUID | Distributed system (use UUIDv7) |
| Timestamp type? | TIMESTAMPTZ, always | Never deviate |
| Money storage? | Integer cents | Multi-currency precision (use NUMERIC) |
| Enum or lookup table? | Check constraint < 5; lookup table otherwise | PG enum if values never removed |
| JSON column? | Only for schema-less data | Extract fields you query or index |
security.md
Security Reference
Opinionated security guidance for backend engineers. When in doubt, pick the more restrictive option. Security mistakes are silent until they are catastrophic.
1. Authentication Patterns
Comparison
| Method | Best for | Token location | Revocable? | Stateless? |
|---|---|---|---|---|
| Session-based | Traditional web apps | httpOnly cookie (server stores session) | Yes (delete from store) | No |
| JWT | SPAs, mobile, microservices | httpOnly cookie (preferred) | Not natively (use deny-list) | Yes |
| API keys | Service-to-service, third-party | Authorization header | Yes (rotate key) | Yes |
Recommendation: Use JWT in httpOnly cookies for user-facing apps. Use API keys + HMAC signatures for service-to-service. Never store tokens in localStorage or sessionStorage - XSS can steal them.
OAuth 2.0 / OIDC Flows
| Flow | Use when |
|---|---|
| Authorization Code + PKCE | User-facing web/mobile apps (always use PKCE, even for confidential clients) |
| Client Credentials | Service-to-service with no user context |
| Device Code | CLI tools, smart TVs, IoT |
Never use Implicit flow - it is deprecated. Never use Resource Owner Password flow unless migrating a legacy system and you control both client and server.
Token Refresh Pattern
1. Access token: short-lived (5-15 minutes)
2. Refresh token: longer-lived (7-30 days), stored httpOnly cookie, rotated on use
3. On 401 -> call /token/refresh with refresh token
4. Server issues new access + refresh token pair, invalidates old refresh token
5. If refresh token is reused (already rotated) -> revoke entire session (token theft)
MFA Considerations
- Prefer TOTP (app-based) over SMS (SIM-swapping risk)
- Support WebAuthn/passkeys as the strongest option
- Store MFA secrets encrypted at rest, never in plain text
- Implement MFA as a separate authentication step - do not issue full session until MFA completes
- Provide recovery codes (one-time use, hashed like passwords)
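App-based TOTP (the recommended option above) is small enough to sketch from the standard library - RFC 6238 with SHA-1 and 6 digits, the defaults most authenticator apps use. This is illustrative; in production, use a vetted library rather than hand-rolling:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, t=None, step=30, digits=6):
    """RFC 6238 TOTP: HMAC over the current 30-second counter, truncated."""
    key = base64.b32decode(secret_b32)
    counter = int((t if t is not None else time.time()) // step)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                     # RFC 4226 dynamic truncation
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)
```

Verification should also accept the previous window to tolerate clock skew, and must rate-limit attempts - a 6-digit code survives only a handful of guesses.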
2. Authorization
Model Comparison
| Model | How it works | Best for | Complexity |
|---|---|---|---|
| RBAC | User -> Role -> Permissions | Simple apps, internal tools | Low |
| ABAC | Policy evaluated against user/resource/environment attributes | Fine-grained rules, compliance-heavy | Medium |
| ReBAC | Permissions derived from relationships (user owns resource) | Social apps, multi-tenant, Google Zanzibar-style | High |
Recommendation: Start with RBAC. Move to ReBAC when you need relationship-based checks (e.g., "user can edit this document because they are in the owning org"). Use ABAC when you need policy rules that combine time, location, and resource attributes.
Enforcement Rules
- Default deny - if no rule explicitly grants access, deny
- Enforce at middleware layer - every route/resolver must pass through auth check
- Never trust client-side auth checks alone - they are UX hints, not security
- Check permissions on the resource, not just the route - a valid user hitting /api/orders/123 must own order 123 or have admin role
Row-Level Security Pattern
// Middleware approach
function authorize(user, resource_id):
resource = db.find(resource_id)
if resource is null:
return 404 // don't leak existence
if user.role == "admin":
return allow
if resource.owner_id != user.id:
return 403
return allow
// Database approach (Postgres RLS)
// ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
// CREATE POLICY user_orders ON orders
// USING (owner_id = current_setting('app.user_id')::int);
Permission Checking Patterns
// Centralized permission check - preferred
function checkPermission(user, action, resource):
permissions = getPermissionsForRole(user.role)
if action not in permissions:
throw Forbidden
if resource.requiresOwnership and resource.owner_id != user.id:
throw Forbidden
// Usage in every handler
function handleDeleteOrder(request):
checkPermission(request.user, "orders:delete", order)
// ... proceed
3. OWASP Top 10 - Practical Prevention
| # | Vulnerability | Prevention | Code pattern |
|---|---|---|---|
| A01 | Broken Access Control | Default deny, server-side checks on every request, deny by HTTP method | checkPermission() middleware on all routes |
| A02 | Cryptographic Failures | TLS everywhere, AES-256 at rest, Argon2 for passwords | Never log secrets, PII, or tokens |
| A03 | Injection | Parameterized queries, ORM with bind params, never string-concat SQL | db.query("SELECT * FROM users WHERE id = $1", [id]) |
| A04 | Insecure Design | Threat modeling, abuse case stories, rate limiting business logic | Limit failed login attempts per account |
| A05 | Security Misconfiguration | Harden defaults, disable debug in prod, remove unused endpoints | Automated config scanning in CI |
| A06 | Vulnerable Components | Dependency scanning (Dependabot, Snyk), pin versions, update regularly | npm audit / pip audit in CI pipeline |
| A07 | Auth Failures | MFA, rate limit login, lock after N failures, credential stuffing protection | Argon2 hashing, no default credentials |
| A08 | Data Integrity Failures | Verify signatures, use integrity checks on CI/CD, sign artifacts | Verify dependency checksums |
| A09 | Logging Failures | Log auth events, access denied, input validation failures | Structured logs with user ID, IP, action |
| A10 | SSRF | Allowlist outbound URLs, block internal IPs, validate/sanitize URLs | Reject 127.0.0.1, 169.254.x.x, 10.x.x.x in user input |
Injection Prevention Checklist
- All SQL uses parameterized queries or ORM with bind parameters
- No string concatenation or template literals in any query
- NoSQL queries use typed parameters (no raw $where or $regex from user input)
- OS command execution is avoided; if unavoidable, use allowlist of commands
- LDAP queries use proper escaping
Rate Limiting Login
// Per-account + per-IP limiting
function handleLogin(ip, email, password):
if rateLimiter.isBlocked(key="login:" + email, max=5, window=15min):
return 429 "Too many attempts, try again later"
if rateLimiter.isBlocked(key="login-ip:" + ip, max=20, window=15min):
return 429
result = authenticate(email, password)
if result == failure:
rateLimiter.increment("login:" + email)
rateLimiter.increment("login-ip:" + ip)
return 401 // use generic message: "Invalid email or password"
rateLimiter.reset("login:" + email)
return 200 with session
4. Secrets Management
Environment by Maturity
| Environment | Approach | Acceptable? |
|---|---|---|
| Local dev | .env file (in .gitignore) | Yes |
| CI/CD | Pipeline secrets (GitHub Actions secrets, GitLab CI vars) | Yes |
| Staging/Prod | Secrets manager (Vault, AWS Secrets Manager, GCP Secret Manager) | Required |
| Anywhere | Hardcoded in source code | Never |
Recommendations
- Use HashiCorp Vault for multi-cloud or on-prem. Use AWS Secrets Manager if all-in on AWS. Use GCP Secret Manager if all-in on GCP.
- Rotate secrets regularly - automate rotation with the secrets manager's built-in rotation (e.g., AWS Secrets Manager supports Lambda-based rotation)
- Least privilege - each service gets only the secrets it needs
- Audit access - log every secret read
Secret Hygiene Checklist
- .env is in .gitignore
- No secrets in Dockerfiles, docker-compose files, or Kubernetes manifests
- No secrets in application logs (mask or redact)
- Secret scanning enabled in CI (e.g., gitleaks, trufflehog, GitHub secret scanning)
- Pre-commit hook blocks commits containing high-entropy strings or known patterns
- Secrets are fetched at runtime, not baked into container images
- All secrets have a documented rotation schedule
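The "high-entropy strings" pre-commit heuristic can be approximated with Shannon entropy over each candidate token. A sketch only - real scanners like gitleaks combine entropy with known-credential patterns, and you will need an allowlist for false positives:

```python
import math

def shannon_entropy(s):
    """Bits of entropy per character, estimated from character frequencies."""
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

def looks_like_secret(token, min_len=20, threshold=4.0):
    """Flag long, high-entropy tokens (API keys, random credentials)."""
    return len(token) >= min_len and shannon_entropy(token) > threshold
```

English prose sits around 3 bits/char; random base64 keys sit near 6, so a threshold of ~4 separates them reasonably well.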
If a Secret Leaks
1. Rotate the compromised secret immediately
2. Audit access logs for unauthorized use
3. Determine blast radius (what systems/data were accessible?)
4. Notify affected parties per incident response policy
5. Post-mortem: how did it leak? Fix the process.
5. Encryption
Encryption at Rest and in Transit
| Context | Algorithm | Notes |
|---|---|---|
| Data in transit | TLS 1.3 (minimum TLS 1.2) | Terminate at load balancer, re-encrypt to backend if needed |
| Data at rest (general) | AES-256-GCM | Use cloud provider's KMS for key management |
| Password hashing | Argon2id | Preferred. Fallback: bcrypt (cost 12+). Never MD5, SHA-1, SHA-256 |
| API request signing | HMAC-SHA256 | Sign method + path + timestamp + body hash |
| File integrity | SHA-256 | For checksums and verification only, never for passwords |
Password Hashing Decision
Use Argon2id (winner of Password Hashing Competition)
- memory: 64MB minimum (higher is better)
- iterations: 3 minimum
- parallelism: 1 (or match CPU cores)
If Argon2id unavailable:
Use bcrypt with cost factor 12+
If bcrypt unavailable:
Use scrypt with N=2^15, r=8, p=1
Never use:
- MD5, SHA-1, SHA-256 for passwords (fast hashes = fast brute force)
- Unsalted hashes of any kind
- Custom/homegrown hashing schemes
HMAC for API Signatures
// Signing (client side)
function signRequest(method, path, body, secret, timestamp):
payload = method + "\n" + path + "\n" + timestamp + "\n" + sha256(body)
signature = hmac_sha256(secret, payload)
return base64(signature)
// Verification (server side)
function verifyRequest(request, secret):
timestamp = request.headers["X-Timestamp"]
if abs(now() - timestamp) > 5 minutes:
return reject // replay attack prevention
expected = signRequest(request.method, request.path, request.body, secret, timestamp)
if not constantTimeCompare(expected, request.headers["X-Signature"]):
return reject
return accept
Key Management Rules
- Never store encryption keys alongside encrypted data
- Use cloud KMS (AWS KMS, GCP Cloud KMS, Azure Key Vault) for key management
- Implement key rotation - encrypt new data with new key, re-encrypt old data on access
- Use envelope encryption: KMS encrypts a data key, data key encrypts actual data
- Log all key usage for audit
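The HMAC signing pseudocode above maps almost one-to-one onto Python's hmac and hashlib modules; note the constant-time compare, which prevents timing attacks on signature verification:

```python
import base64
import hashlib
import hmac
import time

def sign_request(method, path, body, secret, timestamp):
    """HMAC-SHA256 over method + path + timestamp + body hash."""
    payload = "\n".join([method, path, str(timestamp),
                         hashlib.sha256(body).hexdigest()])
    mac = hmac.new(secret, payload.encode(), hashlib.sha256).digest()
    return base64.b64encode(mac).decode()

def verify_request(method, path, body, secret, timestamp, signature, now=None):
    if abs((now if now is not None else time.time()) - timestamp) > 300:
        return False                                    # replay-attack window
    expected = sign_request(method, path, body, secret, timestamp)
    return hmac.compare_digest(expected, signature)     # constant-time compare
```

A plain `==` on signatures leaks how many leading bytes matched through response timing; `hmac.compare_digest` exists precisely to avoid that.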
6. Security Headers and Transport
Required Headers
| Header | Value | Purpose |
|---|---|---|
| Strict-Transport-Security | max-age=63072000; includeSubDomains; preload | Force HTTPS for 2 years |
| Content-Security-Policy | default-src 'self'; script-src 'self' | Prevent XSS via inline scripts |
| X-Content-Type-Options | nosniff | Prevent MIME type sniffing |
| X-Frame-Options | DENY | Prevent clickjacking |
| Referrer-Policy | strict-origin-when-cross-origin | Limit referrer leakage |
| Permissions-Policy | camera=(), microphone=(), geolocation=() | Disable unused browser features |
CORS Configuration
// Be explicit. Never use origin: "*" with credentials.
cors_config = {
allowed_origins: ["https://app.example.com"], // explicit list, no wildcards
allowed_methods: ["GET", "POST", "PUT", "DELETE"],
allowed_headers: ["Authorization", "Content-Type"],
allow_credentials: true,
max_age: 86400 // preflight cache: 24 hours
}
CORS rules:
- Never allow * origin with credentials: true
- Allowlist specific origins, do not reflect the Origin header blindly
- Keep max_age high to reduce preflight requests
Rate Limiting
| Strategy | Use case | Implementation |
|---|---|---|
| Token bucket | User-facing APIs | Allows bursts, refills at steady rate |
| Fixed window | Internal services | Simple, predictable |
| Sliding window | High-accuracy rate limiting | Smooths boundary spikes |
- Always return 429 Too Many Requests with a Retry-After header
- Rate limit by authenticated user ID, fall back to IP for unauthenticated requests
- Set request body size limits (e.g., 1MB default, higher for file upload endpoints)
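A token bucket - the table's default for user-facing APIs - fits in a few lines. A sketch with in-process state; in a real deployment, per-user buckets would live in Redis so all instances share them:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens, self.last = capacity, clock()

    def allow(self):
        now = self.clock()
        # Lazy refill: add tokens for the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # respond 429 with Retry-After
```

The injectable `clock` is just for testability; `capacity` sets the burst size and `rate` the sustained throughput.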
Input Sanitization
- Validate at system boundaries (API gateway, controller layer)
- Use allowlists for expected formats (regex for email, UUID pattern for IDs)
- Reject unexpected fields (strict schema validation)
- Encode output contextually (HTML-encode for HTML, JSON-encode for JSON)
- Never trust the Content-Type header alone - validate actual content
7. Security Checklist for New Services
Use this checklist before deploying any new backend service.
Authentication and Authorization
- Authentication is required on all non-public endpoints
- Authorization checks enforce default deny
- Permissions checked per-request at middleware level, not just at the edge
- Password hashing uses Argon2id (or bcrypt with cost 12+)
- Session/token expiration is configured (access: 15min, refresh: 30 days max)
- Failed login attempts are rate-limited per account and per IP
- MFA is available for privileged accounts
Transport and Encryption
- TLS 1.2+ required on all endpoints (TLS 1.3 preferred)
- HSTS header set with long max-age
- Sensitive data encrypted at rest using AES-256-GCM
- Database connections use TLS
- Internal service-to-service calls use mTLS or signed requests
Secrets
- No secrets in source code, Dockerfiles, or CI config files
- Secrets loaded from a secrets manager at runtime
- .env in .gitignore
- Rotation schedule documented for all secrets
Dependencies
- Dependency vulnerability scanning in CI (Dependabot, Snyk, or npm audit)
- No dependencies with known critical CVEs
- Dependencies pinned to specific versions (lockfile committed)
- Base container images are minimal and regularly updated
Logging and Monitoring
- Authentication events logged (login, logout, failure, MFA)
- Authorization failures logged with user ID, IP, and attempted resource
- No secrets, tokens, or passwords in logs
- PII in logs is masked or excluded
- Alerts configured for anomalous auth patterns (spike in 401s/403s)
Input Validation
- All SQL queries use parameterized statements
- Request body size limits enforced
- Input validated against strict schemas at API boundary
- File uploads validated for type, size, and scanned for malware
- CORS configured with explicit origin allowlist
Security Headers
- HSTS, CSP, X-Content-Type-Options, X-Frame-Options all set
- CORS does not use wildcard origin with credentials
- API versioning in place to avoid breaking changes silently
Quick Decision Reference
"Which auth pattern should I use?"
- User-facing web app -> OAuth 2.0 Authorization Code + PKCE, JWT in httpOnly cookie
- Mobile app -> Same as web, with secure storage for refresh token
- Service-to-service -> Client Credentials flow or API key + HMAC
- Third-party integration -> API key with scoped permissions
"Which hashing algorithm?"
- Passwords -> Argon2id, always
- Integrity checks -> SHA-256
- API signing -> HMAC-SHA256
- Never -> MD5, SHA-1 for anything security-related
"Where do I put secrets?"
- Dev -> .env file (gitignored)
- CI -> Pipeline secret variables
- Prod -> Secrets manager (Vault / AWS / GCP), fetched at runtime
Frequently Asked Questions
What is backend-engineering?
Use this skill when designing backend systems, databases, APIs, or services. Triggers on schema design, database migrations, indexing strategies, distributed systems architecture, microservices, caching, message queues, observability setup, logging, metrics, tracing, SLO/SLI definition, performance optimization, query tuning, security hardening, authentication, authorization, API design (REST, GraphQL, gRPC), rate limiting, pagination, and failure handling patterns. Acts as a senior backend engineering advisor for mid-level engineers leveling up.
How do I install backend-engineering?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill backend-engineering in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support backend-engineering?
backend-engineering works with claude-code, gemini-cli, openai-codex, mcp. Install it once and use it across any supported AI coding agent.