chaos-engineering
Use this skill when implementing chaos engineering practices, designing fault injection experiments, running game days, or improving system resilience. Triggers on chaos engineering, fault injection, Chaos Monkey, Litmus, game days, resilience testing, failure modes, blast radius, and any task requiring controlled failure experimentation.
chaos-engineering is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It helps you implement chaos engineering practices, design fault injection experiments, run game days, and improve system resilience.
Quick Facts
| Field | Value |
|---|---|
| Category | engineering |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill chaos-engineering
- The chaos-engineering skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
A practitioner's framework for running controlled failure experiments in production systems. This skill covers how to design, execute, and learn from chaos experiments - from simple latency injections to full game days - with an emphasis on safety, minimal blast radius, and translating findings into durable resilience improvements.
Tags
chaos-engineering resilience fault-injection reliability game-days
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is chaos-engineering?
Use this skill when implementing chaos engineering practices, designing fault injection experiments, running game days, or improving system resilience. Triggers on chaos engineering, fault injection, Chaos Monkey, Litmus, game days, resilience testing, failure modes, blast radius, and any task requiring controlled failure experimentation.
How do I install chaos-engineering?
Run `npx skills add AbsolutelySkilled/AbsolutelySkilled --skill chaos-engineering` in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support chaos-engineering?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Chaos Engineering
A practitioner's framework for running controlled failure experiments in production systems. This skill covers how to design, execute, and learn from chaos experiments - from simple latency injections to full game days - with an emphasis on safety, minimal blast radius, and translating findings into durable resilience improvements.
When to use this skill
Trigger this skill when the user:
- Wants to design a chaos experiment or fault injection scenario
- Is setting up a chaos engineering program from scratch
- Needs to implement network latency, packet loss, or service dependency failures
- Is planning or facilitating a game day exercise
- Needs to validate circuit breakers, retries, or failover logic under real failure conditions
- Wants to measure and improve MTTR (Mean Time to Recovery)
- Is evaluating chaos tooling (Chaos Monkey, Litmus, Gremlin, AWS Fault Injection Simulator)
Do NOT trigger this skill for:
- Writing standard retry or circuit breaker code without the intent to test it under chaos (use backend-engineering skill)
- Load testing or performance benchmarking that does not involve failure injection (use performance-engineering skill)
Key principles
**Define steady state before breaking anything** - You cannot detect a deviation without a baseline. Before every experiment, define the precise metric (p99 latency, error rate, success count) that proves the system is healthy. If the system is already degraded, stop and fix it first.
**Start small in staging, graduate to production slowly** - Every experiment starts in a non-production environment. Only move to production after the hypothesis is proven correct in staging and the blast radius is understood. Even in production, target a small traffic percentage or a single availability zone first.
**Minimize blast radius** - The experiment scope must be as small as possible. Isolate the failure to one service, one host, or one region. Have a kill switch ready before starting. The goal is learning, not causing an incident.
**Build the hypothesis before turning on the failure** - A hypothesis has three parts: "When X happens, the system will Y, as evidenced by Z metric." Without a pre-written hypothesis you cannot distinguish a passing experiment from an outage.
**Automate experiments and run them continuously** - A chaos experiment run once is a one-time curiosity. Automated experiments that run on every deploy catch regressions before production. The goal is a resilience gate in CI/CD, not a quarterly fire drill.
Core concepts
Steady State Hypothesis
The foundation of every experiment. A steady state is a measurable, normal behavior of the system:
Hypothesis template:
"Under normal conditions, [service] processes [metric] at [baseline value].
When [failure condition] is introduced, [metric] will remain within [acceptable range]
because [resilience mechanism] will compensate."
Example:
"Under normal conditions, the checkout service processes 95% of requests in <500ms.
When the inventory service has 500ms of added latency, checkout p99 will remain
<800ms because the circuit breaker will open and return cached availability data."
Metrics for steady state (RED method):
- Rate - requests per second
- Errors - error rate (%)
- Duration - latency percentiles (p50, p95, p99)
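The RED baseline above can be encoded as a simple gate that an automated experiment evaluates before and during injection. A minimal sketch - the metric names and thresholds are illustrative, not tied to any particular monitoring stack:

```python
# Hypothetical steady-state gate. Metric names and baseline bands are
# illustrative; wire the measured values in from your own metrics source.
BASELINE = {
    "rate_rps": (900, None),      # (min, max): at least 900 requests/s
    "error_rate": (None, 0.005),  # at most 0.5% errors
    "p99_ms": (None, 500),        # p99 latency under 500ms
}

def steady_state_holds(measured: dict) -> bool:
    """Return True only if every RED metric sits inside its baseline band."""
    for metric, (low, high) in BASELINE.items():
        value = measured[metric]
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True
```

An automated experiment runner would call this once before injection (abort if it already fails) and repeatedly during the injection window.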
Blast Radius
The maximum potential impact of the experiment if something goes wrong. Always quantify before starting:
| Blast radius dimension | Example | How to constrain |
|---|---|---|
| Traffic percentage | 5% of prod requests | Feature flags, canary routing |
| Infrastructure scope | 1 of 3 availability zones | Target specific AZ tags |
| Service scope | One pod/instance in the fleet | Target single hostname |
| Time scope | 10-minute window | Automated kill switch with timeout |
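The time-scope constraint in the table can be enforced in code rather than trusted to an operator's attention. A minimal sketch of an automated kill switch with timeout, where `inject` and `revert` are hypothetical callables you supply (e.g. wrappers around toxiproxy-cli or kubectl commands):

```python
import threading

class TimeBoxedInjection:
    """Run a fault injection with an automatic kill switch.

    `inject` and `revert` are caller-supplied callables (hypothetical here);
    the timer guarantees revert fires even if the operator walks away.
    Revert must be idempotent, since it may run twice.
    """
    def __init__(self, inject, revert, timeout_s: float):
        self.inject, self.revert, self.timeout_s = inject, revert, timeout_s
        self._timer = None

    def __enter__(self):
        # Arm the kill switch BEFORE injecting the failure
        self._timer = threading.Timer(self.timeout_s, self.revert)
        self._timer.start()
        self.inject()
        return self

    def __exit__(self, *exc):
        self._timer.cancel()
        self.revert()  # idempotent revert: safe even if the timer already fired
        return False
```

Usage: `with TimeBoxedInjection(add_toxic, remove_toxic, timeout_s=600): observe_metrics()` - the failure is removed at the end of the block or after 10 minutes, whichever comes first.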
Experiment Lifecycle
1. DEFINE -> Write steady state hypothesis + success/failure criteria
2. SCOPE -> Identify target environment, blast radius, and rollback mechanism
3. INSTRUMENT -> Confirm observability is in place to measure the hypothesis metric
4. RUN -> Inject failure; observe metric in real time
5. ANALYZE -> Did steady state hold? If not, why? What was the real failure mode?
6. IMPROVE -> Fix the gap. Update runbooks. Automate the experiment.
7. REPEAT -> Re-run to confirm the fix. Graduate to broader scope.
Failure Modes Taxonomy
| Category | Examples | Common tools |
|---|---|---|
| Network | Latency, packet loss, DNS failure, partition | tc netem, Toxiproxy, Gremlin |
| Resource | CPU saturation, memory pressure, disk full, fd exhaustion | stress-ng, Chaos Monkey |
| Dependency | Service unavailable, slow response, bad responses (500/400) | Wiremock, Litmus, FIS |
| Infrastructure | Pod kill, node drain, AZ outage, region failover | Chaos Monkey, Litmus, FIS |
| Application | Exception injection, clock skew, thread pool exhaustion | Byte Monkey, custom middleware |
| Data | Corrupt payload, missing field, schema mismatch | Custom fuzz harness |
Common tasks
Design a chaos experiment
Use this template to structure every experiment:
## Chaos Experiment: [Short Name]
**Date:** YYYY-MM-DD
**Hypothesis:**
When [failure condition], [service] will [expected behavior]
as evidenced by [metric staying within range].
**Steady State (before):**
- Metric: checkout.success_rate
- Baseline: >= 99.5%
- Measured via: Datadog SLO dashboard / Prometheus query
**Failure injection:**
- Tool: Toxiproxy / Litmus / AWS FIS
- Target: inventory-service, 1 of 5 pods
- Type: HTTP 503 response, 100% of requests to /api/stock
- Duration: 10 minutes
**Blast radius:**
- Scope: Single pod in staging environment
- Traffic affected: ~20% of inventory requests
- Kill switch: `kubectl delete chaosexperiment inventory-latency`
**Success criteria:**
- checkout.success_rate remains >= 99.5% during injection
- Circuit breaker opens within 30s
- Fallback (cached stock) is served to users
**Failure criteria:**
- checkout.success_rate drops below 99% for > 2 minutes
- Any user-visible 500 errors during injection
**Result:** [PASS / FAIL]
**Finding:** [What actually happened]
**Action:** [Ticket number + fix description]
Implement network latency injection
Inject latency at the network level using Linux Traffic Control (tc) or Toxiproxy
(application-level proxy). Prefer Toxiproxy for service-specific targeting; prefer tc
for host-level experiments.
Using Toxiproxy (service-level, recommended for staging):
# Install and start Toxiproxy (its admin API listens on port 8474 by default)
toxiproxy-server &
# Create a proxy for the downstream service; listen on a free port,
# not the 8474 admin API port
toxiproxy-cli create --listen 0.0.0.0:18080 --upstream inventory-svc:8080 inventory_proxy
# Add 200ms of latency with 50ms jitter to 100% of connections
toxiproxy-cli toxic add inventory_proxy \
  --type latency \
  --attribute latency=200 \
  --attribute jitter=50 \
  --toxicity 1.0
# Point your service at localhost:18080 instead of inventory-svc:8080
# ... run the experiment, observe metrics ...
# Remove the toxic (kill switch)
toxiproxy-cli toxic remove inventory_proxy --toxicName latency_downstream
Using tc netem (host-level, for infrastructure experiments):
# Add 300ms latency + 30ms jitter to all outbound traffic on eth0
sudo tc qdisc add dev eth0 root netem delay 300ms 30ms
# Switch to 10% packet loss (change replaces the existing netem rule)
sudo tc qdisc change dev eth0 root netem loss 10%
# Remove (kill switch)
sudo tc qdisc del dev eth0 root
Always test the kill switch before starting the experiment. A failed kill switch turns a chaos experiment into a real incident.
Simulate service dependency failure
Test what happens when a downstream service becomes unavailable. Use Wiremock or a simple mock server to return error responses:
// Using Wiremock (Java/Docker) - stub 100% 503s for /api/stock
{
"request": { "method": "GET", "urlPattern": "/api/stock/.*" },
"response": {
"status": 503,
"headers": { "Content-Type": "application/json" },
"body": "{\"error\": \"Service Unavailable\"}",
"fixedDelayMilliseconds": 5000
}
}
// Verify your circuit breaker opened:
// - Log line: "Circuit breaker OPEN for inventory-service"
// - Metric: circuit_breaker_state{service="inventory"} == 1
// - Fallback response served to callers
Checklist for dependency failure experiments:
- Circuit breaker opens within the configured threshold
- Fallback value or cached response is served (not a 500)
- Downstream errors do not propagate to user-facing error rate
- Circuit breaker closes when dependency recovers
- Alerting fires within SLO window, not after it
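Checklist items like "circuit breaker opens within the configured threshold" are easiest to assert with a small polling helper in the experiment harness. A minimal sketch; the `condition` callable is whatever check you wire up against your metrics endpoint (hypothetical here):

```python
import time

def wait_until(condition, timeout_s: float = 30.0, poll_s: float = 0.5) -> bool:
    """Poll `condition` until it returns True or the deadline passes.

    Example use in a chaos harness: assert that
    circuit_breaker_state == OPEN within 30s of injecting 503s.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_s)
    return condition()  # one final check at the deadline
```

A harness would then assert `wait_until(breaker_is_open, timeout_s=30)` during injection and `wait_until(breaker_is_closed, timeout_s=60)` after removing the fault.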
Run a game day - facilitation guide
A game day is a structured, cross-team exercise that rehearses failure scenarios. It combines chaos experiments with human coordination practice.
Preparation (2 weeks before):
- Choose a realistic scenario (e.g., "Primary database AZ goes down")
- Define the experiment scope and blast radius in writing
- Confirm on-call rotation and escalation paths are documented
- Brief all participants: on-call engineers, product owner, leadership observer
- Set up a dedicated incident Slack channel and shared dashboard link
Day-of agenda (3-hour format):
00:00 - 00:15 Kickoff: review scenario, confirm kill switches, assign roles
Roles: Incident Commander, Chaos Operator, Scribe, Observer
00:15 - 00:30 Baseline check: confirm steady state metrics look healthy
00:30 - 01:30 Inject failure; team responds as if it were a real incident
Scribe records every action and timestamp
01:30 - 01:45 Halt injection; confirm system recovers to steady state
01:45 - 02:30 Hot debrief: timeline walkthrough while memory is fresh
Key questions: What surprised you? Where were the gaps?
02:30 - 03:00 Action items: each gap gets a ticket, owner, and due date
Post-game day outputs:
- Updated runbook with gaps filled
- Action items tracked in a backlog with SLO-aligned due dates
- Recorded MTTR for the scenario (use as a benchmark for next game day)
- Decision on whether to automate the experiment in CI
Test database failover
Verify that your application correctly handles a primary database failover without data loss or extended downtime:
# 1. Confirm replication lag is near zero before starting
# psql -h replica -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
# 2. Start continuous writes to the primary (background process)
while true; do
psql -h primary -c "INSERT INTO chaos_probe (ts) VALUES (now());" 2>&1
sleep 0.5
done &
PROBE_PID=$!
# 3. Inject: promote the replica (or use your cloud provider's failover API)
# AWS RDS: aws rds failover-db-cluster --db-cluster-identifier my-cluster
# Manual: pg_ctl promote -D /var/lib/postgresql/data
# 4. Observe:
# - How long until the application reconnects?
# - Were any writes lost? (check probe table row count)
# - Did health checks detect the failover promptly?
# - Did connection pool recover without restart?
# 5. Kill the probe writer
kill $PROBE_PID
# 6. Measure:
# - Connection downtime: seconds between last successful write and first write to new primary
# - Data loss: rows missing from probe table
# - Recovery time: time until application traffic normalizes
Success criteria: Connection re-established within 30s, zero data loss, no application restart required.
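The measurements in step 6 can be computed directly from the probe table. A minimal sketch, assuming one probe row every 0.5s as in the write loop above:

```python
from datetime import datetime, timedelta

def analyze_probe(timestamps, interval=timedelta(seconds=0.5)):
    """Analyze chaos_probe timestamps read back after failover.

    Reports the longest gap between consecutive writes (a proxy for
    connection downtime) and how many writes went missing, assuming the
    probe inserted one row per `interval` whenever it could connect.
    """
    ts = sorted(timestamps)
    worst_gap = max((b - a for a, b in zip(ts, ts[1:])), default=timedelta(0))
    expected = int((ts[-1] - ts[0]) / interval) + 1 if ts else 0
    return {
        "downtime_s": worst_gap.total_seconds(),
        "missing_rows": max(0, expected - len(ts)),
    }
```

Feed it the `ts` column from `SELECT ts FROM chaos_probe ORDER BY ts` and compare `downtime_s` against the 30s success criterion.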
Implement circuit breaker validation
After implementing a circuit breaker, verify it actually works under failure conditions. This is the most commonly skipped verification step.
# Validation test: assert circuit breaker opens under failure threshold
import pytest
import time

# CircuitBreaker, CircuitOpenError, OPEN, and CLOSED come from your own
# breaker implementation (or library) under test

def test_circuit_breaker_opens_on_failure_threshold():
    cb = CircuitBreaker(threshold=5, reset_ms=30000)

    def failing_op():
        raise ConnectionError("downstream unavailable")

    # Exhaust the threshold
    for _ in range(5):
        with pytest.raises((ConnectionError, CircuitOpenError)):
            cb.call(failing_op)

    # Next call must fast-fail without calling the dependency
    call_count = 0

    def counting_op():
        nonlocal call_count
        call_count += 1
        return "ok"

    with pytest.raises(CircuitOpenError):
        cb.call(counting_op)
    assert call_count == 0, "Circuit breaker must NOT call the dependency when OPEN"
    assert cb.state == OPEN

def test_circuit_breaker_recovers_after_reset_timeout():
    cb = CircuitBreaker(threshold=5, reset_ms=100)  # 100ms for test speed
    # ... trip the breaker ...
    time.sleep(0.15)
    # Should transition to HALF-OPEN and allow one trial call
    result = cb.call(lambda: "ok")
    assert result == "ok"
    assert cb.state == CLOSED
Experiment to run in staging:
- Deploy with circuit breaker configured
- Use Toxiproxy to make the dependency return 503
- Verify: breaker opens within threshold, fallback activates, logs confirm state transitions
- Remove the toxic, verify: breaker moves to half-open, trial succeeds, breaker closes
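The tests above assume a `CircuitBreaker` exposing `threshold`, `reset_ms`, `call()`, and a `state` attribute. For illustration, here is a minimal single-threaded sketch of such a breaker - production libraries (resilience4j, pybreaker, etc.) add thread safety, sliding failure-rate windows, and metrics:

```python
import time

OPEN, CLOSED, HALF_OPEN = "open", "closed", "half_open"

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Minimal illustrative breaker: counts consecutive failures, opens at
    `threshold`, and allows one trial call after `reset_ms` elapses."""

    def __init__(self, threshold: int, reset_ms: int):
        self.threshold, self.reset_ms = threshold, reset_ms
        self.failures = 0
        self.state = CLOSED
        self.opened_at = 0.0

    def call(self, op):
        if self.state == OPEN:
            if (time.monotonic() - self.opened_at) * 1000 >= self.reset_ms:
                self.state = HALF_OPEN  # allow one trial call
            else:
                raise CircuitOpenError("fast-fail: breaker is open")
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.state == HALF_OPEN or self.failures >= self.threshold:
                self.state = OPEN
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = CLOSED
            return result
```

Note the design choice: a failed trial call in HALF_OPEN immediately re-opens the breaker and restarts the reset timer.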
Measure and improve MTTR
MTTR (Mean Time to Recovery) is the primary output metric of a chaos engineering program. Improve it by reducing each phase:
Incident timeline phases:
Detection - time from failure start to alert firing
Triage - time from alert to understanding root cause
Response - time from diagnosis to fix applied
Recovery - time from fix applied to steady state restored
MTTR = Detection + Triage + Response + Recovery
Measurement query (Prometheus example):
# Time from incident start (SLO breach) to recovery (SLO restored)
# Track this per incident type in a spreadsheet; compute rolling mean
# Alert on SLO burn rate (detection proxy):
(
rate(http_requests_total{status=~"5.."}[5m]) /
rate(http_requests_total[5m])
) > 0.01 # >1% error rate
Improvement levers by phase:
| Phase | Common gap | Fix |
|---|---|---|
| Detection | Alert fires 10 min after incident | Lower burn rate window; add synthetic monitors |
| Triage | Engineers don't know which runbook to use | Link runbook URL directly in alert body |
| Response | Fix requires manual steps | Automate the fix (restart script, failover trigger) |
| Recovery | Traffic does not drain back after fix | Add health check gates to deployment pipeline |
Track MTTR per failure category. A single average hides that your database failovers recover in 2 min but your certificate expiry incidents take 45 min.
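Per-category MTTR is a few lines of code once incident phase durations are recorded. A minimal sketch, using a hypothetical tuple format of (category, detection, triage, response, recovery) in seconds:

```python
from collections import defaultdict

def mttr_by_category(incidents):
    """incidents: list of (category, detection_s, triage_s, response_s, recovery_s).

    Returns mean total recovery time per failure category, so a fast
    database failover cannot mask a slow certificate-expiry incident.
    """
    totals = defaultdict(list)
    for category, *phases in incidents:
        totals[category].append(sum(phases))
    return {cat: sum(v) / len(v) for cat, v in totals.items()}
```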
Anti-patterns
| Anti-pattern | Why it's wrong | What to do instead |
|---|---|---|
| Running chaos in production before staging | Turns an experiment into an incident | Always validate hypothesis in staging first; graduate scope incrementally |
| No hypothesis before starting | Cannot distinguish experiment result from coincidence | Write the three-part hypothesis (condition, behavior, metric) before touching anything |
| Missing kill switch | Experiment cannot be stopped if it goes wrong | Test the kill switch before injecting; automate it with a timeout |
| Chaos without observability | Impossible to measure steady state deviation | Confirm dashboards and alerts are live before starting; abort if blind |
| One-time game days without automation | Resilience regresses between exercises | Automate the experiment; run in CI on every deploy or weekly schedule |
| Targeting production at full scale first | Single experiment can cause a real outage | Start with 1 pod / 1% traffic / 1 AZ; expand only after confirming safety |
Gotchas
**Kill switch first, experiment second** - The most common mistake is discovering the kill switch doesn't work only after the experiment has started causing damage. Always test the kill switch (e.g., `kubectl delete chaosexperiment`, `toxiproxy-cli toxic remove`) before injecting any failure.
**Observability blind spots** - If your metrics pipeline routes through the same service you're injecting failure into, you'll lose visibility at exactly the moment you need it most. Confirm that dashboards and alerting are independent of the experiment target before starting.
**Staging ≠ production behavior** - A hypothesis that holds in staging often fails in production due to traffic volume differences, connection pool sizing, or infrastructure configuration that only exists in prod. Graduate scope incrementally - don't treat a staging pass as proof production will hold.
**Circuit breaker misconfiguration in tests** - Unit tests often use a timeout of 0 or 1ms for the circuit breaker reset window to speed tests up. The production timeout may be 30 seconds. Validate circuit breaker behavior with production-realistic timeouts in at least one integration test.
**Experiment automation without human review** - Fully automated chaos experiments that run on every deploy are the goal, but skipping the review step when a new experiment type is added risks running untested blast-radius assumptions in production. Treat new experiment types as requiring manual approval for the first 2-3 runs.
References
For experiment catalogs, failure injection recipes, and advanced tooling guidance:
`references/experiment-catalog.md` - ready-to-use experiments organized by failure type
Only load the references file if the current task requires a specific experiment recipe.
References
experiment-catalog.md
Chaos Experiment Catalog
Ready-to-run experiment recipes organized by failure category. Each entry includes the failure type, injection method, steady state metric, and success criteria.
1. Network Failures
1.1 Service Latency Injection
Goal: Verify that downstream latency does not propagate into user-facing latency and that timeouts + circuit breakers engage correctly.
| Field | Value |
|---|---|
| Tool | Toxiproxy, tc netem, Gremlin |
| Target | Specific service-to-service TCP connection |
| Failure | 300-500ms added latency on 100% of connections |
| Steady state | User-facing p99 < 800ms |
| Success | Circuit breaker opens; fallback response served; user p99 stays under SLO |
# Toxiproxy: add 400ms latency to downstream
toxiproxy-cli toxic add my_proxy --type latency --attribute latency=400
# Verify circuit breaker metric: circuit_breaker_state{service="my-svc"} == 1
# Remove: toxiproxy-cli toxic remove my_proxy --toxicName latency_downstream
1.2 Packet Loss
Goal: Verify TCP retransmission behavior and connection resilience under lossy networks (common in multi-region or satellite link scenarios).
| Field | Value |
|---|---|
| Tool | tc netem |
| Target | Host network interface |
| Failure | 5-15% packet loss |
| Steady state | Service success rate >= 99% |
| Success | Retransmissions absorb loss; error rate remains within SLO |
sudo tc qdisc add dev eth0 root netem loss 10%
# Monitor: ss -s (retransmit counter), application error rate
# Kill switch: sudo tc qdisc del dev eth0 root
1.3 DNS Resolution Failure
Goal: Verify graceful degradation when DNS is unavailable and that services do not hard-crash or hang indefinitely on startup.
| Field | Value |
|---|---|
| Tool | iptables, CoreDNS config, Chaos Mesh |
| Target | DNS port 53 traffic to/from target service |
| Failure | Block UDP/TCP 53 for specific namespace |
| Steady state | Service resolves dependencies at startup |
| Success | Service uses cached DNS or returns graceful error; no hang on startup |
# Block DNS for a specific process or namespace
iptables -A OUTPUT -p udp --dport 53 -j DROP
# Kill switch: iptables -D OUTPUT -p udp --dport 53 -j DROP
1.4 Network Partition
Goal: Test split-brain behavior and distributed consensus handling when two service groups cannot communicate.
| Field | Value |
|---|---|
| Tool | iptables, Chaos Mesh NetworkChaos |
| Target | Two pods or AZs that must coordinate |
| Failure | Block all traffic between group A and group B |
| Steady state | Writes succeed, reads are consistent |
| Success | System correctly detects partition; writes rejected or quorum maintained |
# Chaos Mesh: NetworkChaos YAML
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition-test
spec:
  action: partition
  mode: fixed
  value: "1"
  selector:
    namespaces: [staging]
    labelSelectors:
      app: order-service
  direction: both
  target:
    selector:
      namespaces: [staging]
      labelSelectors:
        app: inventory-service
  duration: "5m"
2. Resource Exhaustion
2.1 CPU Saturation
Goal: Verify that CPU saturation on one pod does not cascade to request timeouts and that Kubernetes HPA scales within an acceptable window.
| Field | Value |
|---|---|
| Tool | stress-ng, Chaos Monkey for Containers |
| Target | Single pod in multi-replica deployment |
| Failure | Spin CPU to 90% utilization on target pod |
| Steady state | Service p99 < 500ms |
| Success | HPA triggers scale-out; load balancer routes away from hot pod; p99 stays under SLO |
# In pod: saturate 2 CPU cores for 5 minutes
stress-ng --cpu 2 --timeout 300s
# Watch HPA: kubectl get hpa -w
# Watch pod CPU: kubectl top pods
2.2 Memory Pressure
Goal: Verify OOMKill behavior: pod restarts cleanly, persistent state is not corrupted, and traffic shifts without dropping requests.
| Field | Value |
|---|---|
| Tool | stress-ng, Chaos Mesh StressChaos |
| Target | Single stateless service pod |
| Failure | Consume memory until OOMKill |
| Steady state | Zero OOMKills in production over 7 days |
| Success | Pod restarts; readiness probe prevents traffic until healthy; no user-visible errors |
# Consume 512MB of memory for 3 minutes
stress-ng --vm 1 --vm-bytes 512M --timeout 180s
# Watch: kubectl describe pod <name> | grep -A5 "Last State"
2.3 File Descriptor Exhaustion
Goal: Verify that connection leak detection and fd limit alerts fire before the service becomes unresponsive.
| Field | Value |
|---|---|
| Tool | Custom script (open many files/sockets) |
| Target | Application process |
| Failure | Open fd count approaches ulimit |
| Steady state | fd count < 70% of ulimit |
| Success | Alert fires; service degrades gracefully (rejects new connections with 503) before hard crash |
# Check current limit: ulimit -n
# Simulate: open many sockets in a tight loop (test environment only)
# Monitor: ls /proc/<pid>/fd | wc -l
3. Dependency Failures
3.1 Dependency Returns 503
Goal: Test that circuit breaker opens on 503s and fallback activates instead of propagating errors to callers.
| Field | Value |
|---|---|
| Tool | Wiremock, Toxiproxy (HTTP reset), Mountebank |
| Target | Downstream service HTTP endpoint |
| Failure | 100% of responses return 503 with 5s delay |
| Steady state | Caller error rate < 0.1% |
| Success | Circuit breaker opens within threshold; fallback serves callers; caller error rate stays < 0.1% |
// Wiremock stub returning 503
{
"request": { "method": "ANY", "urlPattern": "/api/.*" },
"response": {
"status": 503,
"fixedDelayMilliseconds": 5000,
"body": "{\"error\": \"Service Unavailable\"}"
}
}
3.2 Slow Third-Party API
Goal: Verify that an external API slowdown does not tie up internal thread pools and that timeouts are set correctly.
| Field | Value |
|---|---|
| Tool | Toxiproxy (latency toxic), Charles Proxy |
| Target | Outbound HTTPS connection to external API |
| Failure | 10s latency added to all responses (beyond configured timeout) |
| Steady state | Thread pool utilization < 60% |
| Success | Requests to external API time out (not hang indefinitely); thread pool does not exhaust; users see graceful error |
toxiproxy-cli toxic add external_api_proxy \
--type latency \
--attribute latency=10000 \
--toxicity 1.0
# Verify: application logs show TimeoutError, not hung threads
3.3 Message Queue Unavailability
Goal: Test producer and consumer behavior when Kafka/RabbitMQ/SQS is unreachable.
| Field | Value |
|---|---|
| Tool | iptables (block broker port), Stop broker container |
| Target | Message broker port (9092 for Kafka, 5672 for RabbitMQ) |
| Failure | Block TCP connections to broker for 5 minutes |
| Steady state | Message processing rate > 1000 msg/s |
| Success | Producers buffer or retry with backoff; consumers pause and resume without data loss; no infinite retry storms |
# Block Kafka broker port
iptables -A OUTPUT -p tcp --dport 9092 -j DROP
# Observe producer: check for error logs, retry metrics, DLQ activity
# Kill switch: iptables -D OUTPUT -p tcp --dport 9092 -j DROP
4. Infrastructure Failures
4.1 Pod Kill (Chaos Monkey Style)
Goal: Verify that random pod termination does not cause user-visible outages in a properly configured multi-replica deployment.
| Field | Value |
|---|---|
| Tool | Chaos Monkey for Kubernetes, Litmus PodDelete, kubectl delete |
| Target | Random pod from a deployment with >= 3 replicas |
| Failure | Kill 1 pod every 60 seconds for 10 minutes |
| Steady state | Service availability >= 99.9% |
| Success | Availability stays >= 99.9%; killed pods restart within 30s; no data loss |
# Manual: kill a random pod
kubectl delete pod \
$(kubectl get pods -l app=my-service -o name | shuf | head -1) \
--grace-period=0 --force
# Litmus PodDelete experiment:
kubectl apply -f https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/pod-delete/experiment.yaml
4.2 Node Drain (Zone Simulation)
Goal: Simulate loss of one availability zone by draining all pods from a single node or set of nodes tagged with a specific AZ.
| Field | Value |
|---|---|
| Tool | kubectl drain, AWS FIS |
| Target | 1 of 3 AZ nodes (staging cluster) |
| Failure | Cordon + drain target node(s) |
| Steady state | Cross-AZ traffic balanced; availability >= 99.9% |
| Success | Workloads reschedule to healthy nodes within pod disruption budget; traffic redistributes; no extended downtime |
# Cordon the node (no new pods)
kubectl cordon node-az-b-01
# Drain (evict existing pods)
kubectl drain node-az-b-01 --ignore-daemonsets --delete-emptydir-data --grace-period=30
# Monitor: kubectl get pods -o wide -w (watch rescheduling)
# Restore: kubectl uncordon node-az-b-01
4.3 AWS AZ Failover (FIS)
Goal: Test full AZ failover behavior in an AWS-managed environment using the native Fault Injection Simulator.
| Field | Value |
|---|---|
| Tool | AWS Fault Injection Simulator |
| Target | EC2 instances / ECS tasks in a single AZ |
| Failure | Terminate 50% of instances tagged with az: us-east-1b |
| Steady state | Multi-AZ health check passing |
| Success | ALB stops routing to failed AZ within 30s; traffic redistributes to remaining AZs; RDS promotes replica |
// AWS FIS experiment template (abbreviated)
{
  "description": "Terminate EC2 instances in us-east-1b",
  "targets": {
    "az-b-instances": {
      "resourceType": "aws:ec2:instance",
      "filters": [
        { "path": "Placement.AvailabilityZone", "values": ["us-east-1b"] },
        { "path": "State.Name", "values": ["running"] }
      ],
      "selectionMode": "PERCENT(50)"
    }
  },
  "actions": {
    "terminate-instances": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "az-b-instances" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:...error-rate-high" }
  ]
}
5. Application-Level Failures
5.1 Exception Injection
Goal: Test error handling paths that are rarely exercised in normal operation by injecting exceptions at the code level.
| Field | Value |
|---|---|
| Tool | Byte Buddy (Java), custom middleware, feature flags |
| Target | Specific code path (e.g., payment processing function) |
| Failure | Throw exception on N% of calls |
| Steady state | Error rate < 0.1% |
| Success | Exception is caught; user receives graceful error; alert fires; no unhandled rejection |
// Feature-flag-based exception injection (Node.js middleware)
function chaosMiddleware(req, res, next) {
  const chaosRate = parseFloat(process.env.CHAOS_EXCEPTION_RATE || '0');
  if (chaosRate > 0 && Math.random() < chaosRate) {
    throw new Error('[CHAOS] Injected exception for resilience testing');
  }
  next();
}
// Set CHAOS_EXCEPTION_RATE=0.05 for 5% injection in staging
5.2 Clock Skew
Goal: Test behavior of distributed systems when nodes disagree on wall clock time. Critical for JWT expiry, event ordering, and distributed locks.
| Field | Value |
|---|---|
| Tool | date command, faketime, Chaos Mesh TimeChaos |
| Target | Single service pod |
| Failure | Advance clock by 10 minutes on target pod |
| Steady state | All JWTs and cache TTLs valid |
| Success | Service detects clock skew; JWTs are not prematurely expired for other nodes; distributed lock behaves correctly |
# Chaos Mesh TimeChaos
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-test
spec:
  mode: one
  selector:
    namespaces: [staging]
    labelSelectors:
      app: auth-service
  timeOffset: "+10m"
  duration: "5m"
Tool Selection Matrix
| Scenario | Best Tool | Alternative |
|---|---|---|
| Kubernetes pod/node failures | Litmus, Chaos Mesh | kubectl delete |
| Network latency/packet loss (service-level) | Toxiproxy | Chaos Mesh NetworkChaos |
| Network latency (host-level) | tc netem | Gremlin |
| AWS infrastructure failures | AWS FIS | Chaos Monkey for ECS |
| Multi-cloud or SaaS managed | Gremlin | - |
| CPU/memory stress | stress-ng, Chaos Mesh StressChaos | Gremlin |
| Application exception injection | Feature flags + custom middleware | Byte Buddy (JVM) |
| External API simulation | Wiremock, Mountebank | WireMock Cloud |
Experiment Graduation Checklist
Before graduating any experiment from staging to production:
- Hypothesis was validated in staging (steady state held or gap was found and fixed)
- Blast radius is documented and smaller than in staging run
- Kill switch is tested and confirmed to work in < 30 seconds
- On-call engineer is aware and monitoring during the experiment
- Rollback procedure is documented in the experiment record
- Observability dashboards are confirmed live for all steady state metrics
- Stop condition (automatic abort) is configured if available (e.g., AWS FIS stop condition)
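The graduation checklist can double as an automated gate in an experiment runner. A minimal sketch with illustrative item names; the boolean values come from whatever automation or human sign-off you wire in:

```python
# Illustrative pre-flight gate: item names mirror the graduation checklist
# above and are not part of any specific tool's API.
REQUIRED = [
    "hypothesis_validated_in_staging",
    "blast_radius_documented",
    "kill_switch_tested_under_30s",
    "oncall_aware_and_monitoring",
    "rollback_documented",
    "dashboards_live",
]

def preflight(checks: dict) -> list:
    """Return the list of unmet requirements; empty means cleared to run."""
    return [item for item in REQUIRED if not checks.get(item, False)]
```

A runner would refuse to start the production experiment unless `preflight(checks) == []`.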