observability
Use this skill when implementing logging, metrics, distributed tracing, alerting, or defining SLOs. Triggers on structured logging, Prometheus, Grafana, OpenTelemetry, Datadog, distributed tracing, error tracking, dashboards, alert fatigue, SLIs, SLOs, error budgets, and any task requiring system observability or monitoring setup.
observability
observability is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex that helps you implement logging, metrics, distributed tracing, alerting, and SLOs.
Quick Facts
| Field | Value |
|---|---|
| Category | monitoring |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill observability
- The observability skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
Observability is the ability to understand what a system is doing from the outside by examining its outputs - without needing to modify the system or guess at internals. The three pillars are logs (what happened), metrics (how the system is performing), and traces (where time is spent across service boundaries). These pillars are only useful when correlated - a spike in your p99 metric should link to traces, and those traces should link to logs. Invest in correlation from day one, not as a retrofit.
Tags
observability logging metrics tracing alerting slo
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is observability?
Observability is the ability to understand what a system is doing from the outside by examining its outputs - logs, metrics, and traces. This skill applies when implementing structured logging, Prometheus/Grafana metrics, OpenTelemetry or Datadog instrumentation, distributed tracing, dashboards, alerting, SLIs/SLOs, and error budgets.
How do I install observability?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill observability in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support observability?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Observability
Observability is the ability to understand what a system is doing from the outside by examining its outputs - without needing to modify the system or guess at internals. The three pillars are logs (what happened), metrics (how the system is performing), and traces (where time is spent across service boundaries). These pillars are only useful when correlated - a spike in your p99 metric should link to traces, and those traces should link to logs. Invest in correlation from day one, not as a retrofit.
When to use this skill
Trigger this skill when the user:
- Adds structured logging to a service (pino, winston, log4j, Python logging)
- Instruments code with OpenTelemetry or a vendor SDK (Datadog, New Relic, Honeycomb)
- Defines SLIs, SLOs, or error budgets for a service
- Builds a Grafana or Datadog dashboard
- Writes Prometheus alerting rules or configures PagerDuty/Opsgenie routing
- Implements distributed tracing (spans, context propagation, sampling)
- Responds to alert fatigue or on-call burnout
- Tracks an incident and needs to correlate logs/traces/metrics
Do NOT trigger this skill for:
- Pure infrastructure provisioning (Terraform, Kubernetes YAML) - those are ops/IaC concerns
- Application performance profiling of CPU/memory at the code level (use a performance-engineering skill)
Key principles
Structured logging always - Every log line should be machine-parseable JSON with consistent fields. Plain-text logs cannot be queried, filtered, or aggregated at scale. Correlation IDs are non-negotiable.
USE for resources, RED for services - Resources (CPU, memory, connections) are measured with Utilization/Saturation/Errors. Services (APIs, queues) are measured with Rate/Errors/Duration. Knowing which method applies tells you which metrics to instrument before you write a single line of code.
Instrument at boundaries - Service ingress/egress, database calls, external HTTP calls, and message queue produce/consume operations are the minimum instrumentation surface. Everything else is optional until proven necessary.
Alert on symptoms, not causes - Alert when users are impacted (high error rate, high latency). Do not page on CPU at 80% or a memory warning - those are causes to investigate, not symptoms to wake someone up for.
SLOs drive decisions - Every reliability trade-off should reference an error budget. If budget is healthy, ship features. If budget is burning, stop and fix reliability. SLOs without error budgets are just numbers on a slide.
Core concepts
The three pillars
| Pillar | Question answered | What it gives you |
|---|---|---|
| Logs | What happened? | Detailed event records, debug context, audit trails |
| Metrics | How is the system performing? | Aggregated numbers over time, dashboards, alerting |
| Traces | Where did time go? | Request flow across services, latency attribution |
Cardinality
Every unique combination of label values in a metric creates a new time series in your
metrics backend. user_id as a metric label will create millions of time series and
kill Prometheus. Keep metric label cardinality under ~100 unique values per label.
Use logs or traces for high-cardinality data (user IDs, request IDs, emails).
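A cheap way to stay under that cardinality budget is to bucket raw values into a small fixed set before using them as labels. A minimal sketch (the `statusClass` helper and the commented counter call are illustrative, not from any SDK):

```typescript
// Collapse HTTP status codes into a bounded label set: "2xx".."5xx".
// The raw code (or a user ID) belongs on the span or log line instead.
function statusClass(statusCode: number): string {
  return `${Math.floor(statusCode / 100)}xx`;
}

// Usage with an OTel counter (hypothetical metric name):
// httpRequestsTotal.add(1, { status_class: statusClass(res.statusCode) });
```

The same bucketing idea applies to durations (latency buckets), paths (route templates instead of raw URLs), and tenants (tier instead of tenant ID).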
Exemplars
Exemplars are trace IDs embedded in metric data points. When you see a p99 spike on a histogram, an exemplar lets you jump directly to a trace that caused it. OpenTelemetry and Prometheus support exemplars natively. Enable them - they are the bridge between metrics and traces.
Context propagation
Context propagation is the mechanism by which a trace ID flows through service boundaries.
The W3C traceparent header is the standard format. Every service must: extract the
header on ingress, attach it to async context, and inject it into all outbound calls.
Failing to propagate breaks trace continuity silently.
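To make the propagation requirement concrete, here is a minimal parser for the W3C traceparent format (version-traceid-spanid-flags, lowercase hex). In practice the OTel SDK's propagator handles this; the sketch is only to show what flows across the wire:

```typescript
interface TraceParent {
  version: string;
  traceId: string;   // 32 hex chars, not all zeros
  spanId: string;    // 16 hex chars, not all zeros
  sampled: boolean;  // low bit of the flags byte
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!m) return null;
  if (/^0+$/.test(m[2]) || /^0+$/.test(m[3])) return null; // all-zero IDs are invalid per spec
  return {
    version: m[1],
    traceId: m[2],
    spanId: m[3],
    sampled: (parseInt(m[4], 16) & 1) === 1,
  };
}
```

Note that the sampled flag travels with the context, which is why a child service inherits its parent's sampling decision.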
SLI / SLO / Error budget
- SLI (Service Level Indicator): A measurement of service behavior users care about. Example: successful_requests / total_requests
- SLO (Service Level Objective): A target for an SLI over a time window. Example: 99.9% of requests succeed within 300ms, measured over 30 days
- Error budget: 1 - SLO. For a 99.9% SLO, the budget is 0.1% - about 43 minutes of downtime per month. Burn rate measures how fast you consume it.
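The budget arithmetic is worth making executable - a sketch (the function names are mine):

```typescript
// Minutes of full downtime a window's error budget allows.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// How fast the budget is being consumed (1 = exactly on budget).
function burnRate(observedErrorRate: number, sloTarget: number): number {
  return observedErrorRate / (1 - sloTarget);
}

// errorBudgetMinutes(0.999, 30) ~ 43.2 -> the "about 43 minutes" above
// burnRate(0.0144, 0.999) ~ 14.4     -> the fast-burn paging threshold
```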
Common tasks
Set up structured logging
Use pino for Node.js (fastest), winston for flexibility. Always include a correlation
ID middleware that attaches traceId to every log automatically.
// logger.ts - pino with correlation ID support
import pino from 'pino';
export const logger = pino({
level: process.env.LOG_LEVEL ?? 'info',
base: {
service: process.env.SERVICE_NAME ?? 'unknown',
version: process.env.SERVICE_VERSION ?? '0.0.0',
},
timestamp: pino.stdTimeFunctions.isoTime,
redact: ['req.headers.authorization', 'body.password', 'body.token'],
});
// Express middleware - binds traceId to every child logger in the request scope
export function loggerMiddleware(req: Request, res: Response, next: NextFunction) {
const traceId = req.headers['traceparent'] as string
?? req.headers['x-request-id'] as string
?? crypto.randomUUID();
req.log = logger.child({ traceId, method: req.method, path: req.path });
res.setHeader('x-request-id', traceId);
next();
}

// Usage in a route handler
app.post('/orders', async (req, res) => {
  const start = Date.now();
  const body = req.body;
  req.log.info({ orderId: body.id }, 'Processing order');
  try {
    const result = await orderService.create(body);
    req.log.info({ orderId: result.id, durationMs: Date.now() - start }, 'Order created');
    res.json(result);
  } catch (err) {
    req.log.error({ err, orderId: body.id }, 'Order creation failed');
    res.status(500).json({ error: 'internal_error' });
  }
});

Instrument with OpenTelemetry
Use the Node.js SDK with auto-instrumentation for HTTP, Express, and common DB clients. Add manual spans only for business-critical operations.
// instrumentation.ts - must be loaded before any other module (Node --require flag)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { ParentBasedSampler, TraceIdRatioBased } from '@opentelemetry/sdk-trace-node';
const sdk = new NodeSDK({
serviceName: process.env.SERVICE_NAME ?? 'my-service',
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318/v1/traces',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter(),
exportIntervalMillis: 15_000,
}),
sampler: new ParentBasedSampler({
root: new TraceIdRatioBased(0.1), // 10% head-based sampling
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

// Manual span for a business operation
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
async function processPayment(orderId: string, amount: number) {
return tracer.startActiveSpan('payment.process', async (span) => {
span.setAttributes({ 'order.id': orderId, 'payment.amount': amount });
try {
const result = await stripe.charges.create({ amount, currency: 'usd' });
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (err) {
span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
span.recordException(err as Error);
throw err;
} finally {
span.end();
}
});
}

Load instrumentation.ts before your app with node --require ./dist/instrumentation.js server.js. See references/opentelemetry-setup.md for exporters, processors, and Python setup.
Define SLIs and SLOs
Define SLIs from the user's perspective first, then map to metrics you can measure.
# slos.yaml - document alongside your service
service: order-api
slos:
# Availability: are requests succeeding?
- name: availability
description: Fraction of requests that return non-5xx responses
sli: successful_requests / total_requests # status < 500
target: 99.9%
window: 30d
error_budget_minutes: 43.8
# Latency: are requests fast enough?
- name: latency-p99
description: 99th percentile of request duration under 500ms
sli: requests_under_500ms / total_requests
target: 99.0%
window: 30d
# Correctness: are responses valid? (measured via synthetic probes or sampling)
- name: correctness
description: Fraction of order confirmations that pass integrity check
sli: valid_order_confirmations / total_order_confirmations
target: 99.95%
window: 30d

SLO burn rate formulas:
error_budget = 1 - slo_target # 0.001 for 99.9%
burn_rate = observed_error_rate / error_budget
time_to_exhaustion = window_hours / burn_rate
# Fast burn (page now): 14.4x - exhausts 30d budget in 2 days
# Slow burn (ticket): 3x - exhausts 30d budget in 10 days

Create effective dashboards
Use the RED method layout. Eight to twelve panels per dashboard. Link to detail dashboards for drill-down rather than putting everything on one page.
Dashboard layout - <ServiceName> Overview
Row 1: [SLO Status: availability] [Error Budget: X% remaining] [Latency p99 SLO]
Row 2: [Request Rate (rps)] [Error Rate (%)] [Latency p50 / p95 / p99]
Row 3: [Errors by type/endpoint] [Top slow endpoints] [Upstream dependency latency]
Row 4: [CPU / Memory] [DB connection pool] [Queue depth / lag]

Grafana panel guidelines:
- Latency: use histogram_quantile, show p50/p95/p99 on same panel
- Error rate: rate(errors_total[5m]) / rate(requests_total[5m])
- Add deploy annotations (vertical lines) so you can correlate deployments with incidents
- Set panel thresholds to match your SLO targets (green/yellow/red)
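For the latency panel, the usual PromQL shape - assuming a `http_request_duration_seconds` histogram; substitute your own metric name - can be generated per quantile:

```typescript
// Build the histogram_quantile query for one quantile;
// run it for 0.5, 0.95, and 0.99 on the same Grafana panel.
const latencyQuantile = (q: number): string => `
histogram_quantile(${q},
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)`;
```

Summing by `le` before taking the quantile aggregates across instances while keeping the bucket boundaries the quantile estimate needs.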
Set up alerting without alert fatigue
Define severity tiers before writing a single rule. Map each tier to a routing target.
# Example Prometheus alerting rules (alerts.yaml)
groups:
- name: order-api.slo
rules:
# P1: fast burn - exhausts 30d budget in 2 days
- alert: HighErrorBudgetBurn
expr: |
(
rate(http_requests_errors_total[1h]) /
rate(http_requests_total[1h])
) > (14.4 * 0.001)
for: 2m
labels:
severity: p1
team: platform
annotations:
summary: "Error budget burning at 14x+ rate"
runbook: "https://runbooks.internal/order-api/high-error-burn"
dashboard: "https://grafana.internal/d/order-api"
# P3: slow burn - ticket, investigate during business hours
- alert: SlowErrorBudgetBurn
expr: |
(
rate(http_requests_errors_total[6h]) /
rate(http_requests_total[6h])
) > (3 * 0.001)
for: 1h
labels:
severity: p3
team: platform
annotations:
summary: "Error budget burning at 3x rate - investigate during business hours"

Routing rules (Opsgenie / PagerDuty):
severity=p1 -> Page primary on-call immediately
severity=p2 -> Page primary on-call during business hours, silent at night
severity=p3 -> Create Jira ticket, no page
severity=p4 -> Slack notification only

Every alert must have: a runbook link, an owner team, and a dashboard link. If an alert fires and nobody knows what to do, the runbook is missing.
Implement distributed tracing
Instrument at service boundaries. Propagate context via W3C traceparent. Add attributes
that make traces searchable (user ID, order ID, tenant ID - as trace attributes, not
metric labels).
// Propagate context in outbound HTTP calls (fetch wrapper)
import { context, propagation } from '@opentelemetry/api';
async function tracedFetch(url: string, options: RequestInit = {}): Promise<Response> {
const headers: Record<string, string> = {
...(options.headers as Record<string, string>),
};
// Inject W3C traceparent + tracestate headers
propagation.inject(context.active(), headers);
return fetch(url, { ...options, headers });
}
// Propagate context from inbound messages (e.g. SQS / Kafka)
import { propagation, ROOT_CONTEXT } from '@opentelemetry/api';
function processMessage(message: QueueMessage) {
// Extract trace context from message attributes
const parentContext = propagation.extract(ROOT_CONTEXT, message.attributes ?? {});
return context.with(parentContext, () => {
return tracer.startActiveSpan('queue.process', (span) => {
span.setAttributes({ 'messaging.message_id': message.id });
// ... process message
span.end();
});
});
}

Span attribute conventions (OpenTelemetry semantic conventions):
- HTTP: http.method, http.status_code, http.route, net.peer.name
- DB: db.system, db.name, db.operation, db.statement (sanitized)
- Business: order.id, user.id, payment.method (custom namespace)
Monitor error budgets and act on burn rates
Track burn rate over multiple windows to distinguish spikes from trends.
// Burn rate queries (Prometheus / Grafana)
// 1-hour burn rate (catches fast incidents)
const fastBurnRate = `
(
sum(rate(http_requests_errors_total[1h])) /
sum(rate(http_requests_total[1h]))
) / 0.001
`;
// 6-hour burn rate (catches slow degradations)
const slowBurnRate = `
(
sum(rate(http_requests_errors_total[6h])) /
sum(rate(http_requests_total[6h]))
) / 0.001
`;
// Remaining error budget (30-day rolling)
const budgetRemaining = `
1 - (
sum(increase(http_requests_errors_total[30d])) /
sum(increase(http_requests_total[30d]))
) / 0.001
`;

Act on burn rates:
| Burn rate | Action |
|---|---|
| > 14.4x (1h window) | Page immediately, declare incident |
| > 6x (6h window) | Page during business hours |
| > 3x (24h window) | Create reliability ticket, add to next sprint |
| < 1x | Budget healthy, normal feature development |
| Budget < 10% remaining | Freeze non-critical deploys, focus on reliability |
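The multi-window rows of that table can be encoded as one policy function - a sketch using the thresholds above (the function and type names are mine; the deploy-freeze rule needs the separate budget-remaining query and is omitted here):

```typescript
type BurnAction = 'page-now' | 'page-business-hours' | 'ticket' | 'none';

// Map burn rates measured over three windows to the on-call action.
function burnRateAction(burn1h: number, burn6h: number, burn24h: number): BurnAction {
  if (burn1h > 14.4) return 'page-now';             // fast burn: incident
  if (burn6h > 6) return 'page-business-hours';     // sustained degradation
  if (burn24h > 3) return 'ticket';                 // slow burn: next sprint
  return 'none';                                    // budget healthy
}
```

Checking the fastest window first means a genuine incident always pages, even while the slower windows are still catching up.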
Anti-patterns / common mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Logging unstructured plain text | Cannot be searched or aggregated at scale | Emit JSON with consistent fields and correlation ID |
| High-cardinality metric labels (user_id, request_id) | Creates millions of time series, kills Prometheus | Keep cardinality < 100 per label; use traces for high-cardinality data |
| Alerting on causes (CPU > 80%) | Wakes humans for non-user-impacting events | Alert on symptoms (error rate, latency SLO burn) |
| No sampling strategy for traces | 100% trace collection at scale is cost-prohibitive | Start at 10% head-based, add tail-based for errors |
| SLOs without error budgets | SLO becomes a vanity target with no operational consequence | Define budget, burn rate thresholds, and what changes at each level |
| Missing runbooks on alerts | On-call doesn't know what to do, wasted time in incidents | Every alert ships with a runbook before it goes to production |
Gotchas
Cardinality explosion kills Prometheus - Adding a label with high cardinality (user_id, request_id, IP address) creates a new time series per unique value. A single bad label can OOM a Prometheus instance overnight. Always check cardinality before adding labels; use traces or logs for high-cardinality data.
Context propagation breaks at async boundaries - In Node.js, if you use setTimeout, setImmediate, or create a new Promise chain without explicitly passing context.active(), the trace context is lost and spans appear as orphan roots. Use AsyncLocalStorage-aware frameworks or manually propagate context with context.with(ctx, fn).
100% trace sampling in production is unsustainable - At any real scale, sampling every trace destroys budget and storage. Start at 10% head-based sampling with tail-based sampling for errors. The default AlwaysOnSampler in OTel SDKs is NOT suitable for production.
SLO burn rate alerts on short windows produce noise - A single spike in errors can trigger a "fast burn" alert that resolves in minutes. Pair fast-window alerts (1h) with slow-window alerts (6h) using multi-window alerting. Alert only when both windows exceed the threshold simultaneously.
Structured logging without redaction leaks secrets - pino and winston log entire objects by default. Passing req or body without a redact config will log Authorization headers, passwords, and tokens in plain text. Always configure the redact option before shipping to production.
References
references/opentelemetry-setup.md - OTel SDK setup for Node.js and Python, exporters, processors, and sampling configuration
Load the references file when the task involves wiring up OpenTelemetry from scratch, configuring exporters, or setting up the collector pipeline. The skill above is enough for instrumentation patterns and SLO definitions.
References
opentelemetry-setup.md
OpenTelemetry SDK Setup Reference
OpenTelemetry (OTel) is the vendor-neutral standard for telemetry instrumentation. This reference covers the SDK setup for Node.js and Python, including exporters, processors, and samplers. For instrumentation patterns and SLO definitions, refer to the main SKILL.md.
Architecture overview
Your service
|
| SDK instruments code (traces, metrics, logs)
v
OTel SDK (in-process)
| BatchSpanProcessor -> buffers spans, exports in batches
| PeriodicExportingMetricReader -> exports metrics every N seconds
v
Exporter (OTLP over HTTP or gRPC)
v
OTel Collector (recommended - decouples your app from backend)
| Receivers: otlp
| Processors: batch, memory_limiter, resource
| Exporters: jaeger, prometheus, datadog, honeycomb, etc.
v
Observability backend (Jaeger, Grafana Tempo, Datadog, Honeycomb)

Always use the OTel Collector in production. It buffers, retries, and lets you change backends without redeploying your service.
Node.js SDK
Installation
npm install \
@opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/exporter-metrics-otlp-http \
@opentelemetry/sdk-metrics \
@opentelemetry/sdk-trace-node

Full SDK setup (TypeScript)
// src/instrumentation.ts
// MUST be loaded before any other module.
// Run with: node --require ./dist/instrumentation.js ./dist/server.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { BatchSpanProcessor, ParentBasedSampler, TraceIdRatioBased } from '@opentelemetry/sdk-trace-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION, SEMRESATTRS_DEPLOYMENT_ENVIRONMENT } from '@opentelemetry/semantic-conventions';
const resource = Resource.default().merge(
new Resource({
[SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'unknown-service',
[SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
[SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? 'development',
})
);
const traceExporter = new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT
? `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces`
: 'http://localhost:4318/v1/traces',
headers: {
// For vendors like Honeycomb or Grafana Cloud that need auth:
...(process.env.OTEL_EXPORTER_OTLP_HEADERS
? Object.fromEntries(
process.env.OTEL_EXPORTER_OTLP_HEADERS.split(',').map((h) => h.split('='))
)
: {}),
},
});
const metricExporter = new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT
? `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/metrics`
: 'http://localhost:4318/v1/metrics',
});
const sdk = new NodeSDK({
resource,
spanProcessor: new BatchSpanProcessor(traceExporter, {
maxQueueSize: 2048,
maxExportBatchSize: 512,
scheduledDelayMillis: 5000,
exportTimeoutMillis: 30_000,
}),
metricReader: new PeriodicExportingMetricReader({
exporter: metricExporter,
exportIntervalMillis: 15_000,
exportTimeoutMillis: 10_000,
}),
sampler: new ParentBasedSampler({
// Accept parent's sampling decision; for root spans, sample 10%
root: new TraceIdRatioBased(
Number(process.env.OTEL_TRACES_SAMPLER_ARG ?? '0.1')
),
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false }, // noisy, disable by default
'@opentelemetry/instrumentation-http': {
ignoreIncomingRequestHook: (req) =>
// Suppress health check traces
req.url === '/health' || req.url === '/ready',
},
'@opentelemetry/instrumentation-pg': { enhancedDatabaseReporting: false }, // avoid logging SQL params
}),
],
});
sdk.start();
process.on('SIGTERM', async () => {
await sdk.shutdown();
process.exit(0);
});
process.on('SIGINT', async () => {
await sdk.shutdown();
process.exit(0);
});

Custom metrics (Node.js)
// metrics.ts
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('order-service', '1.0.0');
// Counter - value only goes up
export const ordersProcessed = meter.createCounter('orders_processed_total', {
description: 'Total number of orders processed',
unit: '{order}',
});
// Histogram - for latency and size distributions
export const orderProcessingDuration = meter.createHistogram(
'order_processing_duration_seconds',
{
description: 'Time to process an order end-to-end',
unit: 's',
advice: {
// Custom bucket boundaries for your latency profile
explicitBucketBoundaries: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
},
}
);
// Observable gauge - for values that are polled, not recorded on events
const activeConnections = meter.createObservableGauge('db_active_connections', {
description: 'Current active database connections',
});
activeConnections.addCallback((result) => {
result.observe(pool.totalCount - pool.idleCount, { pool: 'primary' });
});
// Usage
export async function processOrder(order: Order) {
const start = performance.now();
try {
const result = await doProcess(order);
ordersProcessed.add(1, { status: 'success', tier: order.tier });
return result;
} catch (err) {
ordersProcessed.add(1, { status: 'error', tier: order.tier });
throw err;
} finally {
orderProcessingDuration.record((performance.now() - start) / 1000, {
tier: order.tier,
});
}
}

Python SDK
Installation
pip install \
opentelemetry-sdk \
opentelemetry-exporter-otlp-proto-http \
opentelemetry-instrumentation-flask \
opentelemetry-instrumentation-requests \
opentelemetry-instrumentation-sqlalchemy

Full SDK setup (Python / Flask)
# instrumentation.py - import this before your app module
import os
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio
OTLP_ENDPOINT = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
resource = Resource.create({
SERVICE_NAME: os.getenv("SERVICE_NAME", "unknown-service"),
SERVICE_VERSION: os.getenv("SERVICE_VERSION", "0.0.0"),
"deployment.environment": os.getenv("FLASK_ENV", "development"),
})
# Traces
sampler = ParentBasedTraceIdRatio(float(os.getenv("OTEL_TRACES_SAMPLER_ARG", "0.1")))
tracer_provider = TracerProvider(resource=resource, sampler=sampler)
tracer_provider.add_span_processor(
BatchSpanProcessor(
OTLPSpanExporter(endpoint=f"{OTLP_ENDPOINT}/v1/traces"),
max_queue_size=2048,
max_export_batch_size=512,
schedule_delay_millis=5000,
)
)
trace.set_tracer_provider(tracer_provider)
# Metrics
metric_reader = PeriodicExportingMetricReader(
OTLPMetricExporter(endpoint=f"{OTLP_ENDPOINT}/v1/metrics"),
export_interval_millis=15_000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# app.py
import instrumentation # noqa: F401 - must be first import
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
app = Flask(__name__)
# Auto-instrumentation for Flask, outbound HTTP, and SQLAlchemy
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=db.engine)

# Manual spans and custom metrics in Python
from opentelemetry import trace, metrics
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer("order-service")
meter = metrics.get_meter("order-service", "1.0.0")
orders_counter = meter.create_counter(
"orders_processed_total",
description="Total orders processed",
unit="order",
)
order_duration = meter.create_histogram(
"order_processing_duration_seconds",
description="Order processing duration",
unit="s",
)
def process_payment(order_id: str, amount: float):
with tracer.start_as_current_span("payment.process") as span:
span.set_attributes({
"order.id": order_id,
"payment.amount": amount,
})
try:
result = stripe_client.charge(amount)
span.set_status(Status(StatusCode.OK))
return result
except Exception as e:
span.set_status(Status(StatusCode.ERROR, str(e)))
span.record_exception(e)
raise

Sampler reference
| Sampler | Description | Use case |
|---|---|---|
| AlwaysOn | Sample 100% of traces | Dev / low-traffic services |
| AlwaysOff | Sample nothing | Disable tracing without code change |
| TraceIdRatioBased(0.1) | Sample 10% of root spans deterministically | Production baseline |
| ParentBasedSampler(root) | Respect parent decision; use root for new traces | Production (recommended) |
| Tail-based (Collector) | Collect all spans, decide after trace completes | Catching errors/slow traces |
Tail-based sampling requires the OTel Collector with the tailsampling processor.
Example Collector config:
# otel-collector.yaml (partial)
processors:
tail_sampling:
decision_wait: 10s
num_traces: 50000
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: slow-traces
type: latency
latency: { threshold_ms: 2000 }
- name: probabilistic-baseline
type: probabilistic
probabilistic: { sampling_percentage: 5 }

Exporter quick reference
| Backend | Exporter package | Endpoint format |
|---|---|---|
| OTel Collector (recommended) | exporter-trace-otlp-http | http://collector:4318/v1/traces |
| Jaeger | exporter-jaeger or OTLP to Jaeger | http://jaeger:14268 |
| Grafana Tempo | OTLP HTTP | http://tempo:4318 |
| Datadog | dd-trace (separate SDK) or OTel Collector with Datadog exporter | https://trace.agent.datadoghq.com |
| Honeycomb | OTLP HTTP with API key header | https://api.honeycomb.io |
| New Relic | OTLP HTTP with license key header | https://otlp.nr-data.net:4318 |
Datadog note: Datadog's native dd-trace library has better Datadog-specific
features (APM, runtime metrics, profiling) than routing OTel through the Collector.
Use dd-trace if Datadog is your primary backend.
Environment variables reference
OTel SDKs respect these standard env vars:
# Service identity
OTEL_SERVICE_NAME=order-api
OTEL_SERVICE_VERSION=1.4.2
# Exporter endpoint (all signals)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
# Per-signal endpoint override
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://otel-collector:4318/v1/traces
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://otel-collector:4318/v1/metrics
# Auth headers for managed services (comma-separated key=value)
OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=YOUR_API_KEY
# Sampling
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1 # 10%
# Log level for the OTel SDK itself (not your app)
OTEL_LOG_LEVEL=warn

Using env vars instead of hardcoding SDK config keeps your instrumentation code environment-agnostic - the same binary runs in dev (100% sampling, console exporter) and production (10% sampling, OTLP to Collector) with only env changes.