Production Readiness Checklist

Use this checklist before every production deployment of a service instrumented with obskit v1.0.0. Each section maps to a specific obskit package and the decisions you must make before going live.


1. Configuration

  • [ ] OBSKIT_SERVICE_NAME set to a meaningful, unique name (not "unknown")
  • [ ] OBSKIT_ENVIRONMENT=production set explicitly
  • [ ] OBSKIT_VERSION injected from CI/CD (git tag or image tag)
  • [ ] configure_observability() (v1.0.0+) or configure() called before any other obskit import
  • [ ] Configuration validated at startup; fails fast if invalid
  • [ ] No secrets in ConfigMaps — all sensitive values in Kubernetes Secrets or Vault
  • [ ] .env file excluded from Docker image (COPY excludes it or .dockerignore entry)
Python
# main.py — configuration at the top, before other obskit imports
import os
from obskit import configure_observability

# v1.0.0+: single call configures logging, tracing, and metrics
obs = configure_observability(
    service_name=os.environ["SERVICE_NAME"],
    environment=os.environ["DEPLOY_ENV"],
    version=os.environ["APP_VERSION"],
)

# Legacy approach (still supported):
# from obskit import configure
# from obskit.config import validate_config
# configure(service_name=..., environment=..., version=...)
# is_valid, errors = validate_config()

2. Logging

  • [ ] OBSKIT_LOG_FORMAT=json — structured JSON for all production deployments
  • [ ] OBSKIT_LOG_LEVEL=INFO — avoid DEBUG in production (high volume, potential PII leakage)
  • [ ] OBSKIT_LOG_INCLUDE_TIMESTAMP=true unless your log aggregator adds its own
  • [x] PII scrubbing — automatic — the default get_logger() pipeline includes make_redaction_processor() which redacts password, token, secret, api_key, authorization, card_number, and 15+ other sensitive field names before any output is written. No setup required.
  • [ ] Log sampling rate set (OBSKIT_LOG_SAMPLE_RATE) for high-frequency paths
  • [ ] Log aggregator (Loki, Elasticsearch) confirmed to parse the JSON format
  • [ ] Correlation fields (trace_id, span_id) appear in log events from traced requests
Python
from obskit import configure_observability

obs = configure_observability(service_name="my-service", log_format="json")

# Verify trace injection works
log = obs.logger
log.info("startup complete", phase="init")
# JSON output should include: trace_id, span_id, service, environment, version

OTLP log export

Use configure_otlp_logging() to send structured logs to an OTLP collector. OTLPLogHandler is a logging.Handler adapter: add it to the root logger via logging.getLogger().addHandler(...) and standard-library log records are exported through the same OTel pipeline.


3. Metrics

  • [ ] OBSKIT_METRICS_ENABLED=true
  • [ ] OBSKIT_METRICS_PORT=9090 accessible to Prometheus (NetworkPolicy allows scraping)
  • [ ] OBSKIT_METRICS_METHOD=red (or golden / all based on team agreement)
  • [ ] Histogram buckets reviewed — default covers 1 ms – 10 s; widen if your p99 exceeds 10 s
  • [ ] Cardinality guard installed with bounded label values
  • [ ] No user IDs, emails, or high-cardinality values in metric labels
  • [ ] OBSKIT_METRICS_AUTH_ENABLED=true with token stored in Secret
  • [ ] Prometheus ServiceMonitor or scrape_config validated (targets shows UP)
  • [ ] Self-metrics enabled (OBSKIT_ENABLE_SELF_METRICS=true) to monitor obskit queue depth
Python
from obskit.metrics.cardinality import CardinalityGuard

CardinalityGuard(
    max_series=10_000,
    label_bounds={
        "http_method":   {"GET", "POST", "PUT", "PATCH", "DELETE", "HEAD"},
        "http_status":   {"200", "201", "204", "400", "401", "403", "404",
                          "422", "429", "500", "502", "503", "504"},
        "environment":   {"production", "staging", "development"},
    },
).install()

4. Tracing

  • [ ] OBSKIT_TRACING_ENABLED=true
  • [ ] OBSKIT_OTLP_ENDPOINT points to a reachable collector (Tempo, Jaeger, Collector)
  • [ ] OBSKIT_OTLP_INSECURE=false — TLS enforced in production
  • [ ] Sample rate configured for traffic volume (see sampling strategy below)
  • [ ] configure_observability() (or legacy setup_tracing()) called before any request handler runs
  • [ ] W3C traceparent header propagated through all HTTP calls (verify with curl)
  • [ ] Span attributes do not contain PII (db.statement sanitised, request body excluded)
  • [ ] Tempo / Jaeger UI shows complete traces end-to-end
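
To verify the traceparent propagation item above without a full tracing backend, you can validate captured headers directly against the W3C Trace Context format. This standalone helper is not part of obskit; it checks the version-00 layout of headers you capture with curl -v:

```python
import re

# W3C trace context, version "00": 16-byte trace-id, 8-byte parent-id,
# 1-byte flags, all lowercase hex; trace-id and parent-id must be non-zero.
_TRACEPARENT = re.compile(
    r"^00-(?!0{32})[0-9a-f]{32}-(?!0{16})[0-9a-f]{16}-[0-9a-f]{2}$"
)

def is_valid_traceparent(header: str) -> bool:
    """Return True if `header` is a well-formed W3C traceparent value."""
    return _TRACEPARENT.fullmatch(header.strip()) is not None
```

A header like 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 passes; an all-zero trace-id (which some broken proxies emit) is rejected.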

5. Health Checks

  • [ ] /health/live — liveness probe returns 200 (simple alive check)
  • [ ] /health/ready — readiness probe checks all critical dependencies
  • [ ] /health/startup — startup probe used with failureThreshold=30, periodSeconds=3
  • [ ] All critical dependencies (DB, Redis, external APIs) registered as health checks
  • [ ] OBSKIT_HEALTH_CHECK_TIMEOUT set lower than Kubernetes probe timeoutSeconds
  • [ ] Readiness returns 503 when any critical check is unhealthy
  • [ ] Non-critical checks (e.g., analytics service) marked as warning, not critical
Python
from obskit.health import HealthChecker
from obskit.health.checks import DatabaseCheck, RedisCheck, HTTPCheck

checker = HealthChecker()
checker.add_check(DatabaseCheck("postgres", db_url, timeout=3.0, critical=True))
checker.add_check(RedisCheck("redis", redis_url, timeout=2.0, critical=True))
checker.add_check(HTTPCheck("payment-api", "https://api.payments.com/health",
                             timeout=5.0, critical=False))  # non-critical

6. External Calls

  • [ ] Timeout set on every HTTP client call (never use default unlimited timeout)
  • [ ] Retry logic implemented for idempotent operations
  • [ ] Fallback responses defined for dependency failures
Python
import httpx
from obskit.logging import get_logger

log = get_logger(__name__)

async def charge_card(amount: float) -> dict:
    async with httpx.AsyncClient(timeout=10.0) as client:
        resp = await client.post("https://payments.acme.com/charge", json={"amount": amount})
        resp.raise_for_status()
        return resp.json()
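
The retry item above can be sketched without extra dependencies. This is a generic pattern, not an obskit API, and it is only safe for idempotent operations (GET, PUT with the same body):

```python
import asyncio
import random

async def retry_idempotent(call, *, attempts: int = 3, base_delay: float = 0.2):
    """Retry a zero-argument async callable with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            # full jitter: sleep a random fraction of the doubled delay
            await asyncio.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Wrap only idempotent calls, e.g. `await retry_idempotent(lambda: client.get(url))`; retrying the charge_card POST above would risk double-charging.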

7. SLO Definitions

Define SLOs before going live so you have a baseline from day one:

  • [ ] Error rate SLO defined (e.g., 99.9 % of requests succeed)
  • [ ] Latency SLO defined (e.g., p99 < 500 ms)
  • [ ] SLO burn rate alerts configured in Alertmanager
  • [ ] Error budget dashboard available in Grafana
Python
from obskit.slo import SLOTracker

slo = SLOTracker(
    name="order-service-availability",
    target=0.999,   # 99.9 %
    window_days=30,
)
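
A target like 99.9 % translates into a concrete error budget. The arithmetic below is independent of obskit and worth running before you commit to a number:

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of total downtime the SLO allows per window."""
    return window_days * 24 * 60 * (1 - target)

# 99.9 % over 30 days allows roughly 43.2 minutes of downtime
budget = error_budget_minutes(0.999, 30)
```

For comparison, 99.99 % over the same window leaves only about 4.3 minutes, which changes how aggressive your burn-rate alerts must be.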

Sampling Strategy for High-Traffic Services

Choosing the right sample rate avoids two failure modes:

  • Too high (1.0 in production): overwhelms the collector and storage; drives up cost
  • Too low (0.001): you miss tail-latency issues and rare errors
Requests / second | Recommended OBSKIT_TRACE_SAMPLE_RATE | Effective traces / hour
< 10              | 1.0 (100 %)                          | up to 36 000
10 – 100          | 0.1 (10 %)                           | up to 36 000
100 – 1 000       | 0.01 (1 %)                           | up to 36 000
> 1 000           | 0.001 – 0.01                         | calibrate to storage budget
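
The "effective traces / hour" column is just rps × sample rate × 3600. A small helper (illustrative, not part of obskit) makes the budget calibration for the highest-traffic row concrete:

```python
def traces_per_hour(rps: float, sample_rate: float) -> float:
    """Expected exported traces per hour: rps x sample rate x 3600."""
    return rps * sample_rate * 3600

def rate_for_budget(rps: float, max_traces_per_hour: float) -> float:
    """Largest sample rate (capped at 1.0) that stays within a storage budget."""
    return min(1.0, max_traces_per_hour / (rps * 3600))
```

At 2 000 req/s, a 36 000 traces/hour budget implies a sample rate of 0.005, inside the table's 0.001 – 0.01 band.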

Tail-based sampling

For critical services where you want 100 % of errors captured regardless of rate, use tail-based sampling in the OpenTelemetry Collector:

YAML
# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always sample slow requests
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      # Sample 1% of everything else
      - name: base-rate
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

Cardinality Limits

Prometheus stores one time series per unique label-value combination. Keep cardinality under control:

Metric type            | Max recommended series                         | Action if exceeded
Per-endpoint latency   | 200 (10 routes × 5 methods × 4 status buckets) | Aggregate similar routes
Per-user metrics       | Never                                          | Use per-plan / per-tier instead
Per-tenant metrics     | 1 000                                          | Use CardinalityGuard with allow-list
Background job metrics | 50                                             | One metric per job type, not per instance
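
Worst-case series count for one metric name is the product of distinct values per label; the 200-series figure in the first row works out as follows (illustrative helper, not an obskit API):

```python
from math import prod

def series_count(distinct_values_per_label: dict[str, int]) -> int:
    """Worst-case series for one metric name: product of distinct label values.

    Histograms additionally multiply by the bucket count (+2 for _sum/_count).
    """
    return prod(distinct_values_per_label.values())

# 10 routes x 5 methods x 4 status classes = 200 series
latency_series = series_count({"route": 10, "method": 5, "status_class": 4})
```

Adding a single 100-value label to that metric would push it to 20 000 series, which is why user IDs in labels are banned above.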

Cardinality alert rule

YAML
# prometheus-rules.yaml
groups:
  - name: obskit-cardinality
    rules:
      - alert: MetricsCardinalityHigh
        expr: |
          count by (metric_name) (
            label_replace(
              {__name__=~".+"},
              "metric_name", "$1", "__name__", "(.*)"
            )
          ) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High cardinality metric: {{ $labels.metric_name }}"
          description: "{{ $value }} series; investigate label values"

Circuit Breaker Tuning

Parameter          | Conservative | Balanced | Aggressive
failure_threshold  | 10           | 5        | 3
recovery_timeout   | 60 s         | 30 s     | 10 s
half_open_requests | 5            | 3        | 1

Conservative: Use for dependencies that are slow to recover (databases, third-party APIs with rate limits).

Balanced: Default; suitable for most internal microservice calls.

Aggressive: Use for low-latency dependencies where fast failure detection is critical (caches).
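
For intuition on how the three parameters interact, here is a minimal three-state breaker. It is a sketch of the standard pattern, not obskit's implementation:

```python
import time

class CircuitBreaker:
    """Minimal closed -> open -> half-open state machine."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold   # failures before opening
        self.recovery_timeout = recovery_timeout     # seconds before probing again
        self.half_open_requests = half_open_requests # successes needed to close
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._half_open_successes = 0

    def allow(self) -> bool:
        """Should the next call be attempted?"""
        if self.state == "open":
            if self._clock() - self._opened_at >= self.recovery_timeout:
                self.state = "half_open"   # timeout elapsed: probe the dependency
                self._half_open_successes = 0
                return True
            return False                   # still open: fail fast
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self._half_open_successes += 1
            if self._half_open_successes >= self.half_open_requests:
                self.state = "closed"      # dependency recovered
                self._failures = 0
        else:
            self._failures = 0

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._trip()                   # probe failed: reopen immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
        self._failures = 0
```

The tradeoff is visible in the code: a low failure_threshold trips faster on genuine outages but also on brief blips, and a short recovery_timeout probes sooner at the cost of hammering a dependency that is still down.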


Memory and CPU Considerations

Component               | Memory impact            | CPU impact              | Mitigation
Prometheus histograms   | ~1 KB per series         | Negligible              | Use cardinality guard
Trace export queue      | 2048 × ~2 KB = ~4 MB     | Low (background thread) | Tune OBSKIT_TRACE_EXPORT_QUEUE_SIZE
Async metric queue      | 10 000 × ~0.5 KB = ~5 MB | Low                     | Reduce if memory-constrained
structlog JSON renderer | Negligible               | ~4 µs per log event     | Use log sampling at high volume
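
The queue rows above are queue size × item size; if you tune OBSKIT_TRACE_EXPORT_QUEUE_SIZE, it is worth recomputing the footprint (illustrative arithmetic, item sizes are the table's estimates):

```python
def queue_memory_mb(queue_size: int, item_kb: float) -> float:
    """Approximate resident memory (MB) of a bounded export queue."""
    return queue_size * item_kb / 1024

trace_queue_mb = queue_memory_mb(2048, 2.0)     # default trace queue: ~4 MB
metric_queue_mb = queue_memory_mb(10_000, 0.5)  # metric queue: ~5 MB
```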

Graceful Shutdown

obskit registers shutdown hooks automatically when you call configure_observability() (or the legacy setup_tracing() and start_metrics_server()). For custom cleanup:

Python
import signal
import asyncio
from obskit.shutdown import ShutdownManager

manager = ShutdownManager()

@manager.on_shutdown
async def flush_traces():
    """Ensure all pending spans are exported before process exits."""
    from opentelemetry import trace
    provider = trace.get_tracer_provider()
    if hasattr(provider, "force_flush"):
        provider.force_flush(timeout_millis=5000)

@manager.on_shutdown
async def flush_metrics():
    """Final metrics push."""
    # Prometheus scrape will happen naturally; nothing to flush
    pass

# Kubernetes sends SIGTERM before SIGKILL. Register the handler through the
# running event loop: scheduling a task from a plain signal.signal() handler
# is fragile because no loop is guaranteed to be current when it fires.
def install_sigterm_handler() -> None:
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(
        signal.SIGTERM,
        lambda: asyncio.ensure_future(manager.shutdown()),
    )

In Kubernetes, always set terminationGracePeriodSeconds to at least 30 seconds so in-flight traces have time to export:

YAML
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]

Rollback Procedure

If a deployment causes metric or tracing anomalies:

Bash
# 1. Rollback the deployment
kubectl rollout undo deployment/order-service -n production

# 2. Verify rollback is complete
kubectl rollout status deployment/order-service -n production

# 3. Check that old version metrics are flowing
curl -s http://localhost:9090/metrics | grep 'service_version'

# 4. Verify health checks recover
kubectl get pods -n production -l app=order-service
# All pods should show READY 2/2 and STATUS Running

# 5. Confirm Grafana dashboards return to baseline
# Check: error rate, p99 latency, SLO burn rate

Alerting Setup

Minimum alert set before going live:

YAML
# prometheus-rules.yaml
groups:
  - name: obskit-production
    rules:

      # Service availability — SLO breach
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 1 % for 2 minutes"

      # Latency — p99 SLO breach
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency > 500 ms"

      # Scrape target down (Prometheus cannot reach the service)
      - alert: ServiceUnhealthy
        expr: up{job="order-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Scrape target for order-service is down"

      # SLO burn rate high
      - alert: SLOBurnRateHigh
        expr: |
          slo_error_budget_remaining{service="order-service"} < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SLO error budget below 10% for {{ $labels.service }}"

Dashboard Templates

Import the following obskit Grafana dashboard templates before go-live:

Dashboard                 | Purpose                                         | UID
obskit — RED Metrics      | Request rate, error rate, duration per endpoint | obskit-red
obskit — SLO Burn Rate    | Error budget consumption over 30-day window     | obskit-slo
obskit — Circuit Breakers | State transitions, failure counts               | obskit-cb
obskit — Health Overview  | All health check statuses across services       | obskit-health
obskit — Self-Metrics     | obskit queue depth, export latency              | obskit-self

Grafana dashboard JSON files are located in the dashboards/ directory of the obskit repository.


You are production-ready when...

Every item in sections 1–7 above is checked. Dashboards are imported, alerts are configured, and at least one full end-to-end trace is visible in Grafana Tempo before your first real user request.

Common production anti-patterns

  • Deploying with OBSKIT_ENVIRONMENT=development — alerts may be silenced for non-production environments
  • Setting OBSKIT_TRACE_SAMPLE_RATE=1.0 with > 100 req/s — this will overwhelm the collector
  • Using log_format=console in production — Loki cannot parse human-readable output
  • Hardcoding OBSKIT_METRICS_AUTH_TOKEN in the Deployment manifest — it will appear in kubectl describe

Automate this checklist

Add the configuration validation call to your CI pipeline smoke test:

Python
# tests/smoke/test_production_config.py
import os, pytest

@pytest.mark.skipif(
    os.getenv("OBSKIT_ENVIRONMENT") != "production",
    reason="Production config check only"
)
def test_production_config_valid():
    from obskit.config import validate_config
    is_valid, errors = validate_config()
    assert is_valid, f"Production config errors: {errors}"

Environment-specific Configuration Summary

Bash
OBSKIT_ENVIRONMENT=development
OBSKIT_LOG_FORMAT=console       # readable in terminal
OBSKIT_LOG_LEVEL=DEBUG
OBSKIT_TRACE_SAMPLE_RATE=1.0    # capture everything
OBSKIT_OTLP_INSECURE=true
OBSKIT_METRICS_AUTH_ENABLED=false
Bash
OBSKIT_ENVIRONMENT=staging
OBSKIT_LOG_FORMAT=json
OBSKIT_LOG_LEVEL=INFO
OBSKIT_TRACE_SAMPLE_RATE=0.5    # 50 % — find issues before prod
OBSKIT_OTLP_INSECURE=false
OBSKIT_METRICS_AUTH_ENABLED=true
Bash
OBSKIT_ENVIRONMENT=production
OBSKIT_LOG_FORMAT=json
OBSKIT_LOG_LEVEL=INFO
OBSKIT_TRACE_SAMPLE_RATE=0.1    # tune per traffic volume
OBSKIT_OTLP_INSECURE=false
OBSKIT_METRICS_AUTH_ENABLED=true
OBSKIT_CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
OBSKIT_RETRY_MAX_ATTEMPTS=3