How to Trace Health Checks

When an OpenTelemetry span is active during a health check request, HealthResult.to_dict() automatically includes trace_id and span_id in the response payload. This creates a direct link between a failing health check and the exact trace in Grafana Tempo — so you can see why the check failed, not just that it failed.


What is health check tracing?

A health check is typically a lightweight HTTP endpoint (/health or /healthz) that aggregates the status of several sub-checks (database connectivity, downstream service reachability, SLO compliance, etc.). When a check fails in production, you usually only see a binary degraded / unhealthy status.

With health check tracing:

  • Each call to HealthChecker.check_health() runs inside an OTel span.
  • The span carries the full check execution as child spans.
  • HealthResult.to_dict() embeds the trace_id of that span in the JSON response.
  • Alertmanager or your on-call tooling can follow the trace_id link directly to the Tempo trace, showing exactly which sub-check failed, how long it took, and what error was raised.

Requirements

Package | Minimum version | Role
------- | --------------- | ----
obskit | 1.0.0 | HealthChecker, HealthResult, built-in checks
obskit[otlp] | 1.0.0 | OTel span context injection
opentelemetry-sdk | 1.20.0 | Active span context

Installation

Bash
pip install "obskit[otlp]"

Basic example: with and without tracing

Without tracing configured, the response contains no trace fields:

Python
import asyncio
from obskit.health import HealthChecker

checker = HealthChecker(service_name="order-service", version="2.0.0")
result = asyncio.run(checker.check_health())
print(result.to_dict())

Output

JSON
{
  "status": "healthy",
  "healthy": true,
  "service": "order-service",
  "version": "2.0.0",
  "timestamp": "2026-02-28T10:00:00.000000+00:00",
  "checks": {}
}
With tracing configured, the same check embeds the active span's context:

Python
import asyncio
from obskit.health import HealthChecker
from obskit.tracing import setup_tracing
from opentelemetry import trace

setup_tracing(service_name="order-service", exporter_endpoint="http://localhost:4317")
tracer = trace.get_tracer("order_service")

checker = HealthChecker(service_name="order-service", version="2.0.0")

async def run():
    with tracer.start_as_current_span("health_check"):
        result = await checker.check_health()
        return result.to_dict()

print(asyncio.run(run()))

Output

JSON
{
  "status": "healthy",
  "healthy": true,
  "service": "order-service",
  "version": "2.0.0",
  "timestamp": "2026-02-28T10:00:00.000000+00:00",
  "checks": {},
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}

The trace_id field is only present when a valid OTel span is active. When tracing is not configured, the field is omitted silently — no errors, no warnings.
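The inclusion logic can be pictured as a small guard. This is a simplified sketch, not obskit's actual implementation; it relies only on the fact that OTel represents trace and span IDs as integers and uses an all-zeros context to mean "no span active":

```python
def inject_trace_context(payload: dict, trace_id: int, span_id: int) -> dict:
    """Add trace fields to a health payload only when the span context is valid.

    Sketch: an all-zeros trace context means no span is active, so the
    fields are silently omitted rather than emitted as zeros.
    """
    if trace_id != 0 and span_id != 0:
        payload["trace_id"] = format(trace_id, "032x")  # 32 lowercase hex chars
        payload["span_id"] = format(span_id, "016x")    # 16 lowercase hex chars
    return payload


# No active span: fields are omitted, no errors, no warnings.
print(inject_trace_context({"status": "healthy"}, 0, 0))
# Active span: hex-encoded IDs are embedded.
print(inject_trace_context(
    {"status": "healthy"},
    0x4BF92F3577B34DA6A3CE929D0E0E4736,
    0x00F067AA0BA902B7,
))
```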


FastAPI /health endpoint with trace context

The most common pattern is a FastAPI route that wraps the health check in a span so that the trace_id is always included in the response:

Python
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from obskit import configure_observability, instrument_fastapi
from obskit.health import HealthChecker, create_http_check
from opentelemetry import trace

configure_observability(
    service_name="order-service",
    otlp_endpoint="http://localhost:4317",
)

app = FastAPI()
instrument_fastapi(app)

tracer = trace.get_tracer("order_service")
checker = HealthChecker(service_name="order-service", version="2.0.0")

# Register checks
checker.add_check(
    "payments-api",
    create_http_check("https://payments.internal/health", timeout=2.0),
)


@app.get("/health")
async def health():
    # The ObskitMiddleware already started a span for this HTTP request,
    # so HealthResult.to_dict() will include trace_id automatically.
    result = await checker.check_health()
    status_code = 200 if result.healthy else 503
    return JSONResponse(content=result.to_dict(), status_code=status_code)

Example healthy response:

JSON
{
  "status": "healthy",
  "healthy": true,
  "service": "order-service",
  "version": "2.0.0",
  "timestamp": "2026-02-28T10:00:00.000000+00:00",
  "checks": {
    "payments-api": {"status": "healthy", "latency_ms": 12}
  },
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}

Example degraded response (HTTP 503):

JSON
{
  "status": "degraded",
  "healthy": false,
  "service": "order-service",
  "version": "2.0.0",
  "timestamp": "2026-02-28T10:00:00.000000+00:00",
  "checks": {
    "payments-api": {
      "status": "unhealthy",
      "error": "Connection refused",
      "latency_ms": 2003
    }
  },
  "trace_id": "9a3fc1d2e4b50a17c83e2f9600b1d8e5",
  "span_id": "b3f8a1c200e40912"
}

The trace_id in the 503 response points directly to the Tempo trace that shows the payments-api check timing out.


SLO-based health check

You can add an SLO compliance check that marks the service as degraded when the error budget is nearly exhausted:

Python
from obskit.health import HealthChecker
from obskit.health.slo_check import create_slo_health_check
from obskit.slo import SLOTracker

tracker = SLOTracker(
    name="order-success-rate",
    target=0.999,
    window_seconds=3600,
)

checker = HealthChecker(service_name="order-service", version="2.0.0")
checker.add_check(
    "slo-compliance",
    create_slo_health_check(tracker, degraded_threshold=0.10),  # 10 % budget remaining
)

When the SLO budget drops below 10 %, the health check returns degraded — and the trace_id in the response points to the Tempo trace that shows the SLO tracker state at the moment of the check.
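The degraded decision reduces to simple error-budget arithmetic. The sketch below illustrates the math, not obskit's internal logic; the target and threshold values are taken from the example above:

```python
def slo_budget_status(target: float, success_rate: float,
                      degraded_threshold: float) -> dict:
    """Compute remaining error budget and the resulting health status.

    With target=0.999 the error budget is 0.1 % of requests; the check
    degrades when the remaining fraction of that budget drops below
    degraded_threshold.
    """
    budget = 1.0 - target                      # total allowed error rate
    consumed = (1.0 - success_rate) / budget   # fraction of budget used
    remaining = 1.0 - consumed
    status = "degraded" if remaining < degraded_threshold else "healthy"
    return {"remaining_budget": remaining, "status": status}


# Half the budget left: still healthy.
print(slo_budget_status(0.999, 0.9995, 0.10))
# Only 5 % of the budget left: degraded.
print(slo_budget_status(0.999, 0.99905, 0.10))
```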


Kubernetes liveness and readiness probes

Health check tracing is especially useful for debugging flapping Kubernetes pods. When a pod is restarted because the liveness probe returned non-200, you normally lose the context of what failed. With trace_id in the response, you can query Tempo for the exact trace before the restart.

YAML
# kubernetes/deployment.yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10

When the probe fails, Kubernetes logs the HTTP response body. Because the response includes trace_id, you can go directly to:

Text Only
http://localhost:3000/explore?orgId=1&left={"datasource":"Tempo","queries":[{"query":"<trace_id>"}]}
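That Explore link can be built programmatically from the trace_id, since the left pane state is just URL-encoded JSON. A standard-library sketch; the Grafana base URL and the "Tempo" datasource name are assumptions carried over from the URL above:

```python
from urllib.parse import quote


def tempo_explore_url(trace_id: str,
                      grafana_base: str = "http://localhost:3000") -> str:
    """Build a Grafana Explore deep link that opens the given Tempo trace.

    The 'left' pane state is JSON that must be URL-encoded; the base URL
    and datasource name are assumptions, adjust for your environment.
    """
    left = f'{{"datasource":"Tempo","queries":[{{"query":"{trace_id}"}}]}}'
    return f"{grafana_base}/explore?orgId=1&left={quote(left, safe='')}"


print(tempo_explore_url("4bf92f3577b34da6a3ce929d0e0e4736"))
```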

Using trace_id in Alertmanager webhooks

Configure an Alertmanager receiver that captures the trace_id from the /health response and includes it in the alert annotation:

YAML
# alertmanager.yml
receivers:
  - name: "health-check-alerts"
    webhook_configs:
      - url: "http://alert-enricher.internal/enrich"
        send_resolved: true
        http_config:
          follow_redirects: true

In the alert-enricher service, when you receive a HealthCheck alert, call your /health endpoint, extract trace_id from the JSON, and annotate the alert:

Python
import httpx

async def enrich_health_alert(alert: dict) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get("http://order-service.internal/health")
        body = response.json()
        trace_id = body.get("trace_id", "")

    if trace_id:
        alert["annotations"]["trace_url"] = (
            f"http://grafana.internal/explore?left="
            f'[%22now-1h%22,%22now%22,%22Tempo%22,{{"query":"{trace_id}"}}]'
        )
    return alert

The resulting alert in Slack / PagerDuty contains a direct Tempo link for the failing health check.


Grafana: annotating dashboards with health check events

Use Grafana Annotations to mark health degradation events on your service dashboards with a link to the trace.

  1. Create an Annotation query on your dashboard that queries Prometheus:
PromQL
changes(obskit_health_status{service="order-service"}[1m]) > 0
  2. In the Annotation configuration, set the Text template to include the trace URL:
Text Only
Health check degraded — trace: http://tempo:3200/trace/${__field.labels.trace_id}
  3. Each annotation marker on the dashboard timeline now carries a clickable trace link.

Async health checks with context propagation

If your health check functions are async and spawn sub-tasks, the OTel context propagates automatically through asyncio coroutines. Each async check function will therefore run within a child span of the main health check span:

Python
import asyncio
import httpx
from obskit.health import HealthChecker

async def check_redis(name: str) -> dict:
    """A custom async check that tests Redis connectivity."""
    try:
        # redis-py ships the async client (aioredis was merged into redis-py).
        from redis.asyncio import Redis
        redis = Redis.from_url("redis://localhost:6379")
        await redis.ping()
        await redis.aclose()
        return {"status": "healthy"}
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}


async def check_downstream_api(name: str) -> dict:
    """A custom async check for a downstream HTTP dependency."""
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            resp = await client.get("https://payments.internal/ping")
            resp.raise_for_status()
        return {"status": "healthy"}
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}


checker = HealthChecker(service_name="order-service", version="2.0.0")
checker.add_check("redis", check_redis)
checker.add_check("payments-api", check_downstream_api)

# All async checks run concurrently, each inheriting the OTel span context.
result = asyncio.run(checker.check_health())

Concurrent checks and trace context

HealthChecker.check_health() runs all registered checks concurrently with asyncio.gather(). Each check coroutine inherits the active OTel span context via contextvars, so all checks appear as sibling child spans in the Tempo trace view.
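The propagation itself is plain contextvars behavior and can be verified without any OTel dependency. In this minimal stand-in sketch, the ContextVar plays the role of the active span context:

```python
import asyncio
import contextvars

# Stand-in for the active OTel span context; each gathered task
# inherits a copy of the caller's context at task-creation time.
current_span = contextvars.ContextVar("current_span", default="no-span")


async def sub_check(name: str) -> str:
    # Runs in its own task, yet still sees the "span" set by the caller.
    return f"{name} ran under {current_span.get()}"


async def health_check() -> list:
    current_span.set("health_check-span")
    # gather() wraps each coroutine in a task that copies this context.
    return await asyncio.gather(sub_check("redis"), sub_check("payments-api"))


print(asyncio.run(health_check()))
```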


Troubleshooting

"trace_id is missing from /health response"

  1. Is obskit[otlp] installed?
Bash
python -m obskit.core.diagnose
  2. Is a span active during the health check?

The ObskitMiddleware creates a span for every incoming HTTP request, including GET /health. If you are not using the middleware, wrap the check_health() call in a manual span:

Python
with tracer.start_as_current_span("health_check"):
    result = await checker.check_health()
  3. Is setup_tracing() called at application startup?

The OTel TracerProvider must be initialised before the first request is served.

"Health check always shows trace_id=000000... (all zeros)"

The OTel SDK is installed but no exporter is configured, so spans use a no-op provider that generates an invalid (all-zeros) trace context. Pass a valid exporter_endpoint to setup_tracing().

"Kubernetes probe response body does not contain trace_id"

Kubernetes does not log the HTTP response body by default for probes. To capture it, use a lifecycle.postStart or preStop hook script that calls the health endpoint, or configure your application to write health check results to a file that Kubernetes can retrieve with kubectl exec.
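A sketch of the file-based approach; the file name is an assumption, and the payload shape mirrors the degraded response shown earlier:

```python
import json
import tempfile
from pathlib import Path

# Assumed location; in a pod you would typically pick a path on a volume
# that survives container restarts.
HEALTH_FILE = Path(tempfile.gettempdir()) / "last_health_result.json"


def persist_health_result(result: dict, path: Path = HEALTH_FILE) -> None:
    """Write the latest health payload to disk so it can be read back
    later, e.g. with `kubectl exec -- cat <path>`."""
    path.write_text(json.dumps(result, indent=2))


persist_health_result(
    {"status": "degraded", "healthy": False,
     "trace_id": "9a3fc1d2e4b50a17c83e2f9600b1d8e5"}
)
print(json.loads(HEALTH_FILE.read_text())["trace_id"])
```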


Summary

Step | Action
---- | ------
Install | pip install "obskit[otlp]"
Initialise | Call setup_tracing() at startup
Middleware | Add ObskitMiddleware to wrap requests in spans
Endpoint | Return result.to_dict() from your /health route
Verify | trace_id appears in the JSON response body
Grafana | Follow trace_id link to Tempo for degraded checks