Troubleshooting Guide¶
Use this guide to diagnose and fix the most common obskit issues in development and production.
Quick Symptom → Fix Table¶
| Symptom | Most Likely Cause | Quick Fix |
|---|---|---|
| No traces in Grafana Tempo | Wrong OTLP endpoint or port | Verify OBSKIT_OTLP_ENDPOINT and network connectivity |
| Metrics not scraped by Prometheus | Wrong port or path | Confirm OBSKIT_METRICS_PORT=9090 and Prometheus targets |
| Health check always unhealthy | Dependency check timeout | Increase OBSKIT_HEALTH_CHECK_TIMEOUT or fix the failing dependency |
| No trace_id in log events | Tracing not initialised before logging | Call setup_tracing() before creating any logger |
| ImportError: obskit.tracing | Package not installed | pip install "obskit[otlp]" |
| High memory from metrics | Cardinality explosion | Bound label values; see cardinality section |
| Circuit breaker opens immediately | Failure threshold too low or external service down | Raise threshold or fix the downstream service |
| Spans missing in production | Sample rate too low | Increase OBSKIT_TRACE_SAMPLE_RATE temporarily |
| configure() call ignored | Called after first get_settings() | Call configure() before any obskit import |
| Logs in JSON but timestamps wrong | Log aggregator double-stamping | Set OBSKIT_LOG_INCLUDE_TIMESTAMP=false |
Issue: No Traces Appearing in Grafana Tempo¶
Diagnosis steps¶
# 1. Verify OTLP endpoint resolves and port is open
python - <<'EOF'
import socket
host, port = "tempo", 4317
try:
    socket.setdefaulttimeout(3)
    socket.socket().connect((host, port))
    print(f"OK: {host}:{port} is reachable")
except Exception as e:
    print(f"FAIL: {e}")
EOF
# 2. Check obskit diagnostic output
python -m obskit.core.diagnose
# 3. Enable debug mode to print spans to stdout
from obskit.tracing import setup_tracing
setup_tracing(
    service_name="order-service",
    debug=True,  # prints every span to stdout — never use in production
)
Sample debug output:
[obskit] Span: POST /orders
  trace_id = 4bf92f3577b34da6a3ce929d0e0e4736
  span_id  = 00f067aa0ba902b7
  duration = 42.3 ms
  status   = OK
  attributes:
    http.method      = POST
    http.route       = /orders
    http.status_code = 201
Common causes and fixes¶
# Wrong
OBSKIT_OTLP_ENDPOINT=http://tempo:3200 # 3200 is Tempo HTTP API, not OTLP
OBSKIT_OTLP_ENDPOINT=http://tempo:9411 # 9411 is Zipkin format
# Correct
OBSKIT_OTLP_ENDPOINT=http://tempo:4317 # OTLP gRPC port
# If Tempo uses TLS but insecure flag is set
OBSKIT_OTLP_INSECURE=true # WRONG for TLS endpoint
OBSKIT_OTLP_INSECURE=false # correct for https/TLS endpoint
# Check your sample rate
OBSKIT_TRACE_SAMPLE_RATE=0.0 # drops 100% — nothing goes through
OBSKIT_TRACE_SAMPLE_RATE=1.0 # sample everything (dev/debug)
# WRONG — tracing is never initialised
from obskit.logging import get_logger
logger = get_logger(__name__)
# CORRECT — call setup_tracing() at application startup
from obskit.tracing import setup_tracing
setup_tracing(service_name="order-service") # must be before any logging
from obskit.logging import get_logger
logger = get_logger(__name__)
Issue: Metrics Not Scraped by Prometheus¶
Diagnosis steps¶
# 1. Confirm the metrics endpoint is up
curl -s http://localhost:9090/metrics | head -10
# 2. Check Prometheus targets page
open http://localhost:9091/targets # substitute your Prometheus URL
# 3. Verify Prometheus config
cat prometheus.yml | grep -A 5 "order-service"
Common causes and fixes¶
# prometheus.yml — wrong port
scrape_configs:
  - job_name: order-service
    static_configs:
      - targets: ["order-service:8000"]  # API port, not metrics
# Correct — use the metrics port
scrape_configs:
  - job_name: order-service
    static_configs:
      - targets: ["order-service:9090"]  # OBSKIT_METRICS_PORT
If you use FastAPI with instrument_fastapi() or ObskitMiddleware, the metrics server on port 9090 starts automatically. For standalone scripts, you must start it manually:
from obskit import configure_observability
from obskit.metrics import start_metrics_server
configure_observability(service_name="my-service", metrics_port=9090)
start_metrics_server() # opens port 9090 in background thread
# prometheus.yml — add bearer_token when OBSKIT_METRICS_AUTH_ENABLED=true
scrape_configs:
  - job_name: order-service
    bearer_token: "your-secret-token"
    static_configs:
      - targets: ["order-service:9090"]
# Check that ServiceMonitor label matches Prometheus operator selector
kubectl get prometheus -n monitoring -o yaml | grep serviceMonitorSelector -A 5
# The label on your ServiceMonitor must match this selector
Issue: Health Check Always Returns Unhealthy¶
Diagnosis steps¶
# 1. Call health endpoint directly
curl -s http://localhost:8001/health/ready | python -m json.tool
# 2. Look at which specific check is failing
curl -s http://localhost:8001/health/ready | python -c "
import sys, json
data = json.load(sys.stdin)
for check, result in data.get('checks', {}).items():
    status = result.get('status', '?')
    msg = result.get('message', '')
    print(f' {status:10} {check}: {msg}')
"
Common causes and fixes¶
from obskit.health import HealthChecker
from obskit.health.checks import DatabaseCheck
checker = HealthChecker()
checker.add_check(DatabaseCheck(
    name="postgres",
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    timeout=5.0,  # increase if DB is slow to respond
))
Check your connection string and that the DB container is in the same Docker network.
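As a quick check outside obskit, you can verify that the host and port embedded in the DSN accept TCP connections at all. This is a stdlib-only sketch (the `tcp_reachable` helper name is made up, not an obskit API); if it returns False, the problem is networking, not the health check.

```python
import socket
from urllib.parse import urlparse

def tcp_reachable(connection_string: str, timeout: float = 3.0) -> bool:
    """Return True if the host:port in a DSN accepts a TCP connection."""
    parsed = urlparse(connection_string)
    host = parsed.hostname
    port = parsed.port or 5432  # default PostgreSQL port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage: run this from inside the same container/network as your service
reachable = tcp_reachable("postgresql://user:pass@localhost:5432/mydb")
```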
# Increase health check timeout
OBSKIT_HEALTH_CHECK_TIMEOUT=10.0 # default is 5.0 seconds
from obskit.health.checks import RedisCheck
checker.add_check(RedisCheck(
    name="redis",
    url="redis://redis:6379",
    timeout=3.0,
    # Allow one retry before marking unhealthy
    retry_count=1,
))
# Wrap custom checks defensively
from obskit.health import HealthCheck, CheckResult, CheckStatus
class MyCheck(HealthCheck):
    async def check(self) -> CheckResult:
        try:
            await self._do_check()
            return CheckResult(status=CheckStatus.HEALTHY)
        except Exception as exc:
            return CheckResult(
                status=CheckStatus.UNHEALTHY,
                message=f"Check failed: {exc}",
            )
Issue: Log Correlation Not Working (No trace_id)¶
obskit injects trace_id and span_id into every log event only when an active span exists in the current context.
Diagnosis steps¶
from obskit.logging import get_logger
from obskit.tracing import setup_tracing
# Step 1: confirm tracing is initialised
setup_tracing(service_name="test", debug=True)
# Step 2: log inside a span
from opentelemetry import trace
logger = get_logger(__name__)
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("my-operation"):
    logger.info("inside span")  # should include trace_id and span_id
logger.info("outside span")  # will NOT have trace_id — this is correct
Common causes and fixes¶
# WRONG — logger captures context before tracing is active
from obskit.logging import get_logger
logger = get_logger(__name__)
from obskit.tracing import setup_tracing
setup_tracing(...)
# CORRECT — tracing first
from obskit.tracing import setup_tracing
setup_tracing(service_name="order-service")
from obskit.logging import get_logger
logger = get_logger(__name__)
Logs emitted outside a span will not have trace_id. This is the expected behaviour. Use the middleware or @with_span decorator to ensure spans wrap your request handlers.
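To see why the decorator guarantees correlation, it helps to look at the wrapping pattern itself. The sketch below is a generic stand-in, not obskit's real @with_span: `fake_span` imitates what `tracer.start_as_current_span(name)` does, so every call to the decorated function runs inside an active "span" and any log emitted inside it would pick up the context.

```python
from contextlib import contextmanager
from functools import wraps

@contextmanager
def fake_span(name):
    """Stand-in for tracer.start_as_current_span(name)."""
    print(f"span start: {name}")
    try:
        yield
    finally:
        print(f"span end: {name}")

def with_span(name):
    """Wrap a function so every call runs inside a named span."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            with fake_span(name):
                return func(*args, **kwargs)
        return wrapper
    return decorator

@with_span("create-order")
def create_order():
    # logs emitted here would carry the span's trace_id
    return "order-123"
```

obskit's actual decorator takes care of the tracer lookup for you; the point is only that logs outside the `with` block have no span context to inject.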
If you configured structlog manually without obskit's processor chain, the injection does not occur. Use obskit's unified setup:
# v1.0.0+ (recommended)
from obskit import configure_observability
obs = configure_observability(service_name="my-service")
# Logging, tracing, and metrics are all configured automatically.
# Legacy (still supported)
from obskit.logging.factory import configure_logging
configure_logging() # sets up the full processor chain including OTel injection
Issue: obskit Tracing Not Installed But Getting Import Errors¶
ModuleNotFoundError: No module named 'obskit.tracing'
Fix¶
# obskit tracing requires the otlp extra
pip install "obskit[otlp]"
# Or install the "all" extra, which includes every optional integration
pip install "obskit[all]"
# Verify installation
python -c "from obskit.tracing import setup_tracing; print('OK')"
Conditional import pattern¶
If tracing is optional in your codebase:
try:
    from obskit.tracing import setup_tracing
    _TRACING_AVAILABLE = True
except ImportError:
    _TRACING_AVAILABLE = False

if _TRACING_AVAILABLE:
    setup_tracing(service_name="order-service")
Issue: High Memory Usage from Metrics Cardinality¶
Prometheus cardinality explosions are one of the most common production problems. Each unique combination of label values creates a new time series in memory.
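Because the series count is the product of the distinct values of each label, one unbounded label dominates everything else. The arithmetic below uses illustrative numbers (not obskit output) to show how quickly a single counter blows up:

```python
# Distinct values observed per label on a single counter
label_cardinality = {
    "method": 5,          # GET, POST, PUT, PATCH, DELETE
    "status_code": 7,
    "path": 10_000,       # raw URLs with embedded IDs (the usual culprit)
}

series = 1
for distinct_values in label_cardinality.values():
    series *= distinct_values

print(series)  # 350000 time series from one metric
```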
Diagnosis¶
# Check cardinality of all metrics
curl -s http://localhost:9090/metrics | python -c "
import sys, collections
series = collections.Counter()
for line in sys.stdin:
    if line.startswith('#') or not line.strip():
        continue
    metric = line.split('{')[0].strip()
    series[metric] += 1
for metric, count in series.most_common(10):
    print(f'{count:6} {metric}')
"
Common causes and fixes¶
# WRONG — unique IDs in labels create cardinality explosion
# e.g. http_requests_total{path="/orders/uuid-1"}, {path="/orders/uuid-2"}, ...
metrics.request_count.labels(path=request.url.path)
# CORRECT — use route template, not concrete URL
metrics.request_count.labels(path="/orders/{order_id}")
# NEVER do this — one series per user = millions of series
metrics.api_calls.labels(user_id=current_user.id)
# Use aggregated dimensions instead
metrics.api_calls.labels(plan="premium")
Enable the built-in cardinality guard:
from obskit.metrics.cardinality import CardinalityGuard
guard = CardinalityGuard(
    max_series=10_000,  # alert if any metric exceeds this
    label_bounds={
        "status_code": {"200", "201", "400", "404", "429", "500", "503"},
        "method": {"GET", "POST", "PUT", "PATCH", "DELETE"},
        "environment": {"production", "staging", "development"},
    },
)
guard.install() # patches the default Prometheus registry
Any label value not in the allowed set is replaced with "__other__".
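Conceptually, the replacement rule is just a set-membership test. A minimal sketch of the behaviour (not obskit's actual implementation; `bound_label` is a hypothetical helper):

```python
def bound_label(value: str, allowed: set, fallback: str = "__other__") -> str:
    """Collapse any out-of-set label value into a single fallback bucket."""
    return value if value in allowed else fallback

allowed_codes = {"200", "201", "400", "404", "429", "500", "503"}
print(bound_label("200", allowed_codes))  # 200
print(bound_label("418", allowed_codes))  # __other__
```

This keeps the worst-case series count for the label at `len(allowed) + 1`, no matter what values callers pass in.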
Debug Mode: Printing Spans¶
Use debug=True in setup_tracing() to print every span to stdout. This is invaluable when you cannot access Grafana:
from obskit.tracing import setup_tracing
setup_tracing(
    service_name="order-service",
    debug=True,  # ConsoleSpanExporter — human-readable stdout
)
Never use debug=True in production
It bypasses the OTLP exporter and writes unstructured text to stdout, destroying log parsability and flooding your log aggregator.
python -m obskit.core.diagnose¶
Run the built-in diagnostic tool to get a complete picture of the effective configuration and connectivity:
python -m obskit.core.diagnose
Interpreting the output¶
obskit v1.0.0 — Diagnostic Report (2026-02-28T10:30:00Z)
==========================================================
Service
  name        : order-service ✓
  environment : production ✓
  version     : 2.1.0 ✓
Tracing
  enabled     : True
  endpoint    : http://tempo:4317
  reachable   : True ✓
  sample_rate : 0.1
  insecure    : False ✓
Metrics
  enabled     : True
  port        : 9090
  listening   : True ✓
  path        : /metrics
Logging
  level       : INFO
  format      : json
  backend     : structlog ✓
Health
  timeout     : 5.0 s
Packages installed
  obskit     1.0.0 ✓
  prometheus ✓
  otlp       ✓
  fastapi    ✓
Validation : PASS
Warning indicators¶
| Symbol | Meaning |
|---|---|
| ✓ | Check passed |
| ! | Warning — non-fatal issue |
| ✗ | Error — action required |
Log Output Format Debugging¶
If your log aggregator (Loki, Elasticsearch, Splunk) cannot parse obskit JSON logs:
# Test log output format
python - <<'EOF'
import os
os.environ["OBSKIT_LOG_FORMAT"] = "json"
os.environ["OBSKIT_LOG_LEVEL"] = "DEBUG"
from obskit.logging import get_logger
logger = get_logger("debug-test")
logger.info("test event", key="value", number=42)
EOF
Expected JSON output:
{
  "timestamp": "2026-02-28T10:30:00.123456Z",
  "level": "info",
  "logger": "debug-test",
  "event": "test event",
  "key": "value",
  "number": 42,
  "service": "unknown",
  "environment": "development",
  "version": "0.0.0"
}
If you see console format instead, check:
- `OBSKIT_LOG_FORMAT` environment variable is correctly set
- No other code is calling `logging.basicConfig()` before obskit initialises
- The `logging_backend` is `"structlog"` (loguru uses a different renderer)
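The second point is easy to verify at startup: if any handler is already attached to the root logger before obskit initialises, some other code configured logging first. A stdlib-only diagnostic sketch (the helper name is made up):

```python
import logging

def root_logger_preconfigured() -> bool:
    """True if something (e.g. logging.basicConfig) already attached handlers."""
    return bool(logging.root.handlers)

# Call this as early as possible in your entrypoint, before obskit imports:
if root_logger_preconfigured():
    print("WARNING: root logger already configured; obskit's renderer may be bypassed")
```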
Performance Profiling with Benchmarks¶
If you suspect obskit is adding latency to your hot path, run the micro-benchmarks:
# Install benchmark dependencies
pip install pytest-benchmark memory-profiler
# Run all benchmarks
cd benchmarks/
pytest bench_metrics.py bench_context.py -v
# Profile memory allocation
python bench_memory.py
# Run the macro benchmark (full stack)
python macro_runner.py --duration 60 --concurrency 50
Expected baseline numbers on a modern laptop (M2 MacBook Pro):
| Operation | p50 | p99 |
|---|---|---|
| logger.info() | 4 µs | 12 µs |
| counter.inc() | 0.8 µs | 2 µs |
| histogram.observe() | 1.2 µs | 4 µs |
| setup_tracing() (startup, once) | 50 ms | — |
| Context propagation per span | 5 µs | 20 µs |
If your numbers are significantly higher, check:
- Log format: `"json"` is slower than `"console"` — normal for production but avoid in tight loops.
- Trace sample rate: `1.0` in production with high throughput creates export back-pressure. Drop to `0.1`.
- Async queue depth: if `async_metric_queue_size` is too small, the queue fills and drops events; too large causes GC pressure.
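The drop-on-full behaviour in the last point can be reproduced with a plain bounded queue. This is a toy model of a non-blocking enqueue, not obskit's actual exporter internals:

```python
import queue

q = queue.Queue(maxsize=2)  # deliberately tiny queue
dropped = 0
for event in range(5):
    try:
        q.put_nowait(event)  # non-blocking, like an async metric enqueue
    except queue.Full:
        dropped += 1         # a full queue silently drops the event

print(f"enqueued={q.qsize()} dropped={dropped}")  # enqueued=2 dropped=3
```

Sizing the real queue is the same trade-off: a larger `maxsize` drops fewer events but holds more memory between export flushes.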
# Profile a specific hot path
python -c "
import cProfile, pstats, io
from obskit.logging import get_logger
logger = get_logger('bench')
pr = cProfile.Profile()
pr.enable()
for _ in range(100_000):
    logger.info('event', x=1)
pr.disable()
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(20)
print(s.getvalue())
"
Getting Help¶
If none of the above resolves your issue:
- Run `python -m obskit.core.diagnose` and include the output in your bug report.
- Check the GitHub Issues for similar reports.
- Open a new issue with: obskit version, Python version, OS, minimal reproduction script, and the diagnose output.
obskit self-metrics
When OBSKIT_ENABLE_SELF_METRICS=true, obskit exposes its own internal metrics at the same /metrics endpoint:
# Async metric queue depth (alert if > 80 % full)
obskit_async_metric_queue_depth / obskit_async_metric_queue_capacity
# Spans dropped due to full export queue
rate(obskit_spans_dropped_total[1m])
# OTLP export errors
rate(obskit_otlp_export_errors_total[5m])
If obskit_spans_dropped_total is rising, increase OBSKIT_TRACE_EXPORT_QUEUE_SIZE or reduce your sample rate.
Structured log search for common errors
Search Loki or Elasticsearch for obskit-emitted error events:
# Loki query for slow requests (> 1 s)
{app="order-service"} | json | duration_ms > 1000
# Loki query for all ERROR level events
{app="order-service"} | json | level = "error"
High obskit_async_metric_queue_depth
If this metric consistently exceeds 80 % of capacity, your metric recording rate exceeds the export rate. Remedies in order of preference:
- Reduce `OBSKIT_METRICS_SAMPLE_RATE` for high-frequency paths
- Increase `OBSKIT_ASYNC_METRIC_QUEUE_SIZE` (uses more memory)
- Reduce histogram bucket count (fewer buckets = faster processing)