Troubleshooting Guide¶
Use this guide to diagnose and fix the most common obskit issues in development and production.
Quick Symptom → Fix Table¶
| Symptom | Most Likely Cause | Quick Fix |
|---|---|---|
| No traces in Grafana Tempo | Wrong OTLP endpoint or port | Verify OBSKIT_OTLP_ENDPOINT and network connectivity |
| Metrics not scraped by Prometheus | Wrong port or path | Confirm OBSKIT_METRICS_PORT=9090 and Prometheus targets |
| Health check always unhealthy | Dependency check timeout | Increase OBSKIT_HEALTH_CHECK_TIMEOUT or fix the failing dependency |
| No trace_id in log events | Tracing not initialised before logging | Call setup_tracing() before creating any logger |
| ImportError: obskit.tracing | Package not installed | pip install "obskit[otlp]" |
| High memory from metrics | Cardinality explosion | Bound label values; see cardinality section |
| Circuit breaker opens immediately | Failure threshold too low or external service down | Raise threshold or fix the downstream service |
| Spans missing in production | Sample rate too low | Increase OBSKIT_TRACE_SAMPLE_RATE temporarily |
| configure() call ignored | Called after first get_settings() | Call configure() before any obskit import |
| Logs in JSON but timestamps wrong | Log aggregator double-stamping | Set OBSKIT_LOG_INCLUDE_TIMESTAMP=false |
Issue: No Traces Appearing in Grafana Tempo¶
Diagnosis steps¶
# 1. Verify OTLP endpoint resolves and port is open
python - <<'EOF'
import socket
host, port = "tempo", 4317
try:
    socket.setdefaulttimeout(3)
    socket.socket().connect((host, port))
    print(f"OK: {host}:{port} is reachable")
except Exception as e:
    print(f"FAIL: {e}")
EOF
# 2. Check obskit diagnostic output
python -m obskit.core.diagnose
# 3. Enable debug mode to print spans to stdout
from obskit.tracing import setup_tracing
setup_tracing(
    service_name="order-service",
    debug=True,  # prints every span to stdout — never use in production
)
Sample debug output:
[obskit] Span: POST /orders
  trace_id = 4bf92f3577b34da6a3ce929d0e0e4736
  span_id  = 00f067aa0ba902b7
  duration = 42.3 ms
  status   = OK
  attributes:
    http.method      = POST
    http.route       = /orders
    http.status_code = 201
Common causes and fixes¶
# Wrong
OBSKIT_OTLP_ENDPOINT=http://tempo:3200 # 3200 is Tempo HTTP API, not OTLP
OBSKIT_OTLP_ENDPOINT=http://tempo:9411 # 9411 is Zipkin format
# Correct
OBSKIT_OTLP_ENDPOINT=http://tempo:4317 # OTLP gRPC port
# If Tempo uses TLS but insecure flag is set
OBSKIT_OTLP_INSECURE=true # WRONG for TLS endpoint
OBSKIT_OTLP_INSECURE=false # correct for https/TLS endpoint
# Check your sample rate
OBSKIT_TRACE_SAMPLE_RATE=0.0 # drops 100% — nothing goes through
OBSKIT_TRACE_SAMPLE_RATE=1.0 # sample everything (dev/debug)
# WRONG — tracing is never initialised
from obskit.logging import get_logger
logger = get_logger(__name__)
# CORRECT — call setup_tracing() at application startup
from obskit.tracing import setup_tracing
setup_tracing(service_name="order-service") # must be before any logging
from obskit.logging import get_logger
logger = get_logger(__name__)
Issue: Metrics Not Scraped by Prometheus¶
Diagnosis steps¶
# 1. Confirm the metrics endpoint is up
curl -s http://localhost:9090/metrics | head -10
# 2. Check Prometheus targets page
open http://localhost:9091/targets # substitute your Prometheus URL
# 3. Verify Prometheus config
cat prometheus.yml | grep -A 5 "order-service"
Common causes and fixes¶
# prometheus.yml — wrong port
scrape_configs:
  - job_name: order-service
    static_configs:
      - targets: ["order-service:8000"]  # API port, not metrics
# Correct — use the metrics port
scrape_configs:
  - job_name: order-service
    static_configs:
      - targets: ["order-service:9090"]  # OBSKIT_METRICS_PORT
If you use FastAPI with instrument_fastapi() or ObskitMiddleware, the metrics server on port 9090 starts automatically. For standalone scripts, you must start it manually:
from obskit import configure_observability
from obskit.metrics import start_metrics_server
configure_observability(service_name="my-service", metrics_port=9090)
start_metrics_server() # opens port 9090 in background thread
# prometheus.yml — add bearer_token when OBSKIT_METRICS_AUTH_ENABLED=true
scrape_configs:
  - job_name: order-service
    bearer_token: "your-secret-token"
    static_configs:
      - targets: ["order-service:9090"]
# Check that ServiceMonitor label matches Prometheus operator selector
kubectl get prometheus -n monitoring -o yaml | grep serviceMonitorSelector -A 5
# The label on your ServiceMonitor must match this selector
Issue: Health Check Always Returns Unhealthy¶
Diagnosis steps¶
# 1. Call health endpoint directly
curl -s http://localhost:8001/health/ready | python -m json.tool
# 2. Look at which specific check is failing
curl -s http://localhost:8001/health/ready | python -c "
import sys, json
data = json.load(sys.stdin)
for check, result in data.get('checks', {}).items():
    status = result.get('status', '?')
    msg = result.get('message', '')
    print(f' {status:10} {check}: {msg}')
"
Common causes and fixes¶
from obskit.health import HealthChecker
from obskit.health.checks import DatabaseCheck
checker = HealthChecker()
checker.add_check(DatabaseCheck(
    name="postgres",
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    timeout=5.0,  # increase if DB is slow to respond
))
Check your connection string and that the DB container is in the same Docker network.
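As a quick check outside obskit, you can verify that the host and port embedded in the DSN accept TCP connections at all. This is a stdlib-only sketch (the `tcp_reachable` helper name is made up, not an obskit API); if it returns False, the problem is networking, not the health check.

```python
import socket
from urllib.parse import urlparse

def tcp_reachable(connection_string: str, timeout: float = 3.0) -> bool:
    """Return True if the host:port in a DSN accepts a TCP connection."""
    parsed = urlparse(connection_string)
    host = parsed.hostname
    port = parsed.port or 5432  # default PostgreSQL port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage: run this from inside the same container/network as your service
reachable = tcp_reachable("postgresql://user:pass@localhost:5432/mydb")
```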
# Increase health check timeout
OBSKIT_HEALTH_CHECK_TIMEOUT=10.0 # default is 5.0 seconds
from obskit.health.checks import RedisCheck
checker.add_check(RedisCheck(
    name="redis",
    url="redis://redis:6379",
    timeout=3.0,
    # Allow one retry before marking unhealthy
    retry_count=1,
))
# Wrap custom checks defensively
from obskit.health import HealthCheck, CheckResult, CheckStatus
class MyCheck(HealthCheck):
    async def check(self) -> CheckResult:
        try:
            await self._do_check()
            return CheckResult(status=CheckStatus.HEALTHY)
        except Exception as exc:
            return CheckResult(
                status=CheckStatus.UNHEALTHY,
                message=f"Check failed: {exc}",
            )
Issue: Log Correlation Not Working (No trace_id)¶
obskit injects trace_id and span_id into every log event only when an active span exists in the current context.
Diagnosis steps¶
from obskit.logging import get_logger
from obskit.tracing import setup_tracing
# Step 1: confirm tracing is initialised
setup_tracing(service_name="test", debug=True)
# Step 2: log inside a span
from opentelemetry import trace
logger = get_logger(__name__)
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("my-operation"):
    logger.info("inside span")  # should include trace_id and span_id
logger.info("outside span")  # will NOT have trace_id — this is correct
Common causes and fixes¶
# WRONG — logger captures context before tracing is active
from obskit.logging import get_logger
logger = get_logger(__name__)
from obskit.tracing import setup_tracing
setup_tracing(...)
# CORRECT — tracing first
from obskit.tracing import setup_tracing
setup_tracing(service_name="order-service")
from obskit.logging import get_logger
logger = get_logger(__name__)
Logs emitted outside a span will not have trace_id. This is the expected behaviour. Use the middleware or @with_span decorator to ensure spans wrap your request handlers.
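To see why the decorator guarantees correlation, it helps to look at the wrapping pattern itself. The sketch below is a generic stand-in, not obskit's real @with_span: `fake_span` imitates what `tracer.start_as_current_span(name)` does, so every call to the decorated function runs inside an active "span" and any log emitted inside it would pick up the context.

```python
from contextlib import contextmanager
from functools import wraps

@contextmanager
def fake_span(name):
    """Stand-in for tracer.start_as_current_span(name)."""
    print(f"span start: {name}")
    try:
        yield
    finally:
        print(f"span end: {name}")

def with_span(name):
    """Wrap a function so every call runs inside a named span."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            with fake_span(name):
                return func(*args, **kwargs)
        return wrapper
    return decorator

@with_span("create-order")
def create_order():
    # logs emitted here would carry the span's trace_id
    return "order-123"
```

obskit's actual decorator takes care of the tracer lookup for you; the point is only that logs outside the `with` block have no span context to inject.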
If you configured structlog manually without obskit's processor chain, the injection does not occur. Use obskit's unified setup:
# v1.0.0+ (recommended)
from obskit import configure_observability
obs = configure_observability(service_name="my-service")
# Logging, tracing, and metrics are all configured automatically.
# Legacy (still supported)
from obskit.logging.factory import configure_logging
configure_logging() # sets up the full processor chain including OTel injection
Issue: obskit Tracing Not Installed But Getting Import Errors¶
ModuleNotFoundError: No module named 'obskit.tracing'
Fix¶
# obskit tracing requires the otlp extra
pip install "obskit[otlp]"
# Or install the "all" extra, which includes every optional integration
pip install "obskit[all]"
# Verify installation
python -c "from obskit.tracing import setup_tracing; print('OK')"
Conditional import pattern¶
If tracing is optional in your codebase:
try:
    from obskit.tracing import setup_tracing
    _TRACING_AVAILABLE = True
except ImportError:
    _TRACING_AVAILABLE = False

if _TRACING_AVAILABLE:
    setup_tracing(service_name="order-service")
Issue: High Memory Usage from Metrics Cardinality¶
Prometheus cardinality explosions are one of the most common production problems. Each unique combination of label values creates a new time series in memory.
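Because the series count is the product of the distinct values of each label, one unbounded label dominates everything else. The arithmetic below uses illustrative numbers (not obskit output) to show how quickly a single counter blows up:

```python
# Distinct values observed per label on a single counter
label_cardinality = {
    "method": 5,          # GET, POST, PUT, PATCH, DELETE
    "status_code": 7,
    "path": 10_000,       # raw URLs with embedded IDs (the usual culprit)
}

series = 1
for distinct_values in label_cardinality.values():
    series *= distinct_values

print(series)  # 350000 time series from one metric
```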
Diagnosis¶
# Check cardinality of all metrics
curl -s http://localhost:9090/metrics | python -c "
import sys, collections
series = collections.Counter()
for line in sys.stdin:
    if line.startswith('#') or not line.strip():
        continue
    metric = line.split('{')[0].strip()
    series[metric] += 1
for metric, count in series.most_common(10):
    print(f'{count:6} {metric}')
"
Common causes and fixes¶
# WRONG — unique IDs in labels create cardinality explosion
# e.g. http_requests_total{path="/orders/uuid-1"}, {path="/orders/uuid-2"}, ...
metrics.request_count.labels(path=request.url.path)
# CORRECT — use route template, not concrete URL
metrics.request_count.labels(path="/orders/{order_id}")
# NEVER do this — one series per user = millions of series
metrics.api_calls.labels(user_id=current_user.id)
# Use aggregated dimensions instead
metrics.api_calls.labels(plan="premium")
Enable the built-in cardinality guard:
from obskit.metrics.cardinality import CardinalityGuard
guard = CardinalityGuard(
    max_series=10_000,  # alert if any metric exceeds this
    label_bounds={
        "status_code": {"200", "201", "400", "404", "429", "500", "503"},
        "method": {"GET", "POST", "PUT", "PATCH", "DELETE"},
        "environment": {"production", "staging", "development"},
    },
)
guard.install() # patches the default Prometheus registry
Any label value not in the allowed set is replaced with "__other__".
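Conceptually, the replacement rule is just a set-membership test. A minimal sketch of the behaviour (not obskit's actual implementation; `bound_label` is a hypothetical helper):

```python
def bound_label(value: str, allowed: set, fallback: str = "__other__") -> str:
    """Collapse any out-of-set label value into a single fallback bucket."""
    return value if value in allowed else fallback

allowed_codes = {"200", "201", "400", "404", "429", "500", "503"}
print(bound_label("200", allowed_codes))  # 200
print(bound_label("418", allowed_codes))  # __other__
```

This keeps the worst-case series count for the label at `len(allowed) + 1`, no matter what values callers pass in.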
Debug Mode: Printing Spans¶
Use debug=True in setup_tracing() to print every span to stdout. This is invaluable when you cannot access Grafana:
from obskit.tracing import setup_tracing
setup_tracing(
    service_name="order-service",
    debug=True,  # ConsoleSpanExporter — human-readable stdout
)
Never use debug=True in production
It bypasses the OTLP exporter and writes unstructured text to stdout, destroying log parsability and flooding your log aggregator.
python -m obskit.core.diagnose¶
Run the built-in diagnostic tool to get a complete picture of the effective configuration and connectivity:
python -m obskit.core.diagnose
Interpreting the output¶
obskit v1.0.0 — Diagnostic Report (2026-02-28T10:30:00Z)
==========================================================
Service
  name        : order-service ✓
  environment : production ✓
  version     : 2.1.0 ✓
Tracing
  enabled     : True
  endpoint    : http://tempo:4317
  reachable   : True ✓
  sample_rate : 0.1
  insecure    : False ✓
Metrics
  enabled     : True
  port        : 9090
  listening   : True ✓
  path        : /metrics
Logging
  level       : INFO
  format      : json
  backend     : structlog ✓
Health
  timeout     : 5.0 s
Packages installed
  obskit     1.0.0 ✓
  prometheus ✓
  otlp       ✓
  fastapi    ✓
Validation : PASS
Warning indicators¶
| Symbol | Meaning |
|---|---|
| ✓ | Check passed |
| ! | Warning — non-fatal issue |
| ✗ | Error — action required |
Log Output Format Debugging¶
If your log aggregator (Loki, Elasticsearch, Splunk) cannot parse obskit JSON logs:
# Test log output format
python - <<'EOF'
import os
os.environ["OBSKIT_LOG_FORMAT"] = "json"
os.environ["OBSKIT_LOG_LEVEL"] = "DEBUG"
from obskit.logging import get_logger
logger = get_logger("debug-test")
logger.info("test event", key="value", number=42)
EOF
Expected JSON output:
{
  "timestamp": "2026-02-28T10:30:00.123456Z",
  "level": "info",
  "logger": "debug-test",
  "event": "test event",
  "key": "value",
  "number": 42,
  "service": "unknown",
  "environment": "development",
  "version": "0.0.0"
}
If you see console format instead, check:
- `OBSKIT_LOG_FORMAT` environment variable is correctly set
- No other code is calling `logging.basicConfig()` before obskit initialises
- The `logging_backend` is `"structlog"` (loguru uses a different renderer)
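The second point is easy to verify at startup: if any handler is already attached to the root logger before obskit initialises, some other code configured logging first. A stdlib-only diagnostic sketch (the helper name is made up):

```python
import logging

def root_logger_preconfigured() -> bool:
    """True if something (e.g. logging.basicConfig) already attached handlers."""
    return bool(logging.root.handlers)

# Call this as early as possible in your entrypoint, before obskit imports:
if root_logger_preconfigured():
    print("WARNING: root logger already configured; obskit's renderer may be bypassed")
```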
Performance Profiling with Benchmarks¶
If you suspect obskit is adding latency to your hot path, run the micro-benchmarks:
# Install benchmark dependencies
pip install pytest-benchmark memory-profiler
# Run all benchmarks
cd benchmarks/
pytest bench_metrics.py bench_context.py -v
# Profile memory allocation
python bench_memory.py
# Run the macro benchmark (full stack)
python macro_runner.py --duration 60 --concurrency 50
Expected baseline numbers on a modern laptop (M2 MacBook Pro):
| Operation | p50 | p99 |
|---|---|---|
| logger.info() | 4 µs | 12 µs |
| counter.inc() | 0.8 µs | 2 µs |
| histogram.observe() | 1.2 µs | 4 µs |
| setup_tracing() (startup, once) | 50 ms | — |
| Context propagation per span | 5 µs | 20 µs |
If your numbers are significantly higher, check:
- Log format: `"json"` is slower than `"console"` — normal for production but avoid in tight loops.
- Trace sample rate: `1.0` in production with high throughput creates export back-pressure. Drop to `0.1`.
- Async queue depth: if `async_metric_queue_size` is too small, the queue fills and drops events; too large causes GC pressure.
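The drop-on-full behaviour in the last point can be reproduced with a plain bounded queue. This is a toy model of a non-blocking enqueue, not obskit's actual exporter internals:

```python
import queue

q = queue.Queue(maxsize=2)  # deliberately tiny queue
dropped = 0
for event in range(5):
    try:
        q.put_nowait(event)  # non-blocking, like an async metric enqueue
    except queue.Full:
        dropped += 1         # a full queue silently drops the event

print(f"enqueued={q.qsize()} dropped={dropped}")  # enqueued=2 dropped=3
```

Sizing the real queue is the same trade-off: a larger `maxsize` drops fewer events but holds more memory between export flushes.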
# Profile a specific hot path
python -c "
import cProfile, pstats, io
from obskit.logging import get_logger
logger = get_logger('bench')
pr = cProfile.Profile()
pr.enable()
for _ in range(100_000):
    logger.info('event', x=1)
pr.disable()
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(20)
print(s.getvalue())
"
Getting Help¶
If none of the above resolves your issue:
- Run `python -m obskit.core.diagnose` and include the output in your bug report.
- Check the GitHub Issues for similar reports.
- Open a new issue with: obskit version, Python version, OS, minimal reproduction script, and the diagnose output.
obskit self-metrics
When OBSKIT_ENABLE_SELF_METRICS=true, obskit exposes its own internal metrics at the same /metrics endpoint:
# Async metric queue depth (alert if > 80 % full)
obskit_async_metric_queue_depth / obskit_async_metric_queue_capacity
# Spans dropped due to full export queue
rate(obskit_spans_dropped_total[1m])
# OTLP export errors
rate(obskit_otlp_export_errors_total[5m])
If obskit_spans_dropped_total is rising, increase OBSKIT_TRACE_EXPORT_QUEUE_SIZE or reduce your sample rate.
Structured log search for common errors
Search Loki or Elasticsearch for obskit-emitted error events:
# Loki query for slow requests (> 1 s)
{app="order-service"} | json | duration_ms > 1000
# Loki query for all ERROR level events
{app="order-service"} | json | level = "error"
High obskit_async_metric_queue_depth
If this metric consistently exceeds 80 % of capacity, your metric recording rate exceeds the export rate. Remedies in order of preference:
- Reduce `OBSKIT_METRICS_SAMPLE_RATE` for high-frequency paths
- Increase `OBSKIT_ASYNC_METRIC_QUEUE_SIZE` (uses more memory)
- Reduce histogram bucket count (fewer buckets = faster processing)