# Performance Guide
obskit is designed for production use in high-throughput services. This page documents benchmark results, overhead budgets, and tuning recommendations.
## Benchmark Methodology

All benchmarks are run with pytest-benchmark in single-threaded mode on a fixed-frequency CPU (no turbo boost). Results are reported as minimum latency (best case) and operations per second.

```bash
# Run micro-benchmarks
pytest benchmarks/ --benchmark-only -p no:xdist -o addopts="" \
  --benchmark-columns=min,mean,median,stddev,ops \
  --benchmark-warmup=on --benchmark-min-rounds=50

# Run macro-benchmarks
python benchmarks/macro_runner.py --requests 10000 --workers 16

# Run memory benchmarks
python benchmarks/bench_memory.py
```
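For intuition, the min/ops columns that pytest-benchmark reports can be approximated with a plain timing loop. This is an illustrative sketch, not part of obskit's benchmark suite; `min_latency_and_ops` and `noop` are hypothetical names:

```python
import time

def min_latency_and_ops(fn, rounds=50, iterations=1000):
    """Return (best-case per-call latency in seconds, ops/s) for fn,
    mimicking pytest-benchmark's min and ops columns."""
    best = float("inf")
    for _ in range(rounds):
        start = time.perf_counter()
        for _ in range(iterations):
            fn()
        per_call = (time.perf_counter() - start) / iterations
        best = min(best, per_call)  # keep the best round, like the "min" column
    return best, 1.0 / best

def noop():
    pass

best, ops = min_latency_and_ops(noop)
print(f"min={best * 1e6:.3f} µs  ops/s={ops:,.0f}")
```

Taking the minimum over many rounds is what makes the numbers stable enough to gate on: it discards scheduler noise and reports the best case the hardware can do.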
## Go / No-Go Thresholds

These are the release gates — a PR that causes any metric to exceed its threshold is blocked until the regression is resolved.

### Micro-benchmark thresholds
| Benchmark | Min latency threshold | Ops/s threshold |
|---|---|---|
| `with_observability` sync (no-op) | ≤ 50 µs | ≥ 20,000 |
| `with_observability` async (no-op) | ≤ 50 µs | ≥ 20,000 |
| `with_observability` with exception | ≤ 100 µs | ≥ 10,000 |
| Decorator stack depth 3 | ≤ 150 µs | ≥ 6,000 |
| `SLOTracker.record_measurement()` | ≤ 5 µs | ≥ 200,000 |
| `SLOTracker.get_status()` (full window) | ≤ 20 µs | ≥ 50,000 |
| `logger.info()` | ≤ 20 µs | ≥ 50,000 |
| Correlation ID set + get | ≤ 1 µs | ≥ 1,000,000 |
| `REDMetrics.record_request()` | ≤ 5 µs | ≥ 200,000 |
### Macro-benchmark thresholds (p99)

10,000 requests, 16 workers, Zipf tenant distribution, lognormal latency.

| Scenario | p99 budget | Min req/s | Error rate |
|---|---|---|---|
| `metrics_only` | ≤ 50 µs | ≥ 50,000 | 0% |
| `logging_only` | ≤ 100 µs | ≥ 20,000 | 0% |
| `slo_only` | ≤ 20 µs | ≥ 100,000 | 0% |
| `full_stack` | ≤ 200 µs | ≥ 10,000 | 0% |
| `high_cardinality` (500 unique labels) | ≤ 500 µs | ≥ 1,000 | 0% |
### Memory thresholds

| Metric | Threshold |
|---|---|
| Per-call net allocation — `with_observability` | ≤ 2 KiB per 1,000 calls |
| SLO 1-hour rolling window (10,000 records) | ≤ 5 MB |
| SLO 0-second window with full eviction | ≤ 500 KB |
| Logger `bind()` + `info()` (1,000 calls) | ≤ 500 KB |
| Prometheus cardinality (500 unique label sets) | ≤ 10 MB |
| Leak detector delta (5,000 requests after warmup) | < 250 objects |
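The leak-detector gate in the last row can be approximated with the standard `gc` module: count live objects before and after a measured run, after a warmup phase has populated caches. This is a sketch of the idea, not obskit's actual detector; `object_count_delta` and `handler` are illustrative names:

```python
import gc

def object_count_delta(fn, warmup=500, measured=5000):
    """Rough leak check: growth in live-object count across `measured`
    calls, after `warmup` calls have filled one-time caches."""
    for _ in range(warmup):
        fn()
    gc.collect()
    before = len(gc.get_objects())
    for _ in range(measured):
        fn()
    gc.collect()
    return len(gc.get_objects()) - before

def handler():
    # Allocates only temporaries, so the delta should be near zero.
    payload = {"status": "success", "items": list(range(10))}
    return len(payload["items"])

delta = object_count_delta(handler)
print(delta)
```

A handler that retains a reference per call (say, appending to a module-level list) would show a delta near `measured` and trip the `< 250 objects` gate.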
## Per-Operation Overhead Reference
Use these numbers to reason about overhead at your traffic level.
| Operation | Overhead | Notes |
|---|---|---|
| `get_logger(__name__)` | ~2 µs (once) | Cached after first call |
| `logger.info("event", **kwargs)` | ~15–20 µs | Structlog pipeline: format, contextvars, JSON render |
| `REDMetrics.record_request()` | ~3–5 µs | `Counter.inc()` + `Histogram.observe()` + label lookup |
| `observe_with_exemplar()` | ~5–8 µs | Same as above + OTel span context read |
| `trace_span()` enter | ~2–5 µs | OTel span creation + context push |
| `trace_span()` exit (no error) | ~2–4 µs | Span end + attribute flush |
| `async_trace_span()` enter | ~3–6 µs | Async overhead + OTel span creation |
| `SLOTracker.record_measurement()` | ~3–5 µs | Lock + list append + eviction check |
| `CardinalityGuard.safe_label()` (cached) | ~1–2 µs | Dict lookup + counter check |
| `CardinalityGuard.safe_label()` (new label) | ~5–10 µs | Lock + dict insert |
| `set_baggage()` | < 1 µs | ContextVar set |
| `get_baggage()` | < 1 µs | ContextVar get |
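As a worked example of reasoning with this table, here is the back-of-envelope cost of a request that logs once, records RED metrics, opens one span, and records one SLO measurement. The figures are midpoints of the ranges above, so treat the result as an estimate, not a measurement:

```python
# Midpoint per-request overhead, in microseconds, from the table above.
per_request_us = {
    "logger.info": 17.5,            # ~15-20 µs
    "record_request": 4.0,          # ~3-5 µs
    "trace_span enter + exit": 6.5, # ~2-5 µs + ~2-4 µs
    "record_measurement": 4.0,      # ~3-5 µs
}
total_us = sum(per_request_us.values())

req_per_s = 1_000
# Seconds of overhead incurred per wall-clock second = fraction of one core.
cpu_fraction = total_us * 1e-6 * req_per_s

print(f"~{total_us:.1f} µs/request -> {cpu_fraction:.1%} of one core at {req_per_s} req/s")
```

At 1,000 req/s this instrumentation costs roughly 3% of one core; at 10,000 req/s the same math gives ~32%, which is why the sampling recommendations below kick in at higher traffic levels.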
## Sample Rate Recommendations

Choose your sampling rates based on traffic volume. The goal is to balance observability completeness with overhead.

### Trace sample rate (`OBSKIT_TRACE_SAMPLE_RATE`)
| Traffic level | Recommended rate | Rationale |
|---|---|---|
| < 10 req/s | 1.0 (100%) | Low volume; full trace coverage is cheap |
| 10–100 req/s | 1.0 (100%) | Still manageable; full coverage recommended |
| 100–1,000 req/s | 0.1 (10%) | Reduces Tempo storage by 10× |
| 1,000–10,000 req/s | 0.01 (1%) | Standard for high-throughput services |
| > 10,000 req/s | 0.001 (0.1%) | Use with head-based + tail-based sampling |
> **Always sample errors:** obskit samples 100% of error spans regardless of `trace_sample_rate`. Set `OBSKIT_TRACE_SAMPLE_RATE=0.01` and you will still see all errors.
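One common way to implement the errors-always-pass rule is to short-circuit the probabilistic gate for error spans. This is a sketch of the general technique, not obskit's internals; `should_sample` is an illustrative name:

```python
import random

def should_sample(sample_rate: float, is_error: bool, rng=random.random) -> bool:
    """Head-based sampling decision that never drops error spans."""
    if is_error:
        return True               # errors bypass the probabilistic gate
    return rng() < sample_rate    # everything else is kept with p = sample_rate

# Errors survive even at a 1% sample rate; successes are kept ~1% of the time.
assert should_sample(0.01, is_error=True)
```

The consequence for capacity planning: your error rate puts a floor under trace volume, so a service with a 1% error rate never traces less than ~1% of requests no matter how low you set the rate.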
### Log sample rate (`OBSKIT_LOG_SAMPLE_RATE`)
| Traffic level | Recommended rate | Notes |
|---|---|---|
| < 100 req/s | 1.0 | Log everything |
| 100–1,000 req/s | 0.1 | Sample INFO and below; always emit WARNING+ |
| > 1,000 req/s | 0.01 | Use AdaptiveSampler; auto-increases on errors |
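The AdaptiveSampler behavior referenced in the last row can be approximated as a base rate plus a burst window that opens after an error. This is a hypothetical sketch of the policy, not obskit's actual `AdaptiveSampler` implementation, and `AdaptiveLogSampler` is an illustrative name:

```python
import random

class AdaptiveLogSampler:
    """Sample INFO logs at base_rate, but log everything for `burst`
    records after an error is observed. WARNING+ always passes."""

    def __init__(self, base_rate: float = 0.01, burst: int = 100):
        self.base_rate = base_rate
        self.burst = burst
        self._remaining_burst = 0

    def record_error(self) -> None:
        self._remaining_burst = self.burst   # open the full-rate window

    def should_log(self, level: str) -> bool:
        if level in ("WARNING", "ERROR", "CRITICAL"):
            return True                      # always emit WARNING+
        if self._remaining_burst > 0:
            self._remaining_burst -= 1
            return True                      # full rate right after an error
        return random.random() < self.base_rate

sampler = AdaptiveLogSampler(base_rate=0.0)
sampler.record_error()
assert sampler.should_log("INFO")  # burst window: INFO passes after an error
```

The design intent is that quiet periods cost ~1% of full logging overhead, while incidents automatically get full-fidelity logs around the failure.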
### Metrics sample rate (`OBSKIT_METRICS_SAMPLE_RATE`)
In most cases, keep metrics at 1.0. Prometheus counters are inaccurate when
sampled. Only reduce if metrics collection itself is a bottleneck (unusual).
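A quick simulation shows why sampled counters are inaccurate: keeping 1% of increments and scaling by 100 produces estimates with large variance. The numbers here are illustrative:

```python
import random

random.seed(42)
true_count = 10_000   # actual number of increments
rate = 0.01           # hypothetical 1% metrics sample rate

# Keep each increment with probability `rate`, then scale by 1/rate
# to estimate the true counter value. Repeat to see the spread.
estimates = [
    sum(random.random() < rate for _ in range(true_count)) / rate
    for _ in range(5)
]
print(true_count, estimates)  # estimates scatter roughly ±1,000 around 10,000
```

For a counter of 10,000 the relative error is already around ±10%; for rare events (a counter of 100, say) a 1% sample may record zero increments entirely. Prometheus counters have no native scaling semantics, so keep metrics at 1.0.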
## Cardinality Budget Guidelines
Prometheus's memory usage scales linearly with the number of unique time series.
A single Histogram with 4 label combinations × 10 buckets = 40 time series.
Budget formula:
Total time series = Σ (unique_label_combinations × histogram_buckets)
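The budget formula can be applied per metric and summed. The metric names and label counts below are hypothetical example inputs, following the same simplified bucket accounting as the worked example above:

```python
# Applying the budget formula to a hypothetical service.
# Each entry: (metric name, unique_label_combinations, histogram_buckets).
# Use buckets = 1 for non-histogram metrics (counters, gauges).
metrics = [
    ("http_request_duration", 4, 10),  # the worked example above: 40 series
    ("http_requests_total",   4,  1),  # plain counter with the same labels
]
total_series = sum(combos * buckets for _, combos, buckets in metrics)
print(total_series)  # → 44
```

Run this over your full metric inventory and compare the total against the budgets in the table below.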
Recommended limits:
| Resource | Time series budget | Notes |
|---|---|---|
| Small service (2 GB Prometheus) | ≤ 100,000 | ~50 metrics × 2,000 label combos |
| Medium service (8 GB Prometheus) | ≤ 500,000 | Standard for mid-size deployments |
| Large service (32 GB Prometheus) | ≤ 2,000,000 | Requires tuned Prometheus config |
Practical rules:

- Never use `user_id`, `request_id`, or any unbounded value as a label.
- Use `CardinalityGuard(max_cardinality=500)` for any label derived from user input.
- Use at most 4–5 label dimensions per metric.
- Prefer low-cardinality enumerations: `status={"success","error"}`, `method={"GET","POST",…}`.
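The idea behind the `CardinalityGuard.safe_label()` rule can be sketched in a few lines: cap the number of distinct label values ever admitted, and map overflow to a fixed bucket. This is a minimal illustration, not obskit's actual implementation; `SimpleCardinalityGuard` is an invented name:

```python
class SimpleCardinalityGuard:
    """Admit at most max_cardinality distinct label values; any value
    beyond the budget is collapsed into a single overflow bucket."""

    def __init__(self, max_cardinality: int = 500, overflow: str = "other"):
        self.max_cardinality = max_cardinality
        self.overflow = overflow
        self._seen: set[str] = set()

    def safe_label(self, value: str) -> str:
        if value in self._seen:
            return value                  # cached path: set lookup only
        if len(self._seen) >= self.max_cardinality:
            return self.overflow          # budget exhausted: collapse
        self._seen.add(value)             # new-label path: admit and record
        return value

guard = SimpleCardinalityGuard(max_cardinality=2)
labels = [guard.safe_label(v) for v in ["a", "b", "c", "a"]]
print(labels)  # → ['a', 'b', 'other', 'a']
```

This keeps the Prometheus series count bounded even when a label is fed attacker-controlled or user-supplied values.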
## Async vs Sync Performance
obskit supports both async and sync code paths. Async is slightly higher-overhead due to event loop scheduling, but enables higher concurrency without blocking.
| Pattern | Latency | Concurrency | Use when |
|---|---|---|---|
| `trace_span()` (sync) | ~4 µs | Limited by threads | Django, Flask, Celery |
| `async_trace_span()` (async) | ~6 µs | Unlimited (cooperative) | FastAPI, aiohttp, async workers |
Recommendation: Use async APIs in async code and sync APIs in sync code.
Mixing (e.g., calling async from sync) requires asyncio.run() and has ~50 µs
overhead for event loop creation.
## Memory Footprint Per Component
Approximate RSS increase per component at steady state (after 10,000 requests):
| Component | RSS delta | Dominant contributor |
|---|---|---|
| obskit (core + logging) | ~5 MB | pydantic-settings model + structlog processor chain |
| `obskit[prometheus]` | ~5–50 MB | Prometheus registry (scales with cardinality) |
| `obskit[otlp]` | ~10 MB | OTel SDK + BatchSpanProcessor queue (2,048 spans) |
| obskit health module | ~1 MB | HealthChecker state + check registry |
| obskit slo module | ~2–20 MB | SLO measurement windows (scales with window size) |
| `obskit[kafka]` / `obskit[rabbitmq]` | ~2 MB | Kafka/RabbitMQ consumer metrics |
> **Prometheus cardinality dominates memory:** the `obskit[prometheus]` footprint depends almost entirely on how many unique label combinations exist. 1,000 unique time series ≈ ~1 MB; 10,000 ≈ ~10 MB.
## Production Tuning Tips

### Use `async_trace_span` in async code

```python
# Avoid: sync span in async context forces thread-local context
with trace_span("my_op"):  # OK but not ideal in async
    await do_work()

# Prefer: async context manager
async with async_trace_span("my_op"):
    await do_work()
```
### Disable debug mode in production

```bash
# Avoid in production — the console renderer is synchronous and slow
OBSKIT_LOG_FORMAT=console  # only for local development

# Use in production
OBSKIT_LOG_FORMAT=json
```
### Set sample rates for high-throughput services

```bash
# > 1,000 req/s
OBSKIT_TRACE_SAMPLE_RATE=0.01
OBSKIT_LOG_SAMPLE_RATE=0.01
```
### Reduce trace export batch timeout for lower-latency shutdown

```bash
OBSKIT_TRACE_EXPORT_TIMEOUT=5.0  # default 30s; reduce for faster pod shutdown
```
### Pin CPU frequency before benchmarking

```bash
# Linux
sudo cpupower frequency-set -g performance

# Pin to a single core to reduce jitter
taskset -c 2 pytest benchmarks/ --benchmark-only
```
### Profile hot paths with py-spy

```bash
pip install py-spy
py-spy record -o /tmp/obskit.svg --pid $(pgrep -f "uvicorn main:app") --duration 30
# Open /tmp/obskit.svg in a browser — identifies the actual bottleneck
```
## CI Performance Regression Check

Add this to your CI pipeline to catch regressions automatically:

```yaml
- name: Benchmark regression check
  run: |
    pytest benchmarks/ --benchmark-only -p no:xdist -o addopts="" \
      --benchmark-json=results/bench_pr.json \
      --benchmark-compare=results/bench_main.json \
      --benchmark-compare-fail=mean:10%
```
Regression policy:
| Delta vs baseline | Action |
|---|---|
| < 5% slower | Acceptable noise — pass |
| 5–10% slower | Review required — explain in PR |
| > 10% slower | Block merge — mandatory investigation |
| Memory leak detected | Block merge — must be fixed |
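The same policy table can be encoded directly, for example when post-processing benchmark results outside pytest-benchmark's built-in compare step. A sketch, with `classify_regression` as an invented helper name:

```python
def classify_regression(baseline_mean: float, candidate_mean: float) -> str:
    """Map a mean-latency delta versus baseline to the policy table above."""
    delta = (candidate_mean - baseline_mean) / baseline_mean
    if delta < 0.05:
        return "pass"     # < 5% slower (or faster): acceptable noise
    if delta <= 0.10:
        return "review"   # 5-10% slower: explain in the PR
    return "block"        # > 10% slower: mandatory investigation

# Example deltas against a 100 µs baseline mean:
assert classify_regression(100e-6, 103e-6) == "pass"    # +3%
assert classify_regression(100e-6, 108e-6) == "review"  # +8%
assert classify_regression(100e-6, 125e-6) == "block"   # +25%
```

Keeping the thresholds in one function makes it easy to apply the identical policy in CI, in local pre-merge checks, and in any dashboard that trends benchmark history.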