
SLO Tracking

A Service Level Objective (SLO) is a commitment to your users: "X% of requests will meet quality criteria Y over time window Z." obskit's SLOTracker records events, computes SLIs, tracks error budgets across multiple time windows, and generates Prometheus alert rules.


Quick Start

Python
from obskit.slo import SLOTracker, SLOWindow

tracker = SLOTracker(
    name="checkout_availability",
    objective=0.999,             # 99.9% availability
    windows=[SLOWindow.DAY, SLOWindow.MONTH],
)

# Record outcomes as requests complete
tracker.record_success()
tracker.record_failure(error_type="timeout")

# Inspect current status
report = tracker.get_report()
print(f"SLI: {report['sli']:.4%}")
print(f"Error budget remaining: {report['budget_remaining']:.1%}")

What Are SLOs and Error Budgets?

Service Level Indicator (SLI)

An SLI is the measured value of the quality criterion. Common SLIs:

| SLI type | Example definition |
|----------|--------------------|
| Availability | Fraction of requests that returned a non-5xx response |
| Latency | Fraction of requests that completed in < 300 ms |
| Error rate | Fraction of requests that did not return an error |
| Data freshness | Fraction of reads that returned data updated in < 1 hour |
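SLIs of the "fraction of good events" form reduce to simple counting. A minimal sketch of the availability case (an illustrative helper, not part of obskit):

```python
def availability_sli(total: int, errors_5xx: int) -> float:
    """Fraction of requests that returned a non-5xx response."""
    if total == 0:
        return 1.0  # no traffic: treat the objective as met
    return (total - errors_5xx) / total

print(availability_sli(10_000, 7))  # 0.9993
```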

Service Level Objective (SLO)

The SLO is the target for the SLI. If the SLI is "fraction of successful requests", an SLO of 99.9% means you target at most 1 failure per 1000 requests.

Error Budget

The error budget is 1 - SLO. At 99.9% availability over 30 days:

- Total events budget: 0.1% of requests can fail
- In time terms: 43.8 minutes of downtime allowed per month
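The budget arithmetic can be checked in a few lines (a standalone sketch, not obskit API):

```python
# Error budget arithmetic for a 99.9% availability SLO over a 30-day window.
SLO = 0.999
WINDOW_DAYS = 30

error_budget = 1 - SLO                               # fraction of events allowed to fail
budget_minutes = WINDOW_DAYS * 24 * 60 * error_budget

print(f"Error budget: {error_budget:.1%}")           # 0.1%
print(f"Downtime allowed: {budget_minutes:.1f} min")
```

Exactly 30 days gives 43.2 minutes; the often-quoted 43.8 minutes per month assumes an average 30.44-day month.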

Error budgets create a shared language between reliability and feature work:

- Budget is healthy → ship fast, take risks
- Budget is at 50% → review upcoming risky deployments
- Budget is exhausted → reliability sprint, freeze risky changes
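That policy can be encoded directly. A sketch with assumed cutoffs (the function name and thresholds are illustrative, not obskit API):

```python
def deployment_policy(budget_remaining: float) -> str:
    """Map error budget remaining (0.0-1.0) to a deployment posture."""
    if budget_remaining <= 0.0:
        return "freeze"   # budget exhausted: reliability work only
    if budget_remaining <= 0.5:
        return "review"   # budget at or below 50%: vet risky deployments
    return "ship"         # budget healthy: move fast

print(deployment_policy(0.8))   # ship
print(deployment_policy(0.5))   # review
print(deployment_policy(0.0))   # freeze
```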


SLOTracker API

Python
from obskit.slo import SLOTracker, SLOWindow

tracker = SLOTracker(
    name="checkout_availability",   # Unique name — used in metric labels
    objective=0.999,                # SLO target (0.0–1.0)
    windows=[                       # Track error budget in these windows
        SLOWindow.HOUR,             # 1 hour rolling window
        SLOWindow.DAY,              # 24 hour rolling window
        SLOWindow.WEEK,             # 7 day rolling window
        SLOWindow.MONTH,            # 30 day rolling window
    ],
    labels={"service": "checkout", "tier": "critical"},  # Extra Prometheus labels
)

SLOWindow

| Value | Duration | Use case |
|-------|----------|----------|
| SLOWindow.HOUR | 1 hour | Short-term burn rate alerts |
| SLOWindow.DAY | 24 hours | Daily health reviews |
| SLOWindow.WEEK | 7 days | Weekly reliability reviews |
| SLOWindow.MONTH | 30 days | Monthly SLA reporting |

Recording Events

Availability SLO

Python
# In your request handler:
try:
    result = await checkout(cart_id)
    tracker.record_success()
    return result
except Exception as exc:
    tracker.record_failure(
        error_type=type(exc).__name__,   # Recorded in metrics for root cause analysis
    )
    raise

Latency SLO

For latency SLOs, you define "success" as completing within a threshold:

Python
import time

tracker = SLOTracker(
    name="checkout_latency_p99",
    objective=0.99,   # 99% of requests must complete within the threshold
)

LATENCY_THRESHOLD = 0.5   # 500 ms

start = time.perf_counter()
result = await checkout(cart_id)
duration = time.perf_counter() - start

if duration < LATENCY_THRESHOLD:
    tracker.record_success()
else:
    tracker.record_failure(error_type="latency_exceeded")

Using the with_slo_tracking decorator

Python
from obskit.slo import with_slo_tracking

tracker = SLOTracker(name="checkout_availability", objective=0.999)

@with_slo_tracking(tracker)
async def checkout(cart_id: str) -> dict:
    return await _do_checkout(cart_id)

The decorator records success/failure based on whether an exception is raised.
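Under the hood, such a decorator just wraps the call in a try/except. An illustrative sketch of the idea (this is not obskit's actual implementation, and it assumes a tracker with the `record_success`/`record_failure` methods shown above):

```python
import functools

def with_slo_tracking_sketch(tracker):
    """Sketch of an SLO-tracking decorator for async callables."""
    def decorate(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                result = await func(*args, **kwargs)
            except Exception as exc:
                # Any raised exception counts against the SLO.
                tracker.record_failure(error_type=type(exc).__name__)
                raise
            tracker.record_success()
            return result
        return wrapper
    return decorate
```

The sketch handles async callables only, matching the example above; a sync variant would drop the `async`/`await` pair.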


Inspecting SLO Status

get_report()

Python
report = tracker.get_report()

Returns a dict with current SLI and budget status for all configured windows:

Python
{
    "name": "checkout_availability",
    "objective": 0.999,
    "sli": 0.9993,                    # Current SLI (trailing 30-day window)
    "budget_remaining": 0.3,          # 30% of the 30-day error budget remaining
    "is_within_slo": True,
    "windows": {
        "1h": {
            "sli": 0.9991,
            "budget_remaining": 0.13,
            "requests_total": 48201,
            "failures_total": 42,
        },
        "24h": {
            "sli": 0.9993,
            "budget_remaining": 0.3,
            "requests_total": 1152420,
            "failures_total": 807,
        },
        "7d": {
            "sli": 0.9994,
            "budget_remaining": 0.4,
            "requests_total": 8067140,
            "failures_total": 4840,
        },
        "30d": {
            "sli": 0.9993,
            "budget_remaining": 0.3,
            "requests_total": 34602600,
            "failures_total": 24222,
        },
    },
}
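The per-window structure makes it easy to drive a status summary. A small sketch that walks the report (the dict shape is assumed from the example above; the numbers are illustrative):

```python
report = {
    "name": "checkout_availability",
    "objective": 0.999,
    "windows": {
        "1h":  {"sli": 0.9991, "budget_remaining": 0.13},
        "30d": {"sli": 0.9993, "budget_remaining": 0.30},
    },
}

lines = []
for window, stats in report["windows"].items():
    # A window is healthy when its SLI meets or exceeds the objective.
    status = "OK" if stats["sli"] >= report["objective"] else "VIOLATED"
    lines.append(f"{window:>3}: SLI={stats['sli']:.4%}, "
                 f"budget={stats['budget_remaining']:.0%} [{status}]")

print("\n".join(lines))
```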

Prometheus Metrics

obskit automatically exposes SLO data as Prometheus metrics:

Text Only
# Current SLI per window
obskit_slo_sli{name="checkout_availability", window="1h"}   0.9991
obskit_slo_sli{name="checkout_availability", window="24h"}  0.9993

# Error budget remaining (ratio)
obskit_slo_budget_remaining{name="checkout_availability", window="1h"}  0.13
obskit_slo_budget_remaining{name="checkout_availability", window="30d"} 0.30

# Event counters (for computing SLI externally with PromQL)
obskit_slo_requests_total{name="checkout_availability"}  34602600
obskit_slo_failures_total{name="checkout_availability", error_type="timeout"} 18442
obskit_slo_failures_total{name="checkout_availability", error_type="PaymentError"} 5780

Multi-Window Alerting (Burn Rate)

The most effective SLO alerting strategy uses burn rate: how fast is the error budget being consumed relative to the budget replenishment rate?

A burn rate of 1.0 means budget is being consumed at exactly the rate it replenishes: you will land exactly on the SLO at the end of the window. A burn rate of 14.4 means a 30-day budget will be exhausted in about 50 hours (720 h / 14.4); sustained for just one hour, it burns 2% of the monthly budget, which is why 14.4 is the classic "page immediately" threshold.
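The burn rate arithmetic in a self-contained sketch (not part of obskit):

```python
def burn_rate(failures: int, total: int, objective: float) -> float:
    """Observed error rate divided by the allowed error rate (1 - objective)."""
    return (failures / total) / (1 - objective)

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, the budget lasts window / rate."""
    return window_days * 24 / rate

# 1.44% of requests failing against a 99.9% objective = 14.4x burn rate.
br = burn_rate(failures=144, total=10_000, objective=0.999)
print(f"burn rate: {br:.1f}x")                                # 14.4x
print(f"budget gone in {hours_to_exhaustion(br):.0f} hours")  # 50 hours
```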

Alert rules generated by obskit

Python
from obskit.slo import SLOTracker, SLOWindow
from obskit.alerts.rules_generator import generate_alert_rules

tracker = SLOTracker(
    name="checkout_availability",
    objective=0.999,
    windows=[SLOWindow.HOUR, SLOWindow.DAY, SLOWindow.WEEK, SLOWindow.MONTH],
)

rules = generate_alert_rules(tracker)
print(rules.to_yaml())

Generated Prometheus alert rules:

YAML
groups:
  - name: slo.checkout_availability
    rules:
      # Page immediately — burning budget very fast
      - alert: CheckoutAvailabilityBurnRateCritical
        expr: >
          (
            increase(obskit_slo_failures_total{name="checkout_availability"}[1h])
            / increase(obskit_slo_requests_total{name="checkout_availability"}[1h])
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
          slo: checkout_availability
        annotations:
          summary: "Checkout SLO burning at 14.4x; budget exhausted in ~2 days at this rate"
          runbook_url: "https://runbooks.example.com/slo/checkout"

      # Wake up but do not page — burning fast but not critically
      - alert: CheckoutAvailabilityBurnRateHigh
        expr: >
          (
            increase(obskit_slo_failures_total{name="checkout_availability"}[6h])
            / increase(obskit_slo_requests_total{name="checkout_availability"}[6h])
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: checkout_availability
        annotations:
          summary: "Checkout SLO burning at 6x; investigate within 1 hour"

      # Budget nearly exhausted
      - alert: CheckoutAvailabilityBudgetLow
        expr: obskit_slo_budget_remaining{name="checkout_availability", window="30d"} < 0.1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Checkout error budget < 10% remaining for the month"

Integration with Health Checks

Surface SLO status in your /health/ready endpoint:

Python
from obskit.health import HealthChecker, HealthStatus
from obskit.health.checker import HealthResult
from obskit.slo import SLOTracker

tracker = SLOTracker(name="checkout_availability", objective=0.999)

async def slo_health_check() -> HealthResult:
    report = tracker.get_report()
    budget = report["budget_remaining"]

    if not report["is_within_slo"]:
        return HealthResult(
            status=HealthStatus.unhealthy,
            message=f"SLO violated: SLI={report['sli']:.4%} < objective={report['objective']:.4%}",
        )
    if budget < 0.10:
        return HealthResult(
            status=HealthStatus.degraded,
            message=f"Error budget critical: {budget:.0%} remaining",
        )
    return HealthResult(
        status=HealthStatus.healthy,
        message=f"SLI {report['sli']:.4%} ({budget:.0%} budget remaining)",
    )

checker = HealthChecker(service="checkout")
checker.add_check("slo_checkout", slo_health_check, critical=False)

Grafana Dashboard

Key panels for SLO dashboards

SLI over the dashboard time range:

PromQL
1 - (
  increase(obskit_slo_failures_total{name="checkout_availability"}[$__range])
  /
  increase(obskit_slo_requests_total{name="checkout_availability"}[$__range])
)

Burn rate over the last hour (dividing by 0.001, the allowed error rate, converts the observed error rate into a burn rate):

PromQL
(
  rate(obskit_slo_failures_total{name="checkout_availability"}[1h])
  /
  rate(obskit_slo_requests_total{name="checkout_availability"}[1h])
) / 0.001

Error budget remaining for the 30-day window:

PromQL
obskit_slo_budget_remaining{name="checkout_availability", window="30d"}

Failure rate broken down by error type:

PromQL
sum by (error_type) (
  rate(obskit_slo_failures_total{name="checkout_availability"}[1h])
)

Dashboard annotations

Mark deployments and incidents on your SLO dashboards to correlate changes with budget burn:

Python
# In your deployment pipeline:
import time

import requests

requests.post("http://grafana:3000/api/annotations", json={
    "tags": ["deployment", "checkout-service"],
    "text": "Deployed checkout-service v2.1.0",
    "time": int(time.time() * 1000),   # Grafana expects epoch milliseconds
})

Defining Good SLOs

Start conservative

It is easier to tighten an SLO than to loosen it. Start at 99% and tighten based on observed SLI data over 3–6 months.

Match user expectations

Ask: "What level of reliability do users notice?" Users rarely notice 99.9% → 99% degradation. They do notice 99% → 95%.

Cover latency, not just availability

A service that answers every request, but with a p99 latency of 500 ms, is technically "available" yet still delivers a poor user experience. Include latency SLOs alongside availability:

Python
availability_tracker = SLOTracker(
    name="checkout_availability",
    objective=0.999,
)
latency_tracker = SLOTracker(
    name="checkout_latency_p95",
    objective=0.95,   # 95% of requests under 300ms
)

SLO Naming Conventions

| Name | Objective | Window | SLI definition |
|------|-----------|--------|----------------|
| checkout_availability | 99.9% | 30d | Non-5xx responses / total responses |
| checkout_latency_p95 | 95% | 30d | Responses < 300 ms / total responses |
| search_latency_p99 | 99% | 30d | Responses < 1000 ms / total responses |
| payment_success_rate | 99.5% | 30d | Successful payment transactions / total attempts |