SLO Tracking¶
A Service Level Objective (SLO) is a commitment to your users: "X% of requests will meet quality criteria Y over time window Z." obskit's SLOTracker records events, computes SLIs, tracks error budgets across multiple time windows, and generates Prometheus alert rules.
Quick Start¶
```python
from obskit.slo import SLOTracker, SLOWindow

tracker = SLOTracker(
    name="checkout_availability",
    objective=0.999,  # 99.9% availability
    windows=[SLOWindow.DAY, SLOWindow.MONTH],
)

# Record outcomes as requests complete
tracker.record_success()
tracker.record_failure(error_type="timeout")

# Inspect current status
report = tracker.get_report()
print(f"SLI: {report['sli']:.4%}")
print(f"Error budget remaining: {report['budget_remaining']:.1%}")
```
What Are SLOs and Error Budgets?¶
Service Level Indicator (SLI)¶
An SLI is the measured value of the quality criterion. Common SLIs:
| SLI type | Example definition |
|---|---|
| Availability | Fraction of requests that returned a non-5xx response |
| Latency | Fraction of requests that completed in < 300 ms |
| Error rate | Fraction of requests that completed without an error (the SLI is the success fraction, 1 − error rate) |
| Data freshness | Fraction of reads that returned data updated in < 1 hour |
Service Level Objective (SLO)¶
The SLO is the target for the SLI. If the SLI is "fraction of successful requests", an SLO of 99.9% means you target at most 1 failure per 1000 requests.
Error Budget¶
The error budget is 1 − SLO. At 99.9% availability over 30 days:
- Total events budget: 0.1% of requests can fail
- In time terms: 43.2 minutes of downtime allowed per 30-day window
Error budgets create a shared language between reliability and feature work:

- Budget is healthy → ship fast, take risks
- Budget is at 50% → review upcoming risky deployments
- Budget is exhausted → reliability sprint, freeze risky changes
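The time-based figures above follow from simple arithmetic; a quick way to sanity-check an objective (plain Python, independent of obskit):

```python
def allowed_downtime_minutes(objective: float, window_days: int = 30) -> float:
    """Minutes of full downtime permitted per window at a given objective."""
    return (1 - objective) * window_days * 24 * 60

for obj in (0.99, 0.999, 0.9999):
    print(f"{obj:.2%}: {allowed_downtime_minutes(obj):.1f} min / 30 days")
```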
SLOTracker API¶
```python
from obskit.slo import SLOTracker, SLOWindow

tracker = SLOTracker(
    name="checkout_availability",  # Unique name — used in metric labels
    objective=0.999,               # SLO target (0.0–1.0)
    windows=[                      # Track error budget in these windows
        SLOWindow.HOUR,            # 1 hour rolling window
        SLOWindow.DAY,             # 24 hour rolling window
        SLOWindow.WEEK,            # 7 day rolling window
        SLOWindow.MONTH,           # 30 day rolling window
    ],
    labels={"service": "checkout", "tier": "critical"},  # Extra Prometheus labels
)
```
SLOWindow¶
| Value | Duration | Use case |
|---|---|---|
| `SLOWindow.HOUR` | 1 hour | Short-term burn rate alerts |
| `SLOWindow.DAY` | 24 hours | Daily health reviews |
| `SLOWindow.WEEK` | 7 days | Weekly reliability reviews |
| `SLOWindow.MONTH` | 30 days | Monthly SLA reporting |
Recording Events¶
Availability SLO¶
```python
# In your request handler:
try:
    result = await checkout(cart_id)
    tracker.record_success()
    return result
except Exception as exc:
    tracker.record_failure(
        error_type=type(exc).__name__,  # Recorded in metrics for root-cause analysis
    )
    raise
```
Latency SLO¶
For latency SLOs, you define "success" as completing within a threshold:
```python
import time

tracker = SLOTracker(
    name="checkout_latency_p99",
    objective=0.99,  # 99% of requests must complete within the threshold
)

LATENCY_THRESHOLD = 0.5  # 500 ms

start = time.perf_counter()
result = await checkout(cart_id)
duration = time.perf_counter() - start

if duration < LATENCY_THRESHOLD:
    tracker.record_success()
else:
    tracker.record_failure(error_type="latency_exceeded")
```
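The timing pattern above can be wrapped in a small context manager so handlers stay clean. This helper is not part of obskit; it is a sketch that assumes only the `record_success`/`record_failure` API shown above:

```python
import time
from contextlib import contextmanager


@contextmanager
def track_latency_slo(tracker, threshold_s: float):
    """Record success if the body finishes under threshold_s, failure otherwise."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        # Exceptions count against the SLO, tagged with the exception class
        tracker.record_failure(error_type=type(exc).__name__)
        raise
    duration = time.perf_counter() - start
    if duration < threshold_s:
        tracker.record_success()
    else:
        tracker.record_failure(error_type="latency_exceeded")
```

Inside an async handler this reads as `with track_latency_slo(tracker, 0.5): result = await checkout(cart_id)`.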
Using the with_slo_tracking decorator¶
```python
from obskit.slo import with_slo_tracking

tracker = SLOTracker(name="checkout_availability", objective=0.999)

@with_slo_tracking(tracker)
async def checkout(cart_id: str) -> dict:
    return await _do_checkout(cart_id)
```
The decorator records success/failure based on whether an exception is raised.
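Under the hood this amounts to a try/except around the call. A rough equivalent, written here as a sketch assuming async callables and the record API shown earlier (not obskit's actual source):

```python
import functools


def slo_tracked(tracker):
    """Rough equivalent of with_slo_tracking for async functions (sketch)."""
    def decorate(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            try:
                result = await fn(*args, **kwargs)
            except Exception as exc:
                # Failure is tagged with the exception class name
                tracker.record_failure(error_type=type(exc).__name__)
                raise
            tracker.record_success()
            return result
        return wrapper
    return decorate
```

Applied as `@slo_tracked(tracker)`, it records a success when the call returns and a failure when it raises, then re-raises so callers still see the exception.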
Inspecting SLO Status¶
get_report()¶
```python
report = tracker.get_report()
```
Returns a dict with current SLI and budget status for all configured windows:
```python
{
    "name": "checkout_availability",
    "objective": 0.999,
    "sli": 0.9993,            # Current SLI (trailing 30-day window)
    "budget_remaining": 0.3,  # 30% of the 30-day error budget remaining
    "is_within_slo": True,
    "windows": {
        "1h": {
            "sli": 0.9991,
            "budget_remaining": 0.13,
            "requests_total": 48201,
            "failures_total": 42,
        },
        "24h": {
            "sli": 0.9993,
            "budget_remaining": 0.3,
            "requests_total": 1152420,
            "failures_total": 807,
        },
        "7d": {
            "sli": 0.9994,
            "budget_remaining": 0.4,
            "requests_total": 8067140,
            "failures_total": 4840,
        },
        "30d": {
            "sli": 0.9993,
            "budget_remaining": 0.3,
            "requests_total": 34602600,
            "failures_total": 24222,
        },
    },
}
```
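With the report in hand it is easy to, for example, find the window burning fastest. A quick illustration against a stubbed dict shaped like the report above:

```python
# Stub shaped like tracker.get_report()["windows"] (values are illustrative)
report = {
    "windows": {
        "1h":  {"budget_remaining": 0.1},
        "24h": {"budget_remaining": 0.3},
        "7d":  {"budget_remaining": 0.4},
        "30d": {"budget_remaining": 0.3},
    },
}

# The window with the least budget left is the one burning fastest
worst_window, worst = min(
    report["windows"].items(),
    key=lambda kv: kv[1]["budget_remaining"],
)
print(worst_window, worst["budget_remaining"])  # → 1h 0.1
```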
Prometheus Metrics¶
obskit automatically exposes SLO data as Prometheus metrics:
```
# Current SLI per window
obskit_slo_sli{name="checkout_availability", window="1h"} 0.9991
obskit_slo_sli{name="checkout_availability", window="24h"} 0.9993

# Error budget remaining (ratio)
obskit_slo_budget_remaining{name="checkout_availability", window="1h"} 0.13
obskit_slo_budget_remaining{name="checkout_availability", window="30d"} 0.30

# Event counters (for computing the SLI externally with PromQL)
obskit_slo_requests_total{name="checkout_availability"} 34602600
obskit_slo_failures_total{name="checkout_availability", error_type="timeout"} 18442
obskit_slo_failures_total{name="checkout_availability", error_type="PaymentError"} 5780
```
Multi-Window Alerting (Burn Rate)¶
The most effective SLO alerting strategy uses burn rate: how fast is the error budget being consumed relative to the budget replenishment rate?
A burn rate of 1.0 means budget is being consumed at exactly the rate it replenishes (you will exactly hit the SLO at the end of the window). A burn rate of 14.4 against a 30-day SLO consumes 2% of the monthly budget per hour and exhausts it in about 50 hours, which is a common threshold for paging.
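The arithmetic behind those thresholds, in plain Python:

```python
WINDOW_HOURS = 30 * 24  # 720 h in a 30-day SLO window

def burn_rate(observed_error_rate: float, objective: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_rate / (1 - objective)

def hours_to_exhaustion(rate: float) -> float:
    """Hours until the full budget is gone at a constant burn rate."""
    return WINDOW_HOURS / rate

# A 1.44% error rate against a 99.9% objective:
r = burn_rate(0.0144, 0.999)
print(r, hours_to_exhaustion(r))  # ≈ 14.4x, ≈ 50 h
```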
Alert rules generated by obskit¶
```python
from obskit.slo import SLOTracker, SLOWindow
from obskit.alerts.rules_generator import generate_alert_rules

tracker = SLOTracker(
    name="checkout_availability",
    objective=0.999,
    windows=[SLOWindow.HOUR, SLOWindow.DAY, SLOWindow.WEEK, SLOWindow.MONTH],
)

rules = generate_alert_rules(tracker)
print(rules.to_yaml())
```
Generated Prometheus alert rules:
```yaml
groups:
  - name: slo.checkout_availability
    rules:
      # Page immediately — burning budget very fast
      - alert: CheckoutAvailabilityBurnRateCritical
        expr: >
          (
            sum(increase(obskit_slo_failures_total{name="checkout_availability"}[1h]))
            / sum(increase(obskit_slo_requests_total{name="checkout_availability"}[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
          slo: checkout_availability
        annotations:
          summary: "Checkout SLO burning at 14.4x — budget exhausted in ~2 days"
          runbook_url: "https://runbooks.example.com/slo/checkout"
      # Wake up but do not page — burning fast but not critically
      - alert: CheckoutAvailabilityBurnRateHigh
        expr: >
          (
            sum(increase(obskit_slo_failures_total{name="checkout_availability"}[6h]))
            / sum(increase(obskit_slo_requests_total{name="checkout_availability"}[6h]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: checkout_availability
        annotations:
          summary: "Checkout SLO burning at 6x — investigate within 1 hour"
      # Budget nearly exhausted
      - alert: CheckoutAvailabilityBudgetLow
        expr: obskit_slo_budget_remaining{name="checkout_availability", window="30d"} < 0.1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Checkout error budget < 10% remaining for the month"
```
Integration with Health Checks¶
Surface SLO status in your /health/ready endpoint:
```python
from obskit.health import HealthChecker, HealthStatus
from obskit.health.checker import HealthResult
from obskit.slo import SLOTracker

tracker = SLOTracker(name="checkout_availability", objective=0.999)

async def slo_health_check() -> HealthResult:
    report = tracker.get_report()
    budget = report["budget_remaining"]
    if not report["is_within_slo"]:
        return HealthResult(
            status=HealthStatus.unhealthy,
            message=f"SLO violated: SLI={report['sli']:.4%} < objective={report['objective']:.4%}",
        )
    if budget < 0.10:
        return HealthResult(
            status=HealthStatus.degraded,
            message=f"Error budget critical: {budget:.0%} remaining",
        )
    return HealthResult(
        status=HealthStatus.healthy,
        message=f"SLI {report['sli']:.4%} ({budget:.0%} budget remaining)",
    )

checker = HealthChecker(service="checkout")
checker.add_check("slo_checkout", slo_health_check, critical=False)
```
Grafana Dashboard¶
Key panels for SLO dashboards¶
SLI over the dashboard range:

```promql
1 - (
  sum(increase(obskit_slo_failures_total{name="checkout_availability"}[$__range]))
  /
  sum(increase(obskit_slo_requests_total{name="checkout_availability"}[$__range]))
)
```

Burn rate over the last hour:

```promql
(
  sum(rate(obskit_slo_failures_total{name="checkout_availability"}[1h]))
  /
  sum(rate(obskit_slo_requests_total{name="checkout_availability"}[1h]))
) / 0.001  # Divide by error budget = 1 - objective
```

Error budget remaining (30-day window):

```promql
obskit_slo_budget_remaining{name="checkout_availability", window="30d"}
```

Failures by error type:

```promql
sum by (error_type) (
  rate(obskit_slo_failures_total{name="checkout_availability"}[1h])
)
```
Dashboard annotations¶
Mark deployments and incidents on your SLO dashboards to correlate changes with budget burn:
```python
# In your deployment pipeline:
import time

import requests

requests.post("http://grafana:3000/api/annotations", json={
    "tags": ["deployment", "checkout-service"],
    "text": "Deployed checkout-service v2.1.0",
    "time": int(time.time() * 1000),  # Grafana expects epoch milliseconds
})
```
Defining Good SLOs¶
Start conservative¶
It is easier to tighten an SLO than to loosen it. Start at 99% and tighten based on observed SLI data over 3–6 months.
Match user expectations¶
Ask: "What level of reliability do users notice?" Users rarely notice 99.9% → 99% degradation. They do notice 99% → 95%.
Cover latency, not just availability¶
A service that always responds with 500ms p99 is technically "available" but provides poor user experience. Include latency SLOs:
```python
availability_tracker = SLOTracker(
    name="checkout_availability",
    objective=0.999,
)

latency_tracker = SLOTracker(
    name="checkout_latency_p95",
    objective=0.95,  # 95% of requests under 300 ms
)
```
SLO Naming Conventions¶
| Name | Objective | Window | SLI definition |
|---|---|---|---|
| `checkout_availability` | 99.9% | 30d | Non-5xx responses / total responses |
| `checkout_latency_p95` | 95% | 30d | Responses < 300 ms / total responses |
| `search_latency_p99` | 99% | 30d | Responses < 1000 ms / total responses |
| `payment_success_rate` | 99.5% | 30d | Successful payment transactions / total attempts |