LLMScope is a reference implementation of an LLM control plane that answers three questions most production teams cannot answer: what did that request cost, which model served it, and why. It demonstrates gateway-based cost attribution, OpenTelemetry instrumentation aligned with GenAI semantic conventions, a provider-agnostic abstraction, a versioned envelope contract for the request lifecycle, deterministic evaluation, and offline reporting — built as an inspectable engineering artifact, not a demo.
Production LLM systems require cost visibility, model routing transparency, and request-level attribution. Most implementations scatter provider calls across route handlers, making complete telemetry impossible. LLMScope enforces a single gateway choke point (call_llm()) through which all provider traffic flows, enabling accurate cost tracking, structured context propagation, and deterministic offline analysis without runtime dependencies on external observability vendors.
This repository demonstrates the control-plane layer for accountable AI systems: the instrumentation, attribution, and evidence infrastructure that makes LLM operations measurable and auditable.
Implemented and tested:

- Gateway choke point — `src/llmscope/gateway/client.py` is the sole path for all LLM provider calls. No route handler calls providers directly. This architectural invariant makes cost attribution complete.
- Provider abstraction — `ProviderBase` defines a minimal contract (2 abstract methods, 2 optional). `OpenAIProvider` and `AnthropicProvider` are built-in. Adding Google, Bedrock, or custom providers requires subclassing `ProviderBase` and registering — no changes to gateway or route logic.
- Versioned envelope contract — `LLMRequestEnvelope` (v0.1.0) defines 6 semantic blocks: identity/context, model selection, economics, reliability, governance, cache/eval. Every gateway call constructs an envelope and serializes it to JSONL, making local artifacts a direct representation of the typed contract.
- Structured attribution context — `LLMRequestContext` propagates tenant_id, caller_id, use_case, feature_id, experiment_id, and budget_namespace through the call stack and into telemetry without framework coupling.
- OpenTelemetry instrumentation — each gateway call emits an OTel CLIENT span (nested under the FastAPI SERVER span) and records 4 metric instruments: `gen_ai.client.token.usage`, `gen_ai.client.operation.duration`, `llm_gateway.estimated_cost_usd`, `llm_gateway.requests`. Backend-agnostic via OTLP.
- Semantic convention isolation — `src/llmscope/gateway/semconv.py` centralizes GenAI attribute names and implements `resolve_attrs()` for dual-emission migration via `OTEL_SEMCONV_STABILITY_OPT_IN`. No other module imports from `opentelemetry.semconv._incubating`.
- Dual telemetry emission — in parallel with OTel, each call serializes the envelope as JSON to `artifacts/logs/telemetry.jsonl`. Supports offline analysis without a running collector.
- Deterministic cost model — local pricing snapshot for OpenAI and Anthropic. Per-request USD estimation without external billing lookup. Reproducible and auditable.
- Keyword-based routing — deterministic complexity classifier assigns model tiers (cheap/expensive) based on keywords and message length. Intentionally simple and inspectable.
- Context budget management — token counting via tiktoken, configurable strategies (full, sliding_window, summarized), raises `ContextTooLargeError` when the budget is exceeded.
- Dataset-driven evaluation — 3 eval runners (classify, answer_routed, conversation_turn) using FastAPI TestClient with deterministic mocks. Checks schema compliance, required fields, and routing metadata. Not semantic evaluation.
- Deterministic reporting — markdown report generator from JSONL telemetry. Per-route aggregates, Pareto analysis, before/after comparison mode. Example artifacts included (10 synthetic events + generated report).
- Auth and rate limiting — API key validation via the X-API-Key header (constant-time comparison), per-caller sliding-window rate limiting (in-memory deque).

257 tests, GitHub Actions CI (ruff, mypy, pytest).
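The provider contract above can be sketched in miniature. This is a hypothetical illustration, not the library's code: the method names `provider_name`, `complete`, `is_retryable`, and `categorize_error` come from the stability notes below, but the signatures, the `ProviderResponse` fields, and the `EchoProvider` class are assumptions; the real definitions live in `src/llmscope/gateway/provider.py`.

```python
# Illustrative sketch of the ProviderBase contract; exact signatures in the
# real src/llmscope/gateway/provider.py may differ.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ProviderResponse:
    """Stand-in for the library's response type (fields are assumptions)."""
    text: str
    input_tokens: int
    output_tokens: int


class ProviderBase(ABC):
    # 2 abstract methods every provider must implement.
    @abstractmethod
    def provider_name(self) -> str: ...

    @abstractmethod
    def complete(self, model: str, messages: list[dict]) -> ProviderResponse: ...

    # 2 optional hooks with safe defaults, overridable per provider.
    def is_retryable(self, error: Exception) -> bool:
        return False

    def categorize_error(self, error: Exception) -> str:
        return "unknown"


class EchoProvider(ProviderBase):
    """Toy provider showing the subclassing shape; not a real backend."""

    def provider_name(self) -> str:
        return "echo"

    def complete(self, model: str, messages: list[dict]) -> ProviderResponse:
        text = messages[-1]["content"]
        n = len(text.split())
        return ProviderResponse(text=text, input_tokens=n, output_tokens=n)
```

Under this shape, adding a new backend touches only the new subclass and its registration; gateway and route logic stay unchanged.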
Current implementation (v0.1.0):

```
HTTP Request
└── APIKeyMiddleware         X-API-Key validation, 401 on failure
    └── RateLimitMiddleware      per-key sliding window, 429 on overflow
        └── FastAPIInstrumentor      OTel SERVER span
            └── Route Handler
                ├── /classify-complexity → determine_complexity()  [no LLM call]
                ├── /answer-routed       → determine_complexity() → call_llm()
                └── /conversation-turn   → prepare_context() → call_llm()

call_llm()  [src/llmscope/gateway/client.py]
├── RoutePolicy lookup (tier → model mapping)
├── LLMRequestContext resolution
├── OTel CLIENT span start
├── Retry loop with exponential backoff
├── ProviderBase.complete() → OpenAI/Anthropic
├── estimate_cost()
├── LLMRequestEnvelope construction
└── emit() → OTel metrics + JSONL
```
The non-negotiable invariant: every LLM provider call passes through call_llm(). No route, service, or middleware calls provider APIs directly.
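The choke-point pattern can be shown in miniature. This sketch is not the library's API: the real `call_llm()` signature, envelope fields, and emit path live in `src/llmscope/gateway/client.py`, and everything below (the `sink` list, the simplified envelope dict, the fake provider) is an illustrative assumption. The point it demonstrates is structural: because there is exactly one call site for providers, every call produces exactly one telemetry record.

```python
# Miniature sketch of the gateway choke-point pattern; field and function
# shapes are illustrative, not LLMScope's actual API.
import json
import time
import uuid


def call_llm(provider, model: str, messages: list[dict], sink: list) -> str:
    """Single entry point: all provider traffic flows through here, so
    cost and attribution records are complete by construction."""
    start = time.monotonic()
    response = provider(model, messages)  # the only provider invocation
    envelope = {  # simplified stand-in for LLMRequestEnvelope
        "schema_version": "0.1.0",
        "request_id": str(uuid.uuid4()),
        "model": model,
        "duration_s": time.monotonic() - start,
        "status": "ok",
    }
    sink.append(json.dumps(envelope))  # stands in for the JSONL dual-write
    return response


def fake_provider(model: str, messages: list[dict]) -> str:
    """Deterministic stub used in place of a real OpenAI/Anthropic call."""
    return "stub response"
```

Route handlers in the real app never bypass this function, which is what makes the JSONL log a complete record of provider traffic.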
src/llmscope/ — pip-installable library, runtime-agnostic

- `envelope.py` — `LLMRequestEnvelope` dataclass; `EnvelopeStatus`, `CostSource`, `CircuitState` enums
- `context.py` — `LLMRequestContext` for structured attribution
- `semconv.py` — `llmscope.*` OTel attribute constants
- `gateway/client.py` — `call_llm()`, `GatewayResult`
- `gateway/provider.py` — `ProviderBase`, `OpenAIProvider`, `AnthropicProvider`
- `gateway/telemetry.py` — `emit()` dual-write (OTel + JSONL)
- `gateway/cost_model.py` — `estimate_cost()`, hardcoded pricing
- `gateway/policies.py` — `RoutePolicy`, tier-to-model mapping
- `gateway/otel_setup.py` — `setup_otel()`, `shutdown_otel()`
- `gateway/semconv.py` — `gen_ai.*` attribute constants, `resolve_attrs()`

app/ — reference FastAPI application (not pip-installed)

- `main.py` — FastAPI app with lifespan OTel setup
- `middleware/auth.py` — API key validation
- `middleware/rate_limit.py` — per-key sliding window
- `routes/` — 3 routes + health endpoints
- `services/routing.py` — `determine_complexity()`
- `services/context_manager.py` — `prepare_context()`
- `services/token_counter.py` — `count_tokens()`
- `schemas/` — Pydantic request/response contracts

evals/ — evaluation harness (not pip-installed)

- `runners/run_classify_eval.py`
- `runners/run_answer_routed_eval.py`
- `runners/run_conversation_turn_eval.py`
- `datasets/*.jsonl` — test cases
- `assertions/` — validation logic

reporting/ — report generator (not pip-installed)

- `make_report.py` — CLI tool, markdown output

examples/ — pre-generated artifacts

- `sample_telemetry.jsonl` — 10 synthetic events
- `sample_report.md` — generated report
```
llmscope/
├── src/llmscope/          ← pip-installable library
│   ├── __init__.py        ← public API surface
│   ├── py.typed           ← PEP 561 marker
│   ├── envelope.py
│   ├── context.py
│   ├── semconv.py
│   └── gateway/
│       ├── client.py      ← call_llm() — the choke point
│       ├── provider.py
│       ├── telemetry.py
│       ├── cost_model.py
│       ├── policies.py
│       ├── otel_setup.py
│       └── semconv.py
├── app/                   ← reference FastAPI app
├── evals/                 ← eval harness
├── reporting/             ← report generator
├── examples/              ← sample artifacts
└── tests/                 ← 257 tests
```
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
```

Required environment variables:

```bash
export OPENAI_API_KEY="your_key"        # if using OpenAI
export ANTHROPIC_API_KEY="your_key"     # if using Anthropic
export APP_API_KEY="your_app_key"
```

Optional configuration (defaults shown):

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318   # OTLP collector
OTEL_SDK_DISABLED=false                             # set true in CI
RATE_LIMIT_RPM=60
MAX_CONTEXT_TOKENS=8192
```

Run the reference app:

```bash
uvicorn app.main:app --reload
```

Interactive docs at http://localhost:8000/docs.

Run tests:

```bash
OTEL_SDK_DISABLED=true python3 -m pytest tests/ -q
```

Linting and type checking:

```bash
python3 -m ruff check .
python3 -m ruff format --check .
python3 -m mypy src/llmscope/ app/ evals/ reporting/ --ignore-missing-imports
```

Run the eval harness:

```bash
OTEL_SDK_DISABLED=true python3 -m evals.runners.run_classify_eval
OTEL_SDK_DISABLED=true python3 -m evals.runners.run_answer_routed_eval
OTEL_SDK_DISABLED=true python3 -m evals.runners.run_conversation_turn_eval
```

Generate a report from the example telemetry:

```bash
python3 -m reporting.make_report \
  --after-log examples/sample_telemetry.jsonl \
  --output examples/sample_report.md
```
- Semantic cache — envelope has
cache_hit,cache_strategy,cache_key_fingerprintfields, but no cache implementation in gateway - Circuit breaker — envelope has
circuit_statefield, but no circuit breaker logic - Provider fallback — envelope has
fallback_triggered,fallback_reasonfields, butcall_llm()uses single provider per request - Policy engine — envelope has
policy_decision,policy_modefields, but no policy evaluation beyond route-level tier mapping - PII detection/redaction — envelope has
redaction_applied,pii_detectedfields, but no implementation
These are architectural affordances in the envelope contract, not working features. The envelope schema is forward-compatible; implementations can be added without breaking the contract.
Intentionally out of scope:
- Streaming responses
- Tool/function calling
- Multi-tenancy (single APP_API_KEY, no tenant isolation)
- Distributed rate limiting (in-memory only, no Redis)
- Semantic evaluation (schema checks only)
- Production deployment configuration
- Dashboard or UI
Routing classifier limitations:
The keyword-based classifier in app/services/routing.py is intentionally generic. Production use requires domain-specific keyword tuning and threshold calibration based on actual traffic analysis. The repository demonstrates the instrumentation infrastructure that makes that calibration loop measurable, not the calibrated classifier itself.
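A keyword-and-length classifier of the kind described can be sketched as follows. The keyword set and threshold here are placeholders for illustration, not the values in `app/services/routing.py`, and the function body is an assumption about the general technique rather than the project's code.

```python
# Illustrative deterministic complexity classifier; the real keyword lists and
# thresholds live in app/services/routing.py and will differ.
EXPENSIVE_KEYWORDS = {"analyze", "compare", "summarize", "explain"}  # placeholder set
LENGTH_THRESHOLD = 200  # placeholder character threshold


def determine_complexity(message: str) -> str:
    """Return a model tier: 'expensive' on keyword hit or long input, else 'cheap'."""
    words = {w.strip(".,?!").lower() for w in message.split()}
    if words & EXPENSIVE_KEYWORDS or len(message) > LENGTH_THRESHOLD:
        return "expensive"
    return "cheap"
```

Because the classifier is a pure function of the message text, its routing decisions are reproducible from logged requests, which is what makes the calibration loop measurable.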
LLMScope follows Semantic Versioning. From v0.1.0:
Stable public API:
```python
from llmscope import (
    LLMRequestEnvelope, LLMRequestContext, GatewayResult,
    ProviderBase, ProviderResponse,
    call_llm, estimate_cost, setup_otel, shutdown_otel,
    OpenAIProvider, AnthropicProvider,
)
```

Provider contract: `ProviderBase` has 2 abstract methods (`provider_name`, `complete`) and 2 default methods (`is_retryable`, `categorize_error`). New optional methods with safe defaults may be added in minor versions; new abstract methods require a major version bump.
JSONL schema: telemetry events include a `schema_version` field. New fields may appear in minor versions. Consumers should tolerate unknown keys.
OTel attributes: `llmscope.*` attribute names in `src/llmscope/semconv.py` are stable from v0.1.0. `gen_ai.*` attributes follow the OpenTelemetry GenAI Semantic Conventions and may change via `OTEL_SEMCONV_STABILITY_OPT_IN`.
Cost model: pricing values are configuration, not API. May be updated in any version to reflect current provider pricing.
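A consumer that follows the tolerate-unknown-keys rule above might read the telemetry like this. Only the `schema_version` field is stated by the contract; the `model` field and the `read_events` helper are assumptions for illustration.

```python
# Forward-compatible JSONL consumer: extract known fields, ignore the rest,
# so minor-version additions to the schema never break the reader.
import json


def read_events(lines):
    events = []
    for line in lines:
        record = json.loads(line)
        events.append({
            # schema_version is the only field guaranteed by the contract.
            "schema_version": record.get("schema_version", "unknown"),
            "model": record.get("model"),  # assumed field name; may be absent
        })
        # Unknown keys in `record` are simply never looked at.
    return events
```

Reading via `.get()` rather than direct indexing is what keeps the consumer safe against both added and absent fields.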
Planned (envelope schema already supports):
- Exact cache implementation (deterministic fingerprinting, hit/miss tracking)
- Circuit breaker (three-state machine, failure threshold, timeout)
- Provider fallback (primary → secondary on error)
- Policy engine (typed policy evaluation, budget gates)
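The deterministic fingerprinting mentioned for the planned cache could look like the sketch below. The canonicalization choices (sorted keys, compact separators, SHA-256) are assumptions about the general technique, not the project's design; the only grounded names are the envelope's `cache_key_fingerprint` field this value would populate.

```python
# Hypothetical deterministic cache-key fingerprint: hash a canonical JSON
# serialization of the request so identical requests always produce the
# same key, regardless of dict insertion order.
import hashlib
import json


def cache_key_fingerprint(model: str, messages: list[dict]) -> str:
    canonical = json.dumps(
        {"model": model, "messages": messages},
        sort_keys=True,            # key order must not affect the fingerprint
        separators=(",", ":"),     # compact form avoids whitespace variance
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Determinism is the property that matters: a replayed request maps to the same fingerprint, so hit/miss tracking stays reproducible in offline analysis.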
Under consideration:
- SDK distribution (library-only, no separate process)
- Sidecar distribution (out-of-band event collection)
- Redis-backed rate limiting
- Embedding-based routing classifier
- llm-eval-gate — evidence-based quality gate consuming LLMScope telemetry for deployment decisions
Specs under `.kiro/` document architecture decisions, instrumentation boundaries, and semantic convention alignment. The repository reflects production experience with OpenTelemetry GenAI conventions and cost attribution in regulated environments.
MIT License - see LICENSE file for details.