# llmscope-reference

Reference workload demonstrating runtime economics and operational governance for LLM inference requests.
A narrow reference application that consumes the llmscope runtime contract to apply local policy decisions, persist operational artifacts, and answer a fixed set of cost and performance questions.
This repository demonstrates pre-dispatch policy evaluation, structured cost attribution, append-only JSONL artifacts, and DuckDB-backed queries. It is not a product, platform, or observability tool.
This application answers:
- Which tenant or feature is burning the most margin per request?
- Which experiment increased cost without improving outcome?
- Which fallbacks or routing choices are masking latency?
- Which budget namespaces are triggering downgrades or denials?
- Which routes or features are no longer margin-safe?
Query results are derived from JSONL artifacts: `telemetry.jsonl` (emitted by llmscope) and `policy_decisions.jsonl` (emitted by this app).
## Policy Engine (`policy/engine.py`)

- Three primitives: `budget_threshold`, `route_preference`, `cost_anomaly`
- Pre-dispatch evaluation returns `allow`, `downgrade`, or `deny`
- DuckDB queries against local telemetry JSONL for budget and anomaly checks
- YAML configuration with namespace isolation
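A minimal sketch of what pre-dispatch evaluation can look like; the `Verdict` class and rule-dict shape below are illustrative stand-ins, not the actual types in `policy/engine.py`:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    action: str                     # "allow", "downgrade", or "deny"
    reason: Optional[str] = None

def evaluate(rule: dict, estimated_cost_usd: float, spent_usd: float) -> Verdict:
    """Apply a single budget_threshold rule before dispatching to a provider."""
    if rule["primitive"] == "budget_threshold":
        if spent_usd + estimated_cost_usd > rule["limit_usd"]:
            if rule["action"] == "deny":
                return Verdict("deny", rule.get("deny_reason"))
            return Verdict("downgrade", "budget pressure")
    return Verdict("allow")

rule = {"primitive": "budget_threshold", "limit_usd": 1.00,
        "action": "deny", "deny_reason": "Hourly budget exceeded"}
print(evaluate(rule, estimated_cost_usd=0.10, spent_usd=0.95).action)  # deny
```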
## Request Lifecycle (`app/api.py`)

- FastAPI endpoint validates requests, constructs `LLMRequestContext`
- Pre-dispatch cost estimation using the llmscope public API
- Policy evaluation before the provider call
- Deny path returns HTTP 402 without calling the provider
- Downgrade path mutates `model_tier` to `"cheap"`
- Decision logging to JSONL with fcntl advisory locking
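The decision-logging step can be sketched as an append-only JSONL write guarded by an fcntl advisory lock (POSIX-only); the function name and record fields here are illustrative, not the actual code in `policy/log.py`:

```python
import fcntl
import json
import os

def log_decision(path: str, record: dict) -> None:
    """Append one decision record as a JSON line, serialized across processes."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block until no other writer holds the lock
        try:
            f.write(json.dumps(record) + "\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

log_decision("artifacts/logs/policy_decisions.jsonl",
             {"request_id": "req-1", "decision": "allow"})
```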
## Query Layer (`reporting/queries.py`)

- Five DuckDB-backed queries answering the canonical questions
- CLI interface: `python -m reporting.queries <query_name>`
- Handles missing or empty JSONL files gracefully
## Testing (`tests/`)

- 72 tests across 8 modules
- No external API calls (mocked llmscope, fixture-based)
- CI runs on Python 3.11 and 3.12
## Not Included

- Provider configuration examples (llmscope core handles providers, not demonstrated here)
- Post-dispatch policy evaluation (only pre-dispatch exists)
- HTTP query API (queries are CLI-only)
- Policy hot reload (requires engine reload or restart)
- Authentication or RBAC
- Web dashboard or UI
- Windows support (fcntl locking is POSIX-only)
## Request Flow

1. HTTP POST to `/infer` with prompt, tenant_id, and attribution fields
2. FastAPI validates the request and constructs `LLMRequestContext`
3. Pre-dispatch cost estimation using `llmscope.get_model_for_tier()` and `estimate_cost()`
4. Policy engine evaluates rules against local telemetry JSONL
5. If `deny`: return HTTP 402, log the decision, stop
6. If `downgrade`: override `model_tier` to `"cheap"`
7. If `allow`: proceed unchanged
8. Call `llmscope.call_llm()` with the context
9. Log the decision to `policy_decisions.jsonl`
10. Return the response to the client
```
HTTP Request → FastAPI → Policy Engine → llmscope → Provider
                 ↓            ↓             ↓
        LLMRequestContext     ↓       telemetry.jsonl
                   policy_decisions.jsonl
                              ↓
                       DuckDB Queries
```
## Policy Primitives

Three primitives are implemented in `policy/engine.py`.

### budget_threshold

Enforces spending limits per namespace over time windows (hourly or daily).
```yaml
- id: "demo-hourly-cap"
  primitive: budget_threshold
  period: hourly
  limit_usd: 1.00
  action: deny
  deny_reason: "Hourly budget exceeded"
```

Actions: `deny` (reject the request) or `downgrade` (switch to the cheap tier).
Evaluates against local `telemetry.jsonl` using a DuckDB query.
### route_preference

Downgrades requests on routes configured for cheap tiers.
```yaml
- id: "demo-route-preference"
  primitive: route_preference
  route_name: "/answer-routed"
  prefer_tier: cheap
```

Always downgrades if the route matches and the current tier is not cheap.
### cost_anomaly

Alerts when estimated request cost exceeds the historical baseline.
```yaml
- id: "demo-cost-anomaly"
  primitive: cost_anomaly
  feature_id: "summarize"
  baseline_window_hours: 24
  threshold_multiplier: 3.0
  action: alert
```

Always allows the request (alert-only, never blocks).
Evaluates against local `telemetry.jsonl` using a DuckDB query.
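The alert-only check can be sketched in plain Python; the averaging here is a simplified stand-in for the DuckDB baseline query over `telemetry.jsonl`:

```python
def is_anomalous(estimated_cost_usd, baseline_costs, threshold_multiplier=3.0):
    """True when the estimate exceeds the baseline average times the multiplier."""
    if not baseline_costs:
        return False                 # no history yet: nothing to compare against
    baseline = sum(baseline_costs) / len(baseline_costs)
    return estimated_cost_usd > baseline * threshold_multiplier

history = [0.002, 0.003, 0.0025]     # e.g. last 24h of 'summarize' request costs
print(is_anomalous(0.012, history))  # True: well above 3x the ~0.0025 average
```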
## Requirements

- Python 3.11 or 3.12
- POSIX-compatible OS (macOS, Linux) for fcntl file locking
## Setup

```shell
git clone <repository-url>
cd llmscope-reference
pip install -e .[dev]
```

```shell
export TELEMETRY_PATH="artifacts/logs/telemetry.jsonl"
export DECISIONS_PATH="artifacts/logs/policy_decisions.jsonl"
export OTEL_SDK_DISABLED=true  # for local development without an OTEL backend
```

```shell
uvicorn app.main:app --reload
```

The server runs on http://localhost:8000.
## Example Request

```shell
curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarize this document",
    "tenant_id": "acme-corp",
    "feature_id": "summarize",
    "model_tier": "cheap"
  }'
```

Response:

```json
{
  "request_id": "req-20240324-abc123",
  "answer": "Here is the summary...",
  "selected_model": "gpt-4o-mini",
  "estimated_cost_usd": 0.0023,
  "tokens_in": 150,
  "tokens_out": 75,
  "policy_decision": "allow",
  "effective_model_tier": "cheap"
}
```

If policy denies the request, the server responds with HTTP 402 Payment Required:

```json
{
  "detail": "Request denied by policy: Hourly budget exceeded"
}
```
## Queries

All queries are CLI-based and read from the JSONL artifacts.
### Which tenant or feature burns the most margin per request?

```shell
python -m reporting.queries cost_by_tenant_and_feature
```

Output columns: tenant_id, feature_id, total_cost_usd, avg_cost_usd, request_count
### Which experiment increased cost without improving outcome?

```shell
python -m reporting.queries experiment_cost_vs_outcome
```

Output columns: experiment_id, avg_tokens_in, avg_tokens_out, avg_cost_usd, success_rate, request_count

Success rate = proportion of requests with `finish_reason='stop'`
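As a small worked example of that definition (the rows here are made up):

```python
rows = [
    {"finish_reason": "stop"},
    {"finish_reason": "stop"},
    {"finish_reason": "length"},   # truncated output does not count as success
]
success_rate = sum(r["finish_reason"] == "stop" for r in rows) / len(rows)
print(round(success_rate, 2))  # 0.67
```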
### Which budget namespaces trigger downgrades or denials?

```shell
python -m reporting.queries budget_pressure_by_namespace
```

Output columns: budget_namespace, allow_count, downgrade_count, deny_count, total_count
### Which fallbacks or routing choices mask latency?

```shell
python -m reporting.queries fallback_latency_masking
```

Output columns: route_name, is_fallback, p95_latency_ms, avg_latency_ms, request_count
### Which routes are no longer economically safe?

```shell
python -m reporting.queries unsafe_routes --threshold 0.05
```

Output columns: route_name, avg_cost_usd, max_cost_usd, request_count

Default threshold: $0.05 per request
## Running Tests

```shell
# Run all tests
OTEL_SDK_DISABLED=true pytest -q

# Run specific test modules
pytest tests/test_policy.py -v
pytest tests/test_queries.py -v
pytest tests/test_api.py -v
pytest tests/test_telemetry_wiring.py -v
```

72 tests across 8 modules. All tests use fixtures and mocks; no external API calls are made.
Test coverage:
- Policy engine evaluation (budget_threshold, route_preference, cost_anomaly)
- Request validation and context construction
- Decision logging with concurrent write safety
- DuckDB queries over fixture data
- API integration with mocked llmscope
- Telemetry wiring and pre-dispatch cost estimation
## Project Layout

```
llmscope-reference/
├── app/                         # FastAPI application
│   ├── api.py                   # POST /infer endpoint with policy integration
│   ├── main.py                  # App initialization, OTEL lifecycle
│   ├── schemas.py               # Request/response Pydantic models
│   └── settings.py              # Configuration (paths, env vars)
├── policy/                      # Policy engine
│   ├── engine.py                # YAMLPolicyEngine with three primitives
│   ├── loader.py                # YAML config parsing and validation
│   ├── models.py                # PolicyVerdict, PolicyDecisionRecord
│   └── log.py                   # JSONL decision logging with fcntl locking
├── reporting/                   # Query layer
│   └── queries.py               # Five DuckDB-backed canonical queries
├── config/
│   └── policy.yaml              # Policy configuration (default, demo namespaces)
├── artifacts/logs/              # Operational artifacts (gitignored)
│   ├── telemetry.jsonl          # Emitted by llmscope core
│   └── policy_decisions.jsonl   # Emitted by this app
├── tests/                       # 72 tests, no external dependencies
└── .github/workflows/
    └── ci.yml                   # Python 3.11 and 3.12 test matrix
```
## Runtime Contract

This repository consumes the llmscope runtime contract. It does not redefine or reimplement it.

llmscope (core library) owns:

- Provider abstraction and routing
- Cost estimation and normalization
- OpenTelemetry emission
- Telemetry artifact generation (`telemetry.jsonl`)
- Runtime types (`LLMRequestContext`, `GatewayResult`)
llmscope-reference (this repo) owns:

- Concrete YAML policy engine
- Pre-dispatch policy evaluation
- Local decision artifacts (`policy_decisions.jsonl`)
- DuckDB operational queries
- Reference HTTP API surface
Integration point: `llmscope.call_llm(..., context=LLMRequestContext(...))`
All inference requests pass through the public llmscope API. No internal imports from llmscope are used except where documented as technical debt in code comments.
## Dependencies

```
llmscope @ git+https://github.com/lucianareynaud/llmscope.git@5d3fdfbc
fastapi>=0.110
uvicorn[standard]>=0.29
pyyaml>=6.0
duckdb>=0.10
pydantic>=2.0
opentelemetry-instrumentation-fastapi>=0.45b0
```

llmscope is pinned to SHA 5d3fdfbc, which introduces `LLMRequestContext` with attribution fields.

Dev dependencies: pytest>=8.0, pytest-asyncio>=0.23, httpx>=0.27
## Known Limitations

- Pre-dispatch cost estimation: uses a rough token count approximation (word count * 1.3) for budget and anomaly checks
- No provider configuration examples: llmscope core handles providers, but setup is not demonstrated in this repo
- No post-dispatch policy: only pre-dispatch evaluation is implemented
- No policy hot reload: policy changes require an engine reload or restart
- POSIX-only file locking: fcntl advisory locking is not portable to Windows
- No authentication or RBAC: this is a reference implementation, not a production service
- CLI-only queries: no HTTP query API (queries run via `python -m reporting.queries`)
- Local JSONL artifacts only: no database or distributed artifact store
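The rough token approximation noted above (word count * 1.3) can be sketched as follows; the helper name is hypothetical:

```python
def estimate_tokens(prompt: str) -> int:
    """Rough pre-dispatch token estimate: whitespace word count times 1.3."""
    return int(len(prompt.split()) * 1.3)

print(estimate_tokens("Summarize this document before lunch"))  # 5 words -> 6
```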
## Out of Scope

- Web dashboard or UI
- Authentication or authorization
- Multi-tenant console
- Plugin system or policy DSL
- Generic analytics platform
- Distributed system architecture
- Streaming inference
- Agent orchestration
- Evaluation pipelines or LLM-as-judge workflows
These are not planned for the first release.
Potential extensions (not committed):
- Provider setup documentation
- Post-dispatch policy evaluation (e.g., output validation)
- Policy reload HTTP endpoint
- HTTP query API (expose queries via REST endpoints)
- Windows compatibility (replace fcntl with cross-platform locking)
Reference implementation for demonstration purposes.