Skip to content

feat: add Prometheus metrics and Grafana dashboard for observability#162

Open
rach-kanc wants to merge 5 commits into
TENET-DEV-AI:mainfrom
rach-kanc:main
Open

feat: add Prometheus metrics and Grafana dashboard for observability#162
rach-kanc wants to merge 5 commits into
TENET-DEV-AI:mainfrom
rach-kanc:main

Conversation

@rach-kanc

@rach-kanc rach-kanc commented Jun 11, 2026

Copy link
Copy Markdown

Summary

This PR instruments the ingest and analyzer microservices with Prometheus metrics to enable real-time observability, performance monitoring, and threat detection tracking. It also provides a ready-to-use Grafana dashboard template for SOC-style visibility.

Key Changes

  • Dependencies: Added prometheus_client to requirements.txt.
  • Shared Metrics Utility: Created services/utils/metrics.py which includes standard counters (REQUEST_COUNT, REQUEST_LATENCY, DETECTION_COUNT) and a custom PrometheusMiddleware for tracking FastAPI request throughput and latency automatically.
  • Ingest Service Updates: Mounted the /metrics endpoint, attached the middleware, and added detection incrementing logic based on heuristic verdicts.
  • Analyzer Service Updates: Mounted the /metrics endpoint, attached the middleware, and added detection counters to both synchronous requests and background event processing.
  • Grafana Dashboard: Added monitoring/grafana/tenet_dashboard.json for visualizing API request rates, endpoint latencies (p95), and the rate of different threat detections.

Related Issue

Fixes #100

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Summary by cubic

Adds Prometheus metrics to the ingest and analyzer services, exposes /metrics, and ships a Grafana dashboard for request rate, latency, and threat detections. Hardens ingest with a Redis circuit breaker and health monitoring; updates unit tests for ingest heuristics and endpoints.

  • New Features

    • Shared metrics utility and PrometheusMiddleware; /metrics in both services via prometheus_client.
    • Detection counters in analyzer (API + background) and ingest heuristics, labeled by service and threat type.
    • Grafana dashboard (monitoring/grafana/tenet_dashboard.json) for request rate, p95 latency, and detection rates.
    • ingest hardening: Redis circuit breaker, structured logging, and health checks.
    • Dependency: added prometheus-client.
    • Tests: updated unit tests for ingest endpoints, heuristics (including threat_type), and validation.
  • Migration

    • Configure Prometheus to scrape /metrics on ingest and analyzer.
    • Import the Grafana JSON and set the data source named “Prometheus”.

Written for commit 58b0e81. Summary will update on new commits.

Review in cubic

Summary by CodeRabbit

  • New Features
    • Added TENET AI Security & Performance Grafana dashboard with API request rates, p95 endpoint latency, and threat detection rate panels.
    • Services now expose real-time Prometheus metrics and record detection events (including threat type and verdict) for improved monitoring and alerting.
  • Chores
    • Added prometheus-client dependency to enable metrics collection.

@vercel

vercel Bot commented Jun 11, 2026

Copy link
Copy Markdown

@rach-kanc is attempting to deploy a commit to the s3dfx-cyber's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds Prometheus metrics and Grafana monitoring: a metrics module with request counters, latency histograms, and detection counters; PrometheusMiddleware and increment_detection; analyzer and ingest services register the middleware, mount /metrics, and emit detection metrics; and a Grafana dashboard JSON visualizes request rate, p95 latency, and threat detections.

Changes

Prometheus Metrics & Grafana Monitoring

Layer / File(s) Summary
Metrics infrastructure and middleware
requirements.txt, services/utils/metrics.py
Pins prometheus-client; adds REQUEST_COUNT, REQUEST_LATENCY, DETECTION_COUNT; increment_detection normalizes threat_type; PrometheusMiddleware records request counts and latencies and handles exceptions.
Analyzer service integration
services/analyzer/app.py
Registers PrometheusMiddleware, mounts /metrics (ASGI app), and calls increment_detection after analysis in /v1/analyze and in background processing with service labels analyzer / analyzer_bg.
Ingest service integration
services/ingest/app.py
Registers PrometheusMiddleware, mounts /metrics; updates quick_heuristic_check to return threat_type; LLM event ingestion unpacks threat_type and calls increment_detection(service="ingest", threat_type, verdict).
Grafana dashboard configuration
monitoring/grafana/tenet_dashboard.json
Adds dashboard TENET AI - Security & Performance (dark theme, last 1h, 5s refresh) with three Prometheus timeseries panels: API Request Rate (by endpoint/status), Endpoint Latency p95 (histogram_quantile), and Threat Detections (rate by threat_type/verdict).

Sequence Diagram

sequenceDiagram
  participant Client as Client/LoadBalancer
  participant Ingest as Ingest Service
  participant Analyzer as Analyzer Service
  participant Middleware as PrometheusMiddleware
  participant Metrics as Prometheus
  participant Grafana as Grafana Dashboard
  Client->>Analyzer: HTTP request
  Analyzer->>Middleware: request enters
  Middleware->>Analyzer: call_next(request)
  Analyzer->>Metrics: increment_detection(analyzer, threat_type, verdict)
  Middleware->>Metrics: record REQUEST_COUNT / REQUEST_LATENCY
  Client->>Ingest: ingest LLM event
  Ingest->>Ingest: quick_heuristic_check -> threat_type, verdict
  Ingest->>Metrics: increment_detection(ingest, threat_type, verdict)
  Metrics-->>Grafana: scrape /metrics (REQUEST_COUNT, REQUEST_LATENCY, DETECTION_COUNT)
  Grafana->>Grafana: visualize request rate, p95 latency, threat detections
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
Hops and counters, charts that glow,
I count each sniff, each high and low.
Prometheus sings while Grafana paints,
Threats tracked gently without complaints.
🥕🐇

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Out of Scope Changes check ❓ Inconclusive The PR description mentions ingest service hardening (Redis circuit breaker, structured logging, health monitoring) not explicitly required by issue #100, requiring clarification on scope boundaries. Clarify whether Redis circuit breaker and health monitoring changes are intentional additions or separate concerns; document or remove if out of scope for the metrics PR.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: add Prometheus metrics and Grafana dashboard for observability' clearly and concisely summarizes the main changes: adding Prometheus metrics and Grafana dashboard for observability.
Linked Issues check ✅ Passed The PR successfully addresses all coding requirements from issue #100: Prometheus client integration, request counters, detection counters, latency histograms, /metrics endpoint exposure, and Grafana dashboard template provided.
Description check ✅ Passed The PR description comprehensively covers all required template sections: Summary, Key Changes, Related Issue, Type of Change, and includes detailed migration notes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@services/ingest/app.py`:
- Around line 324-327: The current fragile mapping derives threat_type from
risk_score using magic thresholds; instead modify quick_heuristic_check to
return both the numeric risk_score and a canonical threat_type (e.g.,
"prompt_injection", "jailbreak", "data_extraction", "none"), update the call
site in services/ingest/app.py to capture that returned threat_type (instead of
computing it from risk_score) and pass it into
increment_detection(service="ingest", threat_type=threat_type, verdict=verdict),
and remove the hardcoded nested ternary that inferred threat_type from
risk_score so the source of truth comes from quick_heuristic_check.

In `@services/utils/metrics.py`:
- Around line 52-57: The except block that updates REQUEST_COUNT and
REQUEST_LATENCY (using variables method, endpoint, start_time and labels
REQUEST_COUNT/REQUEST_LATENCY) currently uses "raise e" which loses the original
traceback; change that to a bare "raise" so the original exception context is
preserved after you increment the metrics (i.e., keep the metric updates and
then use "raise" instead of "raise e").
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a03de50b-2a7f-47cb-a186-8b751c912c3e

📥 Commits

Reviewing files that changed from the base of the PR and between 721686f and 154e595.

📒 Files selected for processing (5)
  • monitoring/grafana/tenet_dashboard.json
  • requirements.txt
  • services/analyzer/app.py
  • services/ingest/app.py
  • services/utils/metrics.py

Comment thread services/ingest/app.py
Comment thread services/utils/metrics.py Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 5 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread services/utils/metrics.py Outdated
Comment thread services/utils/metrics.py Outdated
@rach-kanc

Copy link
Copy Markdown
Author

@CodeRabbit

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

@rach-kanc Hi! 👋 It looks like you've tagged me without a specific request. Here are a few things I can help you with on this PR:

  • Review the changes — I can perform a full code review of the Prometheus metrics and Grafana dashboard additions.
  • Summarize the PR — I can regenerate the AI-generated summary if needed.
  • Answer questions — Ask me anything about the changes, e.g., metrics design, middleware behavior, or Grafana dashboard configuration.
  • Create follow-up issues — If any improvements are out of scope for this PR, I can help open tracking issues.

Just let me know what you need, or I can go ahead and kick off a review!

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rach-kanc

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rach-kanc

Copy link
Copy Markdown
Author

@S3DFX-CYBER Is my PR ready to merge ? All AI checks passed 👍

@vercel

vercel Bot commented Jun 11, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
tenet-ai.org Ready Ready Preview, Comment Jun 11, 2026 6:39pm

@S3DFX-CYBER

Copy link
Copy Markdown
Collaborator

@rach-kanc unit tests are failing

@rach-kanc

rach-kanc commented Jun 12, 2026

Copy link
Copy Markdown
Author

@S3DFX-CYBER fixed
Do add label ssoc26

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Prometheus Metrics and Health Monitoring

2 participants