Akshay/datadog dashboard #4054

Akkikens · 2025-11-26T04:45:38Z

feat(datadog): infrastructure health dashboard and metrics ingestion

implements datadog ingestion pipeline for infrastructure health metrics per plan:

- CI metrics collector (workflow success, duration, flakiness, PR stats)
- training & eval health collectors
- kubernetes CronJob deployment with GitHub Actions CI/CD
- dashboard deployed and showing real metrics: https://p.datadoghq.com/sb/d5c8e3aa-6cc1-11f0-8647-42f10350ee07-c29d3c5ffe74823278e9234cc4729b74

metrics are flowing from my branch and visible in datadog. ready for review!
Asana Task

chatgpt-codex-connector · 2025-11-26T04:45:46Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

graphite-app · 2025-11-26T04:50:50Z

devops/charts/helmfile.yaml

+  - name: training-health-cronjob
+    <<: *cronjob_template
+    values:
+      - schedule: "0 * * * *"  # Hourly
+      - image:
+          tag: latest-dashboard
+      - datadog:
+          service: training-health-collector
+      - args:
+          - collect
+          - training
+          - --push


The training-health-cronjob sets args but no command. In Kubernetes, args without command replaces the image's CMD while keeping its ENTRYPOINT. If the image defines the CLI invocation only through CMD and has no ENTRYPOINT, the pod will try to execute ["collect", "training", "--push"] directly instead of ["python", "-m", "devops.datadog.cli", "collect", "training", "--push"], and will fail immediately.

Fix: Add the command field:

      - command:
          - python
          - -m
          - devops.datadog.cli
      - args:
          - collect
          - training
          - --push

Suggested change

   - name: training-health-cronjob
     <<: *cronjob_template
     values:
       - schedule: "0 * * * *"  # Hourly
       - image:
           tag: latest-dashboard
       - datadog:
           service: training-health-collector
+      - command:
+          - python
+          - -m
+          - devops.datadog.cli
       - args:
           - collect
           - training
           - --push

Spotted by Graphite Agent
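The resolution rule this comment relies on can be sketched in a few lines of Python. This is a simplified model of the documented Kubernetes behaviour, not code from this PR: `effective_argv` is a hypothetical helper, and the example assumes the image has no ENTRYPOINT and runs the CLI through CMD, which the thread implies but does not show.

```python
from typing import List, Optional, Sequence


def effective_argv(
    image_entrypoint: Sequence[str],
    image_cmd: Sequence[str],
    command: Optional[Sequence[str]] = None,
    args: Optional[Sequence[str]] = None,
) -> List[str]:
    """Model how a container's argv is chosen from image defaults and pod spec."""
    if command is not None:
        # `command` overrides ENTRYPOINT; the image CMD is ignored either way.
        return list(command) + list(args or [])
    if args is not None:
        # `args` alone replaces CMD but keeps the image ENTRYPOINT.
        return list(image_entrypoint) + list(args)
    # Neither field set: fall back to the image defaults.
    return list(image_entrypoint) + list(image_cmd)


# Assumption: no ENTRYPOINT, CLI invoked via CMD.
cli = ["python", "-m", "devops.datadog.cli"]
sub = ["collect", "training", "--push"]

broken = effective_argv([], cli, args=sub)              # args only
fixed = effective_argv([], cli, command=cli, args=sub)  # command + args
```

Under that assumption, `broken` is just the subcommand list, while `fixed` starts with the Python module invocation.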


graphite-app · 2025-11-26T04:50:51Z

devops/datadog/collectors/ci_collector.py

+    def _count_other_workflows_failing(self) -> int:
+        """Count workflows (excluding tests/benchmarks) whose latest run is failing."""
+        all_workflows = self.github.list_workflows(self.config.repo)
+        exclude_workflows = set(self.config.tests_blocking_merge_workflows) | set(self.config.benchmarks_workflows)
+
+        failing_count = 0
+        for workflow in all_workflows:
+            workflow_name = workflow.get("name") or workflow.get("path", "").split("/")[-1]
+            workflow_id = workflow.get("id")
+
+            # Skip if this is a test/benchmark workflow
+            if workflow_name in exclude_workflows:
+                continue


Workflow exclusion logic is broken when workflows are configured by ID. The exclude_workflows set contains workflow IDs as strings (e.g., "163901189" from CI_TESTS_BLOCKING_MERGE_WORKFLOWS), but line 326 compares against workflow_name which is the workflow's name string from the API (e.g., "Test and Benchmark"). This mismatch means workflows configured by ID will never be excluded, causing incorrect failing workflow counts.

Fix: Also check against workflow_id:

exclude_workflows = set(self.config.tests_blocking_merge_workflows) | set(self.config.benchmarks_workflows)

failing_count = 0
for workflow in all_workflows:
    workflow_name = workflow.get("name") or workflow.get("path", "").split("/")[-1]
    workflow_id = str(workflow.get("id", ""))

    # Skip if this workflow matches by name OR id
    if workflow_name in exclude_workflows or workflow_id in exclude_workflows:
        continue

Suggested change

     def _count_other_workflows_failing(self) -> int:
         """Count workflows (excluding tests/benchmarks) whose latest run is failing."""
         all_workflows = self.github.list_workflows(self.config.repo)
         exclude_workflows = set(self.config.tests_blocking_merge_workflows) | set(self.config.benchmarks_workflows)
 
         failing_count = 0
         for workflow in all_workflows:
             workflow_name = workflow.get("name") or workflow.get("path", "").split("/")[-1]
-            workflow_id = workflow.get("id")
+            workflow_id = str(workflow.get("id", ""))
 
-            # Skip if this is a test/benchmark workflow
-            if workflow_name in exclude_workflows:
+            # Skip if this workflow matches by name OR id
+            if workflow_name in exclude_workflows or workflow_id in exclude_workflows:
                 continue

Spotted by Graphite Agent
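The name-or-ID match can be checked in isolation with a small standalone sketch. `is_excluded` is a hypothetical helper rather than a function from the PR, and the second workflow entry is made up for illustration; the ID and name of the first come from the comment above.

```python
from typing import Any, List, Mapping, Set


def is_excluded(workflow: Mapping[str, Any], exclude: Set[str]) -> bool:
    """Return True if the workflow matches the exclusion set by name or by ID."""
    name = workflow.get("name") or str(workflow.get("path", "")).split("/")[-1]
    # The API returns the ID as an int; configured values are strings, so normalise.
    workflow_id = str(workflow.get("id", ""))
    return name in exclude or workflow_id in exclude


exclude = {"163901189"}  # configured by ID
workflows: List[Mapping[str, Any]] = [
    {"id": 163901189, "name": "Test and Benchmark"},
    {"id": 42, "name": "Docs"},  # made-up entry
]

# Only workflows that are not excluded should be counted as "other".
others = [w["name"] for w in workflows if not is_excluded(w, exclude)]
```

With a name-only comparison, "Test and Benchmark" would stay in `others` even though its ID is configured for exclusion.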


Add collector for GitHub Actions CI metrics including:

- Workflow success rates and flakiness
- P90 duration tracking
- Cancelled job counts
- Tests blocking merge status
- Benchmarks status
- Other workflow failure counts
- Weekly PR metrics (hotfixes, force merges, reverts)

Configurable via environment variables with thresholds matching infrastructure health requirements.

Implement Typer-based CLI with commands for:

- Listing available collectors
- Running collectors with dry-run mode
- Pushing metrics to Datadog
- Listing workflows for configuration

Supports --dry-run for testing and --push for production.

Akkikens · 2025-11-26T23:33:47Z

@codex review

chatgpt-codex-connector · 2025-11-26T23:33:53Z

Codex review is not enabled for this repo. Please contact the admins of this repo to enable Codex.

Akkikens self-assigned this Nov 26, 2025

graphite-app bot reviewed Nov 26, 2025

Akkikens force-pushed the akshay/datadog-dashboard branch 2 times, most recently from fcf31e7 to 6e76dcb on November 26, 2025 20:35

Akkikens added 19 commits November 26, 2025 16:07

feat(datadog): add core metric models and package structure

bd87b82

Introduce MetricSample and MetricKind types for Datadog metric ingestion. Establishes foundation for infrastructure health monitoring.
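For readers without the diff open, a minimal sketch of what such types might look like follows. Only the names `MetricSample` and `MetricKind` and the three kinds (gauge, count, distribution) come from this PR's commit messages; the fields, defaults, and the example metric name are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class MetricKind(str, Enum):
    """Metric types named in the commit messages."""
    GAUGE = "gauge"
    COUNT = "count"
    DISTRIBUTION = "distribution"


@dataclass(frozen=True)
class MetricSample:
    """One data point to submit; the field layout here is hypothetical."""
    name: str
    value: float
    kind: MetricKind = MetricKind.GAUGE
    tags: Tuple[str, ...] = ()
    timestamp: Optional[float] = None  # Unix seconds; None means "now" at submit time


# Hypothetical metric name and tag, for illustration only.
sample = MetricSample("ci.workflow.success_rate", 0.97, tags=("repo:example/repo",))
```

Making the dataclass frozen keeps samples hashable and prevents collectors from mutating a point after it has been built.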

feat(datadog): add utility functions for timestamp and percentile calculations

018cfde

Provide helpers for parsing GitHub timestamps, ISO8601 dates, and computing percentiles for CI duration metrics.
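As a rough illustration of the kind of helpers this commit describes, the sketch below parses a GitHub-style UTC timestamp and computes a nearest-rank percentile. The function names and the choice of the nearest-rank method are assumptions; the PR's actual helpers may interpolate or differ in other ways. The duration values are made up.

```python
import math
from datetime import datetime
from typing import Sequence


def parse_github_timestamp(value: str) -> datetime:
    """Parse a GitHub-style UTC timestamp such as 2025-11-26T04:45:38Z."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 onward.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile; p is expected in the range (0, 100]."""
    if not values:
        raise ValueError("percentile() requires at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


started = parse_github_timestamp("2025-11-26T04:45:38Z")
finished = parse_github_timestamp("2025-11-26T04:50:50Z")
duration_s = (finished - started).total_seconds()

# Made-up run durations in seconds.
p90 = percentile([60, 75, 80, 90, 120, 130, 150, 200, 240, 600], 90)
```

Nearest-rank always returns an observed value, which keeps a single slow outlier from dragging an interpolated P90 upward on small samples.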

feat(datadog): implement base collector framework

087baf5

Add abstract BaseCollector class with standardized metric naming, tag generation, and sample building. Includes registry pattern for collector discovery.
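The registry pattern mentioned here could look roughly like the following. Only the name `BaseCollector` comes from the PR; the `register` decorator, the `COLLECTORS` dict, the `infra_health` prefix, and the example metric are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Type

COLLECTORS: Dict[str, Type["BaseCollector"]] = {}


def register(cls: Type["BaseCollector"]) -> Type["BaseCollector"]:
    """Class decorator that records a collector under its `name`."""
    COLLECTORS[cls.name] = cls
    return cls


class BaseCollector(ABC):
    name: str = "base"
    namespace: str = "infra_health"  # hypothetical metric prefix

    def metric_name(self, suffix: str) -> str:
        """Build a dotted metric name: <namespace>.<collector>.<suffix>."""
        return f"{self.namespace}.{self.name}.{suffix}"

    def tags(self, **extra: str) -> List[str]:
        """Return key:value tags, always including the collector name."""
        merged = {"collector": self.name, **extra}
        return [f"{key}:{value}" for key, value in sorted(merged.items())]

    @abstractmethod
    def collect(self) -> List[Tuple[str, float, List[str]]]:
        """Return (metric name, value, tags) triples."""


@register
class CICollector(BaseCollector):
    name = "ci"

    def collect(self) -> List[Tuple[str, float, List[str]]]:
        # A real collector would query GitHub here; this returns a fixed point.
        return [(self.metric_name("workflows.failing"), 0.0, self.tags(repo="example/repo"))]


samples = COLLECTORS["ci"]().collect()
```

A CLI can then list `COLLECTORS.keys()` and instantiate a collector by name without importing each class explicitly.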

feat(datadog): add GitHub API client for workflow and PR data

59cd3b0

Implement client for querying GitHub Actions workflows, runs, and PR search. Supports workflow discovery and latest run status checks.

feat(datadog): implement Datadog Metrics API v2 client

8d86030

Add client for submitting metrics to Datadog via Metrics API v2. Handles authentication via environment variables or AWS Secrets Manager. Supports all metric types (gauge, count, distribution).

feat(datadog): add training and eval health collectors

747c640

Implement collectors for stable-suite training and evaluation metrics. Read from JSON files published by training/eval pipelines and convert to standardized Datadog metric format.

feat(datadog): add sample data files for training/eval collectors

143a8ef

Include example JSON schemas and sample data for testing collectors. Documents expected format for stable-suite pipeline outputs.

feat(k8s): add Dockerfile for dashboard collector cronjob

1a357c6

Build container image for Datadog metric collection. Installs workspace dependencies (softmax, metta-common) and sets up Python environment for collectors.

feat(k8s): update cronjob chart for Datadog environment variables

a2d7685

Add support for DD_ENV, DD_SERVICE, DD_VERSION, and DD_SITE environment variables. Extend extraEnv support for collector configuration.
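On the collector side, these variables would typically be read once and turned into tags. The sketch below is an assumption about how that might be done, not the PR's code: `datadog_settings` is a hypothetical helper, and the `production` value is made up. The service name and the us3 site appear elsewhere in this PR.

```python
import os
from typing import Any, Dict, Mapping, Optional


def datadog_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Read DD_* variables and derive env/service/version tags from them."""
    env = os.environ if env is None else env
    tags = []
    for var, tag_key in (("DD_ENV", "env"), ("DD_SERVICE", "service"), ("DD_VERSION", "version")):
        value = env.get(var)
        if value:  # skip unset or empty variables
            tags.append(f"{tag_key}:{value}")
    # datadoghq.com is Datadog's default site when DD_SITE is not set.
    return {"site": env.get("DD_SITE", "datadoghq.com"), "tags": tags}


settings = datadog_settings(
    {
        "DD_ENV": "production",  # made-up value
        "DD_SERVICE": "training-health-collector",
        "DD_SITE": "us3.datadoghq.com",
    }
)
```

Passing the environment in as a mapping keeps the function easy to test without touching `os.environ`.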

feat(k8s): configure dashboard cronjob with workflow IDs

208f74b

Add dashboard-cronjob release with:

- 10-minute schedule for CI metrics
- Workflow IDs for tests blocking merge and benchmarks
- Datadog service tagging
- Proper image tagging for CI/CD integration

fix(datadog): convert relative imports to absolute imports

66d0965

Replace all relative imports (..models, .base) with absolute imports (devops.datadog.models, devops.datadog.collectors.base) to comply with TID252 linting rule. Fixes 20 linting errors.

fix: Add GitHub token loading from AWS Secrets Manager and fix CronJob command

c2a0359

Revert "fix: Add GitHub token loading from AWS Secrets Manager and fix CronJob command"

10187e5

This reverts commit 0f29805.

fix: Add GitHub token loading from AWS Secrets Manager and fix CronJob command

8bc2a56

fix: Address code review feedback

650d201

- Add command field to training-health-cronjob and eval-health-cronjob
- Fix workflow exclusion logic to check both workflow name and ID

fix: handle MetricIntakeType.DISTRIBUTION gracefully

a961ffa

fallback to GAUGE if DISTRIBUTION is not available in datadog-api-client version
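One way to express this kind of fallback is a `getattr` with a default. The sketch below uses a stand-in enum instead of importing datadog-api-client, so it runs without the library; `StubIntakeType`, its member values, and `resolve_intake_type` are hypothetical and may not match how the commit implements it.

```python
from enum import Enum


class StubIntakeType(Enum):
    """Stand-in for the client library's enum; deliberately lacks DISTRIBUTION."""
    COUNT = 1  # arbitrary values, not taken from the library
    GAUGE = 3


def resolve_intake_type(enum_cls, kind: str):
    """Look up `kind` on the enum, falling back to GAUGE when it is missing."""
    return getattr(enum_cls, kind.upper(), enum_cls.GAUGE)


as_count = resolve_intake_type(StubIntakeType, "count")
as_distribution = resolve_intake_type(StubIntakeType, "distribution")
```

Note that a distribution submitted as a gauge loses its percentile aggregation on the Datadog side, so the fallback trades fidelity for not crashing.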

Akkikens force-pushed the akshay/datadog-dashboard branch from 6e76dcb to a961ffa on November 26, 2025 21:07

Akkikens added 5 commits November 26, 2025 16:14

fix: add build-time verification for devops.datadog import

0886764

fix: update Datadog site to us3.datadoghq.com

f5e08db

debug: add detailed logging for metric submission

9793690

debug: add detailed logging for API key loading from Secrets Manager

d7321a4

style: remove trailing whitespace

163c030

Merge branch 'main' into akshay/datadog-dashboard

d8d3e76
