@Akkikens Akkikens commented Nov 26, 2025

feat(datadog): infrastructure health dashboard and metrics ingestion

Implements the Datadog ingestion pipeline for infrastructure health metrics per plan:

Metrics are flowing from my branch and are visible in Datadog. Ready for review!
Asana Task

@Akkikens Akkikens self-assigned this Nov 26, 2025
@chatgpt-codex-connector
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Comment on lines 218 to 233
- name: training-health-cronjob
  <<: *cronjob_template
  values:
    - schedule: "0 * * * *" # Hourly
    - image:
        tag: latest-dashboard
    - datadog:
        service: training-health-collector
    - args:
        - collect
        - training
        - --push
The training-health-cronjob sets args but no command. In Kubernetes, args without command replaces the image's CMD while keeping its ENTRYPOINT. If the image defines no ENTRYPOINT and relies on CMD to launch the CLI, the pod will try to execute ["collect", "training", "--push"] directly instead of ["python", "-m", "devops.datadog.cli", "collect", "training", "--push"] and fail immediately.

Fix: Add the command field:

- command:
    - python
    - -m
    - devops.datadog.cli
- args:
    - collect
    - training
    - --push
Suggested change
- name: training-health-cronjob
  <<: *cronjob_template
  values:
    - schedule: "0 * * * *" # Hourly
    - image:
        tag: latest-dashboard
    - datadog:
        service: training-health-collector
    - command:
        - python
        - -m
        - devops.datadog.cli
    - args:
        - collect
        - training
        - --push

Spotted by Graphite Agent
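The interaction between the image's ENTRYPOINT/CMD and the pod's command/args can be sketched as a small resolver. This is only an illustration of the documented Kubernetes rules; the function name `effective_argv` is hypothetical, and whether the collector image actually defines an ENTRYPOINT is not stated in this thread.

```python
from __future__ import annotations

from typing import Optional, Sequence


def effective_argv(
    entrypoint: Optional[Sequence[str]],
    cmd: Optional[Sequence[str]],
    command: Optional[Sequence[str]] = None,
    args: Optional[Sequence[str]] = None,
) -> list[str]:
    """Return the argv a container runs, given image and pod settings.

    entrypoint/cmd come from the image; command/args come from the pod spec.
    """
    if command is not None:
        # command replaces ENTRYPOINT, and the image CMD is ignored.
        return list(command) + list(args or [])
    if args is not None:
        # args replaces CMD, but the image ENTRYPOINT is kept.
        return list(entrypoint or []) + list(args)
    # Neither set: the image defaults apply.
    return list(entrypoint or []) + list(cmd or [])
```

With an image that only sets CMD, `args` alone yields `["collect", "training", "--push"]`, which is the failure mode described above. Setting `command` explicitly makes the result independent of how the image was built.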


Comment on lines 315 to 327
def _count_other_workflows_failing(self) -> int:
    """Count workflows (excluding tests/benchmarks) whose latest run is failing."""
    all_workflows = self.github.list_workflows(self.config.repo)
    exclude_workflows = set(self.config.tests_blocking_merge_workflows) | set(self.config.benchmarks_workflows)

    failing_count = 0
    for workflow in all_workflows:
        workflow_name = workflow.get("name") or workflow.get("path", "").split("/")[-1]
        workflow_id = workflow.get("id")

        # Skip if this is a test/benchmark workflow
        if workflow_name in exclude_workflows:
            continue
Workflow exclusion logic is broken when workflows are configured by ID. The exclude_workflows set contains workflow IDs as strings (e.g., "163901189" from CI_TESTS_BLOCKING_MERGE_WORKFLOWS), but line 326 compares against workflow_name which is the workflow's name string from the API (e.g., "Test and Benchmark"). This mismatch means workflows configured by ID will never be excluded, causing incorrect failing workflow counts.

Fix: Also check against workflow_id:

exclude_workflows = set(self.config.tests_blocking_merge_workflows) | set(self.config.benchmarks_workflows)

failing_count = 0
for workflow in all_workflows:
    workflow_name = workflow.get("name") or workflow.get("path", "").split("/")[-1]
    workflow_id = str(workflow.get("id", ""))

    # Skip if this workflow matches by name OR id
    if workflow_name in exclude_workflows or workflow_id in exclude_workflows:
        continue
Suggested change
def _count_other_workflows_failing(self) -> int:
    """Count workflows (excluding tests/benchmarks) whose latest run is failing."""
    all_workflows = self.github.list_workflows(self.config.repo)
    exclude_workflows = set(self.config.tests_blocking_merge_workflows) | set(self.config.benchmarks_workflows)
    failing_count = 0
    for workflow in all_workflows:
        workflow_name = workflow.get("name") or workflow.get("path", "").split("/")[-1]
        workflow_id = str(workflow.get("id", ""))
        # Skip if this workflow matches by name OR id
        if workflow_name in exclude_workflows or workflow_id in exclude_workflows:
            continue

Spotted by Graphite Agent
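The name-or-ID match can also be isolated into a helper that is easy to unit test. The sketch below is an assumption about how that might look, not the code in this PR: `is_excluded` is a hypothetical name, and it normalizes both sides to strings so that IDs configured as integers or as strings behave the same.

```python
from __future__ import annotations

from typing import Any, Iterable


def is_excluded(workflow: dict[str, Any], exclude: Iterable[Any]) -> bool:
    """Return True if the workflow matches an excluded name or ID."""
    name = workflow.get("name") or workflow.get("path", "").split("/")[-1]
    candidates = {name}

    workflow_id = workflow.get("id")
    if workflow_id is not None:
        # Only add the ID when present, so a missing ID never becomes "None".
        candidates.add(str(workflow_id))

    # Configured values may be names or IDs, as str or int.
    normalized = {str(item) for item in exclude}
    return bool(candidates & normalized)
```

For example, a workflow with `id` 163901189 and name "Test and Benchmark" is excluded by either `{"163901189"}` or `{"Test and Benchmark"}`.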


@Akkikens Akkikens force-pushed the akshay/datadog-dashboard branch 2 times, most recently from fcf31e7 to 6e76dcb Compare November 26, 2025 20:35
Introduce MetricSample and MetricKind types for Datadog metric
ingestion. Establishes foundation for infrastructure health monitoring.
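A minimal sketch of what such types could look like follows. Only the names `MetricSample` and `MetricKind` and the three kinds (gauge, count, distribution) come from this PR; every field name and default below is an assumption for illustration.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from enum import Enum


class MetricKind(str, Enum):
    """Metric kinds mentioned in the PR."""

    GAUGE = "gauge"
    COUNT = "count"
    DISTRIBUTION = "distribution"


@dataclass(frozen=True)
class MetricSample:
    """One data point to submit; the field layout is hypothetical."""

    name: str
    value: float
    kind: MetricKind = MetricKind.GAUGE
    tags: tuple[str, ...] = ()
    # Unix timestamp in seconds, captured when the sample is created.
    timestamp: float = field(default_factory=time.time)
```

Making the sample frozen keeps collectors from mutating values after they are built, which is one reasonable design choice among several.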
…culations

Provide helpers for parsing GitHub timestamps, ISO8601 dates, and
computing percentiles for CI duration metrics.
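As a rough illustration of such helpers, the sketch below parses the UTC timestamps returned by the GitHub REST API and computes a nearest-rank percentile. The function names and the choice of the nearest-rank method are assumptions; the PR does not say which percentile definition it uses.

```python
from __future__ import annotations

import math
from datetime import datetime
from typing import Sequence


def parse_github_timestamp(value: str) -> datetime:
    """Parse a timestamp such as '2025-11-26T20:35:00Z' into an aware datetime."""
    # Older Python versions reject a trailing "Z", so rewrite it as an offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

A P90 duration then follows from subtracting two parsed timestamps per run, converting to seconds, and calling `percentile(durations, 90)`.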
Add abstract BaseCollector class with standardized metric naming,
tag generation, and sample building. Includes registry pattern for
collector discovery.
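The registry pattern described here might look roughly like the following. Only the class name `BaseCollector` comes from the PR; the `register` decorator, the `COLLECTORS` dict, the `slug` and `prefix` attributes, and the dict-shaped sample are all hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod

# Maps a collector slug to its class so a CLI can discover collectors by name.
COLLECTORS: dict[str, type[BaseCollector]] = {}


def register(cls: type[BaseCollector]) -> type[BaseCollector]:
    """Class decorator that adds a collector to the registry."""
    COLLECTORS[cls.slug] = cls
    return cls


class BaseCollector(ABC):
    slug = "base"
    prefix = "infra_health"  # hypothetical metric namespace

    def metric_name(self, suffix: str) -> str:
        return f"{self.prefix}.{self.slug}.{suffix}"

    def tags(self, **extra: str) -> list[str]:
        # Sorted so output is stable regardless of keyword order.
        return sorted(f"{key}:{value}" for key, value in extra.items())

    def sample(self, suffix: str, value: float, **tags: str) -> dict:
        return {"metric": self.metric_name(suffix), "value": float(value), "tags": self.tags(**tags)}

    @abstractmethod
    def collect(self) -> list[dict]:
        """Return the samples for one collection pass."""


@register
class ExampleCollector(BaseCollector):
    slug = "example"

    def collect(self) -> list[dict]:
        return [self.sample("heartbeat", 1, source="demo")]
```

Looking up `COLLECTORS["example"]` and calling `collect()` on an instance returns a single sample named `infra_health.example.heartbeat`.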
Implement client for querying GitHub Actions workflows, runs, and
PR search. Supports workflow discovery and latest run status checks.

Add client for submitting metrics to Datadog via Metrics API v2.
Handles authentication via environment variables or AWS Secrets Manager.
Supports all metric types (gauge, count, distribution).

Add collector for GitHub Actions CI metrics including:
- Workflow success rates and flakiness
- P90 duration tracking
- Cancelled job counts
- Tests blocking merge status
- Benchmarks status
- Other workflow failure counts
- Weekly PR metrics (hotfixes, force merges, reverts)

Configurable via environment variables with thresholds matching
infrastructure health requirements.
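Two of the metrics listed above, success rate and cancelled counts, can be derived from the `status` and `conclusion` fields that the GitHub Actions API returns for each workflow run. The sketch below is an assumption about one reasonable definition (cancelled and skipped runs are left out of the success-rate denominator); the PR does not spell out its exact formula, and `summarize_runs` is a hypothetical name.

```python
from __future__ import annotations

from typing import Any, Iterable, Optional


def summarize_runs(runs: Iterable[dict[str, Any]]) -> dict[str, Optional[float]]:
    """Compute success rate and cancelled count from workflow run records."""
    completed = [run for run in runs if run.get("status") == "completed"]

    cancelled = sum(1 for run in completed if run.get("conclusion") == "cancelled")

    # Only runs that actually passed or failed count toward the rate.
    decided = [
        run for run in completed
        if run.get("conclusion") in ("success", "failure", "timed_out")
    ]
    successes = sum(1 for run in decided if run.get("conclusion") == "success")

    # None signals "no data" rather than reporting a misleading 0% or 100%.
    rate = successes / len(decided) if decided else None
    return {"success_rate": rate, "cancelled": cancelled}
```

Returning `None` for an empty window lets the caller skip the gauge entirely instead of emitting a value that would trip a threshold alert.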
Implement collectors for stable-suite training and evaluation metrics.
Read from JSON files published by training/eval pipelines and convert
to standardized Datadog metric format.
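Reading a pipeline's JSON output and turning it into metric samples could be sketched as below. The schema used here (a `run_id` string plus a flat `metrics` mapping) is purely hypothetical, as are the function name and the `training` prefix; the real format is the one documented by the example schemas in this PR.

```python
from __future__ import annotations

import json
from pathlib import Path


def samples_from_json(path: str, prefix: str = "training") -> list[dict]:
    """Convert a JSON file of named values into metric-sample dicts."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    tags = [f"run_id:{payload['run_id']}"] if "run_id" in payload else []

    samples = []
    for name, value in sorted(payload.get("metrics", {}).items()):
        # Skip non-numeric entries; bool is excluded because it subclasses int.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        samples.append({"metric": f"{prefix}.{name}", "value": float(value), "tags": tags})
    return samples
```

Filtering non-numeric values keeps a malformed or annotated file from failing the whole collection pass.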
Implement Typer-based CLI with commands for:
- Listing available collectors
- Running collectors with dry-run mode
- Pushing metrics to Datadog
- Listing workflows for configuration

Supports --dry-run for testing and --push for production.

Include example JSON schemas and sample data for testing collectors.
Documents expected format for stable-suite pipeline outputs.

Build container image for Datadog metric collection. Installs
workspace dependencies (softmax, metta-common) and sets up
Python environment for collectors.

Add support for DD_ENV, DD_SERVICE, DD_VERSION, and DD_SITE
environment variables. Extend extraEnv support for collector
configuration.

Add dashboard-cronjob release with:
- 10-minute schedule for CI metrics
- Workflow IDs for tests blocking merge and benchmarks
- Datadog service tagging
- Proper image tagging for CI/CD integration

Replace all relative imports (..models, .base) with absolute imports
(devops.datadog.models, devops.datadog.collectors.base) to comply
with TID252 linting rule. Fixes 20 linting errors.

- Change Helm chart to use existing 'standard-service-account' IAM role instead of non-existent 'monitoring-cronjobs'
- Update Terraform to allow monitoring namespace service accounts to assume the role
- This enables cronjobs in monitoring namespace to access AWS Secrets Manager for GitHub token

- Add command field to training-health-cronjob and eval-health-cronjob
- Fix workflow exclusion logic to check both workflow name and ID

Fall back to GAUGE if DISTRIBUTION is not available in the installed datadog-api-client version.
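A fallback of this kind can be written as an attribute lookup with a default. The sketch below uses a stand-in enum rather than the real client class, so that it runs without the library installed; `StandInIntakeType` and `resolve_intake_type` are hypothetical names, and the actual lookup in this PR may differ.

```python
from enum import Enum


class StandInIntakeType(Enum):
    """Stand-in for a client enum that has no DISTRIBUTION member (assumption)."""

    COUNT = 1
    GAUGE = 3


def resolve_intake_type(kind: str, intake_enum=StandInIntakeType):
    """Map a metric kind name to the client's intake type, defaulting to GAUGE."""
    # getattr with a default avoids AttributeError when the member is absent.
    return getattr(intake_enum, kind.upper(), intake_enum.GAUGE)
```

Note that a distribution submitted as a gauge loses its server-side percentile aggregation, so the fallback trades fidelity for compatibility.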
@Akkikens Akkikens force-pushed the akshay/datadog-dashboard branch from 6e76dcb to a961ffa Compare November 26, 2025 21:07
@Akkikens

@codex review

@chatgpt-codex-connector
Codex review is not enabled for this repo. Please contact the admins of this repo to enable Codex.
