Akshay/datadog dashboard #4054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Conversation
```yaml
- name: training-health-cronjob
  <<: *cronjob_template
  values:
    - schedule: "0 * * * *" # Hourly
    - image:
        tag: latest-dashboard
    - datadog:
        service: training-health-collector
    - args:
        - collect
        - training
        - --push
```
The training-health-cronjob is missing the command field but includes args. In Kubernetes, when args is specified without command, it replaces the Dockerfile's CMD entirely. This means the pod will try to execute ["collect", "training", "--push"] directly instead of ["python", "-m", "devops.datadog.cli", "collect", "training", "--push"], causing immediate failure.
Fix: Add the command field:
```yaml
- name: training-health-cronjob
  <<: *cronjob_template
  values:
    - schedule: "0 * * * *" # Hourly
    - image:
        tag: latest-dashboard
    - datadog:
        service: training-health-collector
    - command:
        - python
        - -m
        - devops.datadog.cli
    - args:
        - collect
        - training
        - --push
```
Spotted by Graphite Agent
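The behavior the comment relies on is Kubernetes' documented command/args override rules. A minimal Python sketch of those rules (`resolve_entrypoint` is a hypothetical helper written for illustration, not part of this PR):

```python
def resolve_entrypoint(image_entrypoint, image_cmd, command=None, args=None):
    """Mimic how Kubernetes combines a pod spec's command/args with the
    image's ENTRYPOINT/CMD:
      - neither set:   ENTRYPOINT + CMD
      - command only:  command (image CMD is ignored)
      - args only:     ENTRYPOINT + args (image CMD is replaced)
      - both set:      command + args
    """
    if command is not None:
        entrypoint, cmd = command, (args if args is not None else [])
    else:
        entrypoint, cmd = image_entrypoint, (args if args is not None else image_cmd)
    return (entrypoint or []) + (cmd or [])

image_cmd = ["python", "-m", "devops.datadog.cli"]

# args without command: the interpreter prefix is lost, so the pod tries
# to exec "collect" as a binary and fails immediately.
broken = resolve_entrypoint(None, image_cmd, args=["collect", "training", "--push"])
print(broken)  # ['collect', 'training', '--push']

# Adding command restores the full invocation.
fixed = resolve_entrypoint(
    None,
    image_cmd,
    command=["python", "-m", "devops.datadog.cli"],
    args=["collect", "training", "--push"],
)
print(fixed)  # ['python', '-m', 'devops.datadog.cli', 'collect', 'training', '--push']
```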
```python
def _count_other_workflows_failing(self) -> int:
    """Count workflows (excluding tests/benchmarks) whose latest run is failing."""
    all_workflows = self.github.list_workflows(self.config.repo)
    exclude_workflows = set(self.config.tests_blocking_merge_workflows) | set(self.config.benchmarks_workflows)

    failing_count = 0
    for workflow in all_workflows:
        workflow_name = workflow.get("name") or workflow.get("path", "").split("/")[-1]
        workflow_id = workflow.get("id")

        # Skip if this is a test/benchmark workflow
        if workflow_name in exclude_workflows:
            continue
```
Workflow exclusion logic is broken when workflows are configured by ID. The exclude_workflows set contains workflow IDs as strings (e.g., "163901189" from CI_TESTS_BLOCKING_MERGE_WORKFLOWS), but line 326 compares against workflow_name which is the workflow's name string from the API (e.g., "Test and Benchmark"). This mismatch means workflows configured by ID will never be excluded, causing incorrect failing workflow counts.
Fix: Also check against workflow_id:
```python
def _count_other_workflows_failing(self) -> int:
    """Count workflows (excluding tests/benchmarks) whose latest run is failing."""
    all_workflows = self.github.list_workflows(self.config.repo)
    exclude_workflows = set(self.config.tests_blocking_merge_workflows) | set(self.config.benchmarks_workflows)
    failing_count = 0
    for workflow in all_workflows:
        workflow_name = workflow.get("name") or workflow.get("path", "").split("/")[-1]
        workflow_id = str(workflow.get("id", ""))
        # Skip if this workflow matches by name OR id
        if workflow_name in exclude_workflows or workflow_id in exclude_workflows:
            continue
```
Spotted by Graphite Agent
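The type mismatch is easy to reproduce in isolation. A minimal sketch with hypothetical workflow data (not the repo's actual config or API payloads):

```python
# Workflows as returned by the GitHub API: "id" is an integer.
workflows = [
    {"id": 163901189, "name": "Test and Benchmark"},
    {"id": 555, "name": "Deploy"},
]

# IDs configured via environment variables arrive as strings.
exclude = {"163901189"}

# Name-only filtering never matches an ID, so nothing is excluded.
kept_by_name = [w for w in workflows if w["name"] not in exclude]
print([w["name"] for w in kept_by_name])  # ['Test and Benchmark', 'Deploy']

# Checking the stringified ID as well excludes the configured workflow.
kept = [
    w for w in workflows
    if w["name"] not in exclude and str(w.get("id", "")) not in exclude
]
print([w["name"] for w in kept])  # ['Deploy']
```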
Force-pushed from fcf31e7 to 6e76dcb.
Introduce MetricSample and MetricKind types for Datadog metric ingestion. Establishes foundation for infrastructure health monitoring.
…culations Provide helpers for parsing GitHub timestamps, ISO8601 dates, and computing percentiles for CI duration metrics.
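A minimal sketch of the kind of percentile helper described; the function name and the linear-interpolation choice are assumptions for illustration, not necessarily what the PR implements:

```python
def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks.
    values: iterable of numbers (e.g. workflow durations in seconds);
    pct: percentile in [0, 100]. Returns None for empty input."""
    ordered = sorted(values)
    if not ordered:
        return None
    if len(ordered) == 1:
        return ordered[0]
    rank = (pct / 100) * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

durations = [88, 95, 99, 120, 133, 142, 176, 201, 250, 310]
print(percentile(durations, 90))  # ~256.0
```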
Add abstract BaseCollector class with standardized metric naming, tag generation, and sample building. Includes registry pattern for collector discovery.
Implement client for querying GitHub Actions workflows, runs, and PR search. Supports workflow discovery and latest run status checks.
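The latest-run status check can be sketched as follows, using the real `status`/`conclusion` fields from the GitHub Actions runs API but with a hypothetical helper name:

```python
def latest_run_is_failing(runs):
    """Given a workflow's runs sorted newest-first (the order the GitHub
    API returns them in), report whether the most recent *completed*
    run ended in failure; in-progress runs are skipped."""
    for run in runs:
        if run.get("status") == "completed":
            return run.get("conclusion") == "failure"
    return False  # no completed runs yet

runs = [
    {"status": "in_progress", "conclusion": None},
    {"status": "completed", "conclusion": "failure"},
    {"status": "completed", "conclusion": "success"},
]
print(latest_run_is_failing(runs))  # True
```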
Add client for submitting metrics to Datadog via Metrics API v2. Handles authentication via environment variables or AWS Secrets Manager. Supports all metric types (gauge, count, distribution).
Add collector for GitHub Actions CI metrics including:
- Workflow success rates and flakiness
- P90 duration tracking
- Cancelled job counts
- Tests blocking merge status
- Benchmarks status
- Other workflow failure counts
- Weekly PR metrics (hotfixes, force merges, reverts)

Configurable via environment variables with thresholds matching infrastructure health requirements.
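A sketch of how the success-rate side of this might be computed (a hypothetical helper, assuming the GitHub Actions `status`/`conclusion` run fields; not the PR's actual code):

```python
def success_rate(runs):
    """Fraction of completed runs that concluded successfully;
    None when there is nothing completed to measure."""
    completed = [r for r in runs if r.get("status") == "completed"]
    if not completed:
        return None
    passed = sum(1 for r in completed if r.get("conclusion") == "success")
    return passed / len(completed)

runs = [
    {"status": "completed", "conclusion": "success"},
    {"status": "completed", "conclusion": "failure"},
    {"status": "completed", "conclusion": "success"},
    {"status": "in_progress", "conclusion": None},
]
print(success_rate(runs))  # 2 of 3 completed runs passed
```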
Implement collectors for stable-suite training and evaluation metrics. Read from JSON files published by training/eval pipelines and convert to standardized Datadog metric format.
Implement Typer-based CLI with commands for:
- Listing available collectors
- Running collectors with dry-run mode
- Pushing metrics to Datadog
- Listing workflows for configuration

Supports --dry-run for testing and --push for production.
Include example JSON schemas and sample data for testing collectors. Documents expected format for stable-suite pipeline outputs.
Build container image for Datadog metric collection. Installs workspace dependencies (softmax, metta-common) and sets up Python environment for collectors.
Add support for DD_ENV, DD_SERVICE, DD_VERSION, and DD_SITE environment variables. Extend extraEnv support for collector configuration.
Add dashboard-cronjob release with:
- 10-minute schedule for CI metrics
- Workflow IDs for tests blocking merge and benchmarks
- Datadog service tagging
- Proper image tagging for CI/CD integration
Replace all relative imports (..models, .base) with absolute imports (devops.datadog.models, devops.datadog.collectors.base) to comply with TID252 linting rule. Fixes 20 linting errors.
…x CronJob command" This reverts commit 0f29805.
- Change Helm chart to use the existing 'standard-service-account' IAM role instead of the non-existent 'monitoring-cronjobs' role
- Update Terraform to allow monitoring namespace service accounts to assume the role
- This enables cronjobs in the monitoring namespace to access AWS Secrets Manager for the GitHub token
- Add command field to training-health-cronjob and eval-health-cronjob
- Fix workflow exclusion logic to check both workflow name and ID
Fall back to GAUGE if DISTRIBUTION is not available in the installed datadog-api-client version.
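The fallback can be sketched like this; `MetricKind` here is a hypothetical stand-in for the client library's metric-type enum so the example stays self-contained:

```python
from enum import Enum

class MetricKind(Enum):
    # Simulates an older client version in which DISTRIBUTION is absent.
    GAUGE = "gauge"
    COUNT = "count"

def resolve_kind(requested: str) -> MetricKind:
    """Return the requested metric kind, degrading to GAUGE when the
    installed client version does not define it."""
    try:
        return MetricKind[requested]
    except KeyError:
        return MetricKind.GAUGE

print(resolve_kind("COUNT").value)         # count
print(resolve_kind("DISTRIBUTION").value)  # gauge
```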
Force-pushed from 6e76dcb to a961ffa.
feat(datadog): infrastructure health dashboard and metrics ingestion
Implements the Datadog ingestion pipeline for infrastructure health metrics per plan:
Metrics are flowing from my branch and visible in Datadog. Ready for review!
Asana Task