From 71e2d80d28569948e4270941445fac8bcf4168d9 Mon Sep 17 00:00:00 2001 From: Lucas Pimentel Date: Fri, 24 Oct 2025 10:51:50 -0400 Subject: [PATCH 1/3] [docs] Improve CI troubleshooting documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add new sections to help developers investigate test failures: 1. **Determining If Failures Are Related to Your Changes** - How to compare builds before/after your commit - Commands to list recent master builds and compare failed tasks - Explanation of master-only tests that don't run on PRs 2. **Understanding Test Infrastructure** - Finding test configuration and environment variable setup - Common gotchas (e.g., profiler tests disabling tracer) - Cross-cutting test failures between components - Tracing error messages to source code These additions document investigation techniques used to diagnose profiler test failures after PR #7568, including: - Comparing master builds to determine if failures are new vs pre-existing - Reading test infrastructure code to understand environment setup - Understanding how changes in one component (tracer) can affect tests for another component (profiler) - Using grep to trace error messages back to source code 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../CI/TroubleshootingCIFailures.md | 173 +++++++++++++++++- 1 file changed, 172 insertions(+), 1 deletion(-) diff --git a/docs/development/CI/TroubleshootingCIFailures.md b/docs/development/CI/TroubleshootingCIFailures.md index 13e0bc1e956e..70193fcdb34f 100644 --- a/docs/development/CI/TroubleshootingCIFailures.md +++ b/docs/development/CI/TroubleshootingCIFailures.md @@ -73,18 +73,48 @@ curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_a ### Download and search logs +#### Option 1: Using curl (works everywhere) + ```bash curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//logs/" \ | grep -i "fail\|error" ``` +#### Option 2: Using Azure CLI (recommended on Windows) + +```bash +az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//logs/?api-version=7.0" \ + 2>&1 | grep -i "fail\|error" +``` + +Note: You may see a warning about authentication - this is safe to ignore for public builds. + +#### Option 3: Using GitHub CLI for quick overview + +```bash +# Get quick summary of all checks for a PR +gh pr checks + +# Get detailed PR status including links to Azure DevOps +gh pr view --json statusCheckRollup +``` + ### Get detailed context around failures +Using curl: + ```bash curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//logs/" \ | grep -A 30 "TestName.That.Failed" ``` +Or with Azure CLI: + +```bash +az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//logs/?api-version=7.0" \ + 2>&1 | grep -A 30 "TestName.That.Failed" +``` + ## Mapping Commits to Builds Azure DevOps builds test **merge commits** (`refs/pull//merge`), not branch commits directly. @@ -107,8 +137,128 @@ To find which branch commit caused a failure: The build queued shortly after the commit was pushed is likely testing that commit. 
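+
+For example, a minimal sketch of this lookup (it assumes the Azure CLI `azure-devops` extension used later in this guide; exact field names may vary by CLI version):
+
+```bash
+# Approximate when the commit was pushed (committer date, ISO 8601)
+git show -s --format=%cI <commit-sha>
+
+# List recent builds with their queue times; the build queued just after the push
+# is the likely candidate
+az pipelines runs list \
+  --organization https://dev.azure.com/datadoghq \
+  --project dd-trace-dotnet \
+  --top 20 \
+  --query "[].{id:id, queueTime:queueTime, sourceBranch:sourceBranch, result:result}" \
+  --output table
+```
+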
+## Determining If Failures Are Related to Your Changes + +When tests fail on master after your PR is merged, determine if failures are new or pre-existing: + +### Compare with previous build on master + +```bash +# List recent builds on master +az pipelines runs list \ + --organization https://dev.azure.com/datadoghq \ + --project dd-trace-dotnet \ + --branch master \ + --top 10 \ + --query "[].{id:id, result:result, sourceVersion:sourceVersion, finishTime:finishTime}" \ + --output table + +# Find the build for the commit before yours +git log --oneline HEAD~1..HEAD # Identify your commit and the previous one + +# Compare failed tasks between builds +# Your build: +curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//timeline" \ + | jq -r '.records[] | select(.result == "failed") | .name' + +# Previous build: +curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//timeline" \ + | jq -r '.records[] | select(.result == "failed") | .name' +``` + +**New failures** only appear in your build → likely related to your changes. +**Same failures** appear in both → likely pre-existing/flaky tests. + +### Master-only tests + +Some tests (profiler integration tests, exploration tests) run only on master branch, not on PRs. If you see failures on master that you didn't see on your PR: + +1. This is expected - those tests don't run on PRs +2. Compare with the previous successful master build to confirm they're new +3. The failures are likely related to your changes + +## Understanding Test Infrastructure + +When test failures don't make obvious sense, investigate the test infrastructure to understand how tests are configured. + +### Finding test configuration + +Tests may set up environments differently than production code. For example: + +```bash +# Find how a specific test sets up environment variables +grep -r "DD_DOTNET_TRACER_HOME\|DD_TRACE_ENABLED" profiler/test/ + +# Look for test helper classes +find . -name "*EnvironmentHelper*.cs" -o -name "*TestRunner*.cs" + +# Check what environment variables a test actually sets +# Read the test code path from failing test name: +# Example: Datadog.Profiler.SmokeTests.WebsiteAspNetCore01Test.CheckSmoke +# Path: profiler/test/Datadog.Profiler.IntegrationTests/SmokeTests/WebsiteAspNetCore01Test.cs +``` + +**Common gotchas:** +- Profiler tests may disable the tracer (`DD_TRACE_ENABLED=0`) +- Different test suites (tracer vs profiler) have different configurations +- Test environment may not match production deployment + +### Cross-cutting test failures + +Changes in one component may affect tests for another component: + +- **Managed tracer changes** may affect profiler tests (they share the managed loader) +- **Native changes** may affect managed tests (if they change initialization order) +- **Environment variable handling** may affect both tracer and profiler + +**Investigation strategy:** +1. Identify which component the failing test is for (tracer, profiler, debugger, etc.) +2. Compare with your changes - do they touch shared infrastructure? +3. Check if test configuration differs from production (e.g., disabled features) +4. Trace through initialization code to find the interaction point + +### Tracing error messages to source code + +When you find an error message in logs, trace it back to source code: + +```bash +# Search for the error message across the codebase +grep -r "One or multiple services failed to start" . 
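+# Tip: add -n to include line numbers (as in the example output below), and use
+# --include="*.cpp" --include="*.cs" to limit the search to source files if results are noisy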
+ +# Example output: +# profiler/src/ProfilerEngine/Datadog.Profiler.Native/CorProfilerCallback.cpp:710 +# Log::Error("One or multiple services failed to start after a delay..."); +``` + +This helps you understand: +- Which component is logging the error (native/managed, tracer/profiler) +- The context of the failure (initialization, shutdown, runtime) +- Related code that might be affected + ## Common Test Failure Patterns +### Infrastructure Failures (Not Your Code) + +Some failures are infrastructure-related and can be retried without code changes: + +#### Docker Rate Limiting + +``` +toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit +``` + +**Solution**: Retry the failed job in Azure DevOps. This is a transient Docker Hub rate limit issue. + +#### Timeout/Network Issues + +``` +##[error]The job running on runner X has exceeded the maximum execution time +TLS handshake timeout +Connection reset by peer +``` + +**Solution**: Retry the failed job. These are typically transient network issues. + ### Unit Test Failures Failed unit tests typically appear in logs as: @@ -159,6 +309,21 @@ Common verification failures: ## Example Investigation Workflow +### Quick Investigation (GitHub CLI) + +```bash +# 1. Get quick overview of all checks +gh pr checks + +# 2. If Azure DevOps checks failed, check the logs directly +# Get the build ID from the Azure DevOps URL in the output above, then: +BUILD_ID= +az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/?api-version=7.0" \ + 2>&1 | grep -i "error\|fail\|toomanyrequests" +``` + +### Detailed Investigation + ```bash # 1. Find your PR number gh pr list --head @@ -175,11 +340,17 @@ BUILD_ID= curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/timeline" \ | jq -r '.records[] | select(.result == "failed") | "\(.name): log.id=\(.log.id)"' -# 4. Download and examine logs +# 4. Download and examine logs (choose one method) LOG_ID= + +# Using curl: curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}" \ | grep -A 30 "FAIL" +# Or using Azure CLI (Windows): +az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}?api-version=7.0" \ + 2>&1 | grep -A 30 "FAIL" + # 5. Open build in browser for full details open "https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=${BUILD_ID}" ``` From b68a012a241fbb8442670f18a967c50b8ccecfc7 Mon Sep 17 00:00:00 2001 From: Lucas Pimentel Date: Fri, 24 Oct 2025 15:16:22 -0400 Subject: [PATCH 2/3] [docs] Add documentation for flaky Alpine profiler stack walking failures MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Document known intermittent profiler failures on Alpine/musl when walking exception stacks. These failures appear in smoke tests with error codes like E_FAIL (80004005) or CORPROF_E_STACKSNAPSHOT_UNSAFE. The errors are caused by race conditions during stack unwinding on musl-based platforms and can be resolved by retrying the failed job. 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../CI/TroubleshootingCIFailures.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/docs/development/CI/TroubleshootingCIFailures.md b/docs/development/CI/TroubleshootingCIFailures.md index 70193fcdb34f..a14345401f9e 100644 --- a/docs/development/CI/TroubleshootingCIFailures.md +++ b/docs/development/CI/TroubleshootingCIFailures.md @@ -281,6 +281,25 @@ Integration test failures may indicate: Check the specific integration test logs for details about which service or scenario failed. +#### Flaky Profiler Stack Walking Failures (Alpine/musl) + +**Symptom:** +``` +Failed to walk N stacks for sampled exception: E_FAIL (80004005) +``` +or +``` +Failed to walk N stacks for sampled exception: CORPROF_E_STACKSNAPSHOT_UNSAFE +``` + +**Appears in**: Smoke tests on Alpine Linux (musl libc), particularly `installer_smoke_tests` → `linux alpine_3_1-alpine3_14` + +**Cause**: Race condition in the profiler when unwinding call stacks while threads are running. This is a known limitation on Alpine/musl platforms and appears intermittently. + +**Solution**: Retry the failed job. The smoke test check `CheckSmokeTestsForErrors` has an allowlist for known patterns, but some error codes like `E_FAIL` may occasionally slip through. + +**Note**: The profiler only logs these warnings every 100 failures to avoid log spam, so seeing this message indicates multiple stack walking attempts have failed. + ### Build Failures Build failures typically show: From bc4c8df87498f085047589cd6b5b64f7890fad39 Mon Sep 17 00:00:00 2001 From: Lucas Pimentel Date: Fri, 24 Oct 2025 15:49:59 -0400 Subject: [PATCH 3/3] [docs] Improve CI troubleshooting documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Enhanced the CI troubleshooting guide with: - More detailed guidance on finding and analyzing failed jobs - Step-by-step instructions for downloading and examining logs - Common failure patterns and their solutions - Better structure and navigation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../CI/TroubleshootingCIFailures.md | 73 +++++++++++++++++++ 1 file changed, 73 insertions(+) diff --git a/docs/development/CI/TroubleshootingCIFailures.md b/docs/development/CI/TroubleshootingCIFailures.md index a14345401f9e..06e03137d2a8 100644 --- a/docs/development/CI/TroubleshootingCIFailures.md +++ b/docs/development/CI/TroubleshootingCIFailures.md @@ -54,6 +54,37 @@ The human-readable build URL format is: https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId= ``` +### Using Azure DevOps MCP (AI Assistant Integration) + +If you're using an AI assistant with the Azure DevOps MCP server, you can use these tools for cleaner queries: + +#### Get build information +Ask your assistant to use `mcp__azure-devops__pipelines_get_builds` with: +- `project: "dd-trace-dotnet"` +- `buildIds: []` + +This returns structured build data including status, result, queue time, and trigger information. + +#### Get build logs +Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log` with: +- `project: "dd-trace-dotnet"` +- `buildId: ` + +Note: Large builds may have very large logs that exceed token limits. In that case, fall back to curl/jq to target specific log IDs. 
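+
+For example, a minimal curl/jq sketch to list the individual log IDs for a build (it reuses the REST endpoint shown earlier in this guide; the `lineCount` field gives a rough idea of each log's size):
+
+```bash
+# List all logs for a build with their IDs and line counts,
+# then fetch a single log with the curl commands shown above
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<build-id>/logs?api-version=7.0" \
+  | jq -r '.value[] | "log \(.id): \(.lineCount // 0) lines"'
+```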
+ +#### Get specific log by ID +Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log_by_id` with: +- `project: "dd-trace-dotnet"` +- `buildId: ` +- `logId: ` +- Optional: `startLine` and `endLine` to limit output + +**Advantages of MCP approach:** +- Structured JSON responses (no manual parsing) +- Works naturally in conversation with AI assistants +- Handles authentication automatically +- Can combine multiple queries in a single request + ## Investigating Test Failures ### Find failed tasks in a build @@ -259,6 +290,34 @@ Connection reset by peer **Solution**: Retry the failed job. These are typically transient network issues. +#### Identifying Flaky Tests and Retry Attempts + +Azure DevOps automatically retries some failed stages. You can identify retried tasks in the build timeline: + +**Using curl/jq:** +```bash +curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//timeline" \ + | jq -r '.records[] | select(.previousAttempts != null and (.previousAttempts | length) > 0) | "\(.name): attempt \(.attempt), previous attempts: \(.previousAttempts | length)"' +``` + +**Using Azure DevOps MCP:** +Ask your assistant to check the build timeline for tasks with `previousAttempts` or `attempt > 1`. + +**What this means:** +- `"attempt": 2` with `"result": "succeeded"` → The task failed initially but passed on retry (likely a flake) +- `"previousAttempts": [...]` → Contains IDs of previous failed attempts + +**When you see retried tasks:** +1. If a task succeeded on retry after an initial failure, it's likely a flaky/intermittent issue +2. The overall build result may still show as "failed" even if the retry succeeded, depending on pipeline configuration +3. Check if the failure pattern is known (see "Flaky Profiler Stack Walking Failures" below) + +**How to retry a failed job:** +1. Open the build in Azure DevOps: `https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=` +2. Find the failed stage/job +3. Click the "..." menu → "Retry failed stages" or "Retry stage" +4. Only failed stages will be retried; successful stages are not re-run + ### Unit Test Failures Failed unit tests typically appear in logs as: @@ -328,6 +387,20 @@ Common verification failures: ## Example Investigation Workflow +### Quick Investigation (AI Assistant with MCP) + +If you're using an AI assistant with Azure DevOps MCP: + +``` +"Why did Azure DevOps build fail?" +``` + +The assistant will: +1. Get build information using `mcp__azure-devops__pipelines_get_builds` +2. Identify the result and any failed stages +3. Check for retry attempts to identify flaky tests +4. Provide guidance on whether to retry or investigate further + ### Quick Investigation (GitHub CLI) ```bash