diff --git a/docs/development/CI/TroubleshootingCIFailures.md b/docs/development/CI/TroubleshootingCIFailures.md
index 13e0bc1e956e..06e03137d2a8 100644
--- a/docs/development/CI/TroubleshootingCIFailures.md
+++ b/docs/development/CI/TroubleshootingCIFailures.md
@@ -54,6 +54,37 @@ The human-readable build URL format is:
https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=<BUILD_ID>
```

+### Using Azure DevOps MCP (AI Assistant Integration)
+
+If you're using an AI assistant with the Azure DevOps MCP server, you can use these tools for cleaner queries:
+
+#### Get build information
+Ask your assistant to use `mcp__azure-devops__pipelines_get_builds` with:
+- `project: "dd-trace-dotnet"`
+- `buildIds: [<BUILD_ID>]`
+
+This returns structured build data including status, result, queue time, and trigger information.
+
+#### Get build logs
+Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log` with:
+- `project: "dd-trace-dotnet"`
+- `buildId: <BUILD_ID>`
+
+Note: Logs for large builds can exceed token limits. In that case, fall back to curl/jq to target specific log IDs.
+
+#### Get specific log by ID
+Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log_by_id` with:
+- `project: "dd-trace-dotnet"`
+- `buildId: <BUILD_ID>`
+- `logId: <LOG_ID>`
+- Optional: `startLine` and `endLine` to limit output
+
+**Advantages of the MCP approach:**
+- Structured JSON responses (no manual parsing)
+- Works naturally in conversation with AI assistants
+- Handles authentication automatically
+- Can combine multiple queries in a single request
+
## Investigating Test Failures

### Find failed tasks in a build
@@ -73,18 +104,48 @@ curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_a
### Download and search logs

+#### Option 1: Using curl (works everywhere)
+
```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
  | grep -i "fail\|error"
```

+#### Option 2: Using Azure CLI (recommended on Windows)
+
+```bash
+az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>?api-version=7.0" \
+  2>&1 | grep -i "fail\|error"
+```
+
+Note: You may see a warning about authentication; this is safe to ignore for public builds.
+
+#### Option 3: Using GitHub CLI for a quick overview
+
+```bash
+# Get quick summary of all checks for a PR
+gh pr checks
+
+# Get detailed PR status including links to Azure DevOps
+gh pr view --json statusCheckRollup
+```
+
### Get detailed context around failures

+Using curl:
+
```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
  | grep -A 30 "TestName.That.Failed"
```

+Or with Azure CLI:
+
+```bash
+az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>?api-version=7.0" \
+  2>&1 | grep -A 30 "TestName.That.Failed"
+```
+
## Mapping Commits to Builds

Azure DevOps builds test **merge commits** (`refs/pull/<PR_NUMBER>/merge`), not branch commits directly.
@@ -107,8 +168,156 @@ To find which branch commit caused a failure:

The build queued shortly after the commit was pushed is likely testing that commit.
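+
+To check this from the command line, here is a minimal sketch (it assumes `jq` is available; `<BUILD_ID>` and `<BRANCH_NAME>` are placeholders you fill in):
+
+```bash
+# Queue time and merge commit of the build (anonymous access works for public builds)
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>" \
+  | jq '{queueTime, sourceVersion}'
+
+# Branch commits with author dates; the newest commit pushed before that
+# queue time is most likely the one the build is testing
+git log --format="%H %aI %s" origin/<BRANCH_NAME> | head -20
+```
+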
+## Determining If Failures Are Related to Your Changes
+
+When tests fail on master after your PR is merged, determine whether the failures are new or pre-existing:
+
+### Compare with previous build on master
+
+```bash
+# List recent builds on master
+az pipelines runs list \
+  --organization https://dev.azure.com/datadoghq \
+  --project dd-trace-dotnet \
+  --branch master \
+  --top 10 \
+  --query "[].{id:id, result:result, sourceVersion:sourceVersion, finishTime:finishTime}" \
+  --output table
+
+# Find the build for the commit before yours
+git log --oneline -2  # Identify your commit and the previous one
+
+# Compare failed tasks between builds
+# Your build:
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<YOUR_BUILD_ID>/timeline" \
+  | jq -r '.records[] | select(.result == "failed") | .name'
+
+# Previous build:
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<PREVIOUS_BUILD_ID>/timeline" \
+  | jq -r '.records[] | select(.result == "failed") | .name'
+```
+
+**New failures** that appear only in your build → likely related to your changes.
+**Same failures** in both builds → likely pre-existing/flaky tests.
+
+### Master-only tests
+
+Some tests (profiler integration tests, exploration tests) run only on the master branch, not on PRs. If you see failures on master that you didn't see on your PR:
+
+1. This is expected: those tests don't run on PRs
+2. Compare with the previous successful master build to confirm the failures are new
+3. If they are new, they are likely related to your changes
+
+## Understanding Test Infrastructure
+
+When test failures don't make obvious sense, investigate the test infrastructure to understand how the tests are configured.
+
+### Finding test configuration
+
+Tests may set up the environment differently from production code. For example:
+
+```bash
+# Find how a specific test sets up environment variables
+grep -r "DD_DOTNET_TRACER_HOME\|DD_TRACE_ENABLED" profiler/test/
+
+# Look for test helper classes
+find . -name "*EnvironmentHelper*.cs" -o -name "*TestRunner*.cs"
+
+# Check what environment variables a test actually sets
+# Read the test code path from the failing test name:
+# Example: Datadog.Profiler.SmokeTests.WebsiteAspNetCore01Test.CheckSmoke
+# Path: profiler/test/Datadog.Profiler.IntegrationTests/SmokeTests/WebsiteAspNetCore01Test.cs
+```
+
+**Common gotchas:**
+- Profiler tests may disable the tracer (`DD_TRACE_ENABLED=0`)
+- Different test suites (tracer vs profiler) have different configurations
+- The test environment may not match a production deployment
+
+### Cross-cutting test failures
+
+Changes in one component may affect tests for another component:
+
+- **Managed tracer changes** may affect profiler tests (they share the managed loader)
+- **Native changes** may affect managed tests (if they change initialization order)
+- **Environment variable handling** may affect both tracer and profiler
+
+**Investigation strategy:**
+1. Identify which component the failing test is for (tracer, profiler, debugger, etc.)
+2. Compare with your changes: do they touch shared infrastructure?
+3. Check if the test configuration differs from production (e.g., disabled features)
+4. Trace through initialization code to find the interaction point
+
+### Tracing error messages to source code
+
+When you find an error message in logs, trace it back to source code:
+
+```bash
+# Search for the error message across the codebase
+grep -r "One or multiple services failed to start" .
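+# (Optional) A narrower variant of the same search: limiting it to the source
+# trees cuts noise from build output and test artifacts; the paths below are illustrative.
+#   grep -rn "One or multiple services failed to start" tracer/src profiler/src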
+
+# Example output:
+# profiler/src/ProfilerEngine/Datadog.Profiler.Native/CorProfilerCallback.cpp:710
+# Log::Error("One or multiple services failed to start after a delay...");
+```
+
+This helps you understand:
+- Which component is logging the error (native/managed, tracer/profiler)
+- The context of the failure (initialization, shutdown, runtime)
+- Related code that might be affected
+
## Common Test Failure Patterns

+### Infrastructure Failures (Not Your Code)
+
+Some failures are infrastructure-related and can be retried without code changes:
+
+#### Docker Rate Limiting
+
+```
+toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
+```
+
+**Solution**: Retry the failed job in Azure DevOps. This is a transient Docker Hub rate limit issue.
+
+#### Timeout/Network Issues
+
+```
+##[error]The job running on runner X has exceeded the maximum execution time
+TLS handshake timeout
+Connection reset by peer
+```
+
+**Solution**: Retry the failed job. These are typically transient network issues.
+
+#### Identifying Flaky Tests and Retry Attempts
+
+Azure DevOps automatically retries some failed stages. You can identify retried tasks in the build timeline:
+
+**Using curl/jq:**
+```bash
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/timeline" \
+  | jq -r '.records[] | select(.previousAttempts != null and (.previousAttempts | length) > 0) | "\(.name): attempt \(.attempt), previous attempts: \(.previousAttempts | length)"'
+```
+
+**Using Azure DevOps MCP:**
+Ask your assistant to check the build timeline for tasks with `previousAttempts` or `attempt > 1`.
+
+**What this means:**
+- `"attempt": 2` with `"result": "succeeded"` → The task failed initially but passed on retry (likely a flake)
+- `"previousAttempts": [...]` → Contains IDs of previous failed attempts
+
+**When you see retried tasks:**
+1. If a task succeeded on retry after an initial failure, it's likely a flaky/intermittent issue
+2. The overall build result may still show as "failed" even if the retry succeeded, depending on pipeline configuration
+3. Check if the failure pattern is known (see "Flaky Profiler Stack Walking Failures" below)
+
+**How to retry a failed job:**
+1. Open the build in Azure DevOps: `https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=<BUILD_ID>`
+2. Find the failed stage/job
+3. Click the "..." menu → "Retry failed stages" or "Retry stage"
+4. Only failed stages will be retried; successful stages are not re-run
+
### Unit Test Failures

Failed unit tests typically appear in logs as:
@@ -131,6 +340,25 @@ Integration test failures may indicate:
Check the specific integration test logs for details about which service or scenario failed.

+#### Flaky Profiler Stack Walking Failures (Alpine/musl)
+
+**Symptom:**
+```
+Failed to walk N stacks for sampled exception: E_FAIL (80004005)
+```
+or
+```
+Failed to walk N stacks for sampled exception: CORPROF_E_STACKSNAPSHOT_UNSAFE
+```
+
+**Appears in**: Smoke tests on Alpine Linux (musl libc), particularly `installer_smoke_tests` → `linux alpine_3_1-alpine3_14`
+
+**Cause**: A race condition in the profiler when unwinding call stacks while threads are running. This is a known limitation on Alpine/musl platforms and appears intermittently.
+
+**Solution**: Retry the failed job. The smoke test check `CheckSmokeTestsForErrors` has an allowlist for known patterns, but some error codes like `E_FAIL` may occasionally slip through.
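+
+Before retrying, you can confirm that the failure matches this known pattern; a small sketch (reusing the log-download endpoint from above, with `<BUILD_ID>`/`<LOG_ID>` taken from the failed task's timeline entry):
+
+```bash
+# Count occurrences of the known stack-walking error in the smoke test log
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
+  | grep -c "Failed to walk .* stacks for sampled exception"
+```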
+
+**Note**: The profiler only logs these warnings every 100 failures to avoid log spam, so seeing this message indicates that multiple stack walking attempts have failed.
+
### Build Failures

Build failures typically show:
@@ -159,6 +387,35 @@ Common verification failures:
## Example Investigation Workflow

+### Quick Investigation (AI Assistant with MCP)
+
+If you're using an AI assistant with the Azure DevOps MCP server:
+
+```
+"Why did Azure DevOps build fail?"
+```
+
+The assistant will:
+1. Get build information using `mcp__azure-devops__pipelines_get_builds`
+2. Identify the result and any failed stages
+3. Check for retry attempts to identify flaky tests
+4. Provide guidance on whether to retry or investigate further
+
+### Quick Investigation (GitHub CLI)
+
+```bash
+# 1. Get quick overview of all checks
+gh pr checks
+
+# 2. If Azure DevOps checks failed, check the logs directly
+# Get the build ID from the Azure DevOps URL in the output above, then:
+BUILD_ID=<BUILD_ID>
+az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/?api-version=7.0" \
+  2>&1 | grep -i "error\|fail\|toomanyrequests"
+```
+
+### Detailed Investigation
+
```bash
# 1. Find your PR number
gh pr list --head <BRANCH_NAME>
@@ -175,11 +432,17 @@ BUILD_ID=
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/timeline" \
  | jq -r '.records[] | select(.result == "failed") | "\(.name): log.id=\(.log.id)"'

-# 4. Download and examine logs
+# 4. Download and examine logs (choose one method)
LOG_ID=<LOG_ID>
+
+# Using curl:
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}" \
  | grep -A 30 "FAIL"
+
+# Or using Azure CLI (Windows):
+az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}?api-version=7.0" \
+  2>&1 | grep -A 30 "FAIL"
+
# 5. Open build in browser for full details
open "https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=${BUILD_ID}"
```
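+
+If you suspect a flake, a quick follow-up is to check whether any task already failed once and passed on retry (a sketch reusing `${BUILD_ID}` and the timeline query used above):
+
+```bash
+# 6. (Optional) List tasks that were retried, with their final result
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/timeline" \
+  | jq -r '.records[] | select(.attempt > 1) | "\(.name): attempt \(.attempt) -> \(.result)"'
+```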