From 71e2d80d28569948e4270941445fac8bcf4168d9 Mon Sep 17 00:00:00 2001 From: Lucas Pimentel Date: Fri, 24 Oct 2025 10:51:50 -0400 Subject: [PATCH 1/3] [docs] Improve CI troubleshooting documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add new sections to help developers investigate test failures: 1. **Determining If Failures Are Related to Your Changes** - How to compare builds before/after your commit - Commands to list recent master builds and compare failed tasks - Explanation of master-only tests that don't run on PRs 2. **Understanding Test Infrastructure** - Finding test configuration and environment variable setup - Common gotchas (e.g., profiler tests disabling tracer) - Cross-cutting test failures between components - Tracing error messages to source code These additions document investigation techniques used to diagnose profiler test failures after PR #7568, including: - Comparing master builds to determine if failures are new vs pre-existing - Reading test infrastructure code to understand environment setup - Understanding how changes in one component (tracer) can affect tests for another component (profiler) - Using grep to trace error messages back to source code 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../CI/TroubleshootingCIFailures.md | 173 +++++++++++++++++- 1 file changed, 172 insertions(+), 1 deletion(-) diff --git a/docs/development/CI/TroubleshootingCIFailures.md b/docs/development/CI/TroubleshootingCIFailures.md index 13e0bc1e956e..70193fcdb34f 100644 --- a/docs/development/CI/TroubleshootingCIFailures.md +++ b/docs/development/CI/TroubleshootingCIFailures.md @@ -73,18 +73,48 @@ curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_a ### Download and search logs +#### Option 1: Using curl (works everywhere) + ```bash curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//logs/" \ | grep -i "fail\|error" ``` +#### Option 2: Using Azure CLI (recommended on Windows) + +```bash +az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//logs/?api-version=7.0" \ + 2>&1 | grep -i "fail\|error" +``` + +Note: You may see a warning about authentication - this is safe to ignore for public builds. + +#### Option 3: Using GitHub CLI for quick overview + +```bash +# Get quick summary of all checks for a PR +gh pr checks + +# Get detailed PR status including links to Azure DevOps +gh pr view --json statusCheckRollup +``` + ### Get detailed context around failures +Using curl: + ```bash curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//logs/" \ | grep -A 30 "TestName.That.Failed" ``` +Or with Azure CLI: + +```bash +az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//logs/?api-version=7.0" \ + 2>&1 | grep -A 30 "TestName.That.Failed" +``` + ## Mapping Commits to Builds Azure DevOps builds test **merge commits** (`refs/pull//merge`), not branch commits directly. @@ -107,8 +137,128 @@ To find which branch commit caused a failure: The build queued shortly after the commit was pushed is likely testing that commit. 
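+
+For example, a minimal sketch of this lookup (it assumes the Azure CLI `azure-devops` extension used later in this guide; exact field names may vary by CLI version):
+
+```bash
+# Approximate when the commit was pushed (committer date, ISO 8601)
+git show -s --format=%cI <commit-sha>
+
+# List recent builds with their queue times; the build queued just after the push
+# is the likely candidate
+az pipelines runs list \
+  --organization https://dev.azure.com/datadoghq \
+  --project dd-trace-dotnet \
+  --top 20 \
+  --query "[].{id:id, queueTime:queueTime, sourceBranch:sourceBranch, result:result}" \
+  --output table
+```
+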
+## Determining If Failures Are Related to Your Changes + +When tests fail on master after your PR is merged, determine if failures are new or pre-existing: + +### Compare with previous build on master + +```bash +# List recent builds on master +az pipelines runs list \ + --organization https://dev.azure.com/datadoghq \ + --project dd-trace-dotnet \ + --branch master \ + --top 10 \ + --query "[].{id:id, result:result, sourceVersion:sourceVersion, finishTime:finishTime}" \ + --output table + +# Find the build for the commit before yours +git log --oneline HEAD~1..HEAD # Identify your commit and the previous one + +# Compare failed tasks between builds +# Your build: +curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//timeline" \ + | jq -r '.records[] | select(.result == "failed") | .name' + +# Previous build: +curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//timeline" \ + | jq -r '.records[] | select(.result == "failed") | .name' +``` + +**New failures** only appear in your build → likely related to your changes. +**Same failures** appear in both → likely pre-existing/flaky tests. + +### Master-only tests + +Some tests (profiler integration tests, exploration tests) run only on master branch, not on PRs. If you see failures on master that you didn't see on your PR: + +1. This is expected - those tests don't run on PRs +2. Compare with the previous successful master build to confirm they're new +3. The failures are likely related to your changes + +## Understanding Test Infrastructure + +When test failures don't make obvious sense, investigate the test infrastructure to understand how tests are configured. + +### Finding test configuration + +Tests may set up environments differently than production code. For example: + +```bash +# Find how a specific test sets up environment variables +grep -r "DD_DOTNET_TRACER_HOME\|DD_TRACE_ENABLED" profiler/test/ + +# Look for test helper classes +find . -name "*EnvironmentHelper*.cs" -o -name "*TestRunner*.cs" + +# Check what environment variables a test actually sets +# Read the test code path from failing test name: +# Example: Datadog.Profiler.SmokeTests.WebsiteAspNetCore01Test.CheckSmoke +# Path: profiler/test/Datadog.Profiler.IntegrationTests/SmokeTests/WebsiteAspNetCore01Test.cs +``` + +**Common gotchas:** +- Profiler tests may disable the tracer (`DD_TRACE_ENABLED=0`) +- Different test suites (tracer vs profiler) have different configurations +- Test environment may not match production deployment + +### Cross-cutting test failures + +Changes in one component may affect tests for another component: + +- **Managed tracer changes** may affect profiler tests (they share the managed loader) +- **Native changes** may affect managed tests (if they change initialization order) +- **Environment variable handling** may affect both tracer and profiler + +**Investigation strategy:** +1. Identify which component the failing test is for (tracer, profiler, debugger, etc.) +2. Compare with your changes - do they touch shared infrastructure? +3. Check if test configuration differs from production (e.g., disabled features) +4. Trace through initialization code to find the interaction point + +### Tracing error messages to source code + +When you find an error message in logs, trace it back to source code: + +```bash +# Search for the error message across the codebase +grep -r "One or multiple services failed to start" . 
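+# Tip: add -n to include line numbers (as in the example output below), and use
+# --include="*.cpp" --include="*.cs" to limit the search to source files if results are noisy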
+ +# Example output: +# profiler/src/ProfilerEngine/Datadog.Profiler.Native/CorProfilerCallback.cpp:710 +# Log::Error("One or multiple services failed to start after a delay..."); +``` + +This helps you understand: +- Which component is logging the error (native/managed, tracer/profiler) +- The context of the failure (initialization, shutdown, runtime) +- Related code that might be affected + ## Common Test Failure Patterns +### Infrastructure Failures (Not Your Code) + +Some failures are infrastructure-related and can be retried without code changes: + +#### Docker Rate Limiting + +``` +toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit +``` + +**Solution**: Retry the failed job in Azure DevOps. This is a transient Docker Hub rate limit issue. + +#### Timeout/Network Issues + +``` +##[error]The job running on runner X has exceeded the maximum execution time +TLS handshake timeout +Connection reset by peer +``` + +**Solution**: Retry the failed job. These are typically transient network issues. + ### Unit Test Failures Failed unit tests typically appear in logs as: @@ -159,6 +309,21 @@ Common verification failures: ## Example Investigation Workflow +### Quick Investigation (GitHub CLI) + +```bash +# 1. Get quick overview of all checks +gh pr checks + +# 2. If Azure DevOps checks failed, check the logs directly +# Get the build ID from the Azure DevOps URL in the output above, then: +BUILD_ID= +az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/?api-version=7.0" \ + 2>&1 | grep -i "error\|fail\|toomanyrequests" +``` + +### Detailed Investigation + ```bash # 1. Find your PR number gh pr list --head @@ -175,11 +340,17 @@ BUILD_ID= curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/timeline" \ | jq -r '.records[] | select(.result == "failed") | "\(.name): log.id=\(.log.id)"' -# 4. Download and examine logs +# 4. Download and examine logs (choose one method) LOG_ID= + +# Using curl: curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}" \ | grep -A 30 "FAIL" +# Or using Azure CLI (Windows): +az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}?api-version=7.0" \ + 2>&1 | grep -A 30 "FAIL" + # 5. Open build in browser for full details open "https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=${BUILD_ID}" ``` From b68a012a241fbb8442670f18a967c50b8ccecfc7 Mon Sep 17 00:00:00 2001 From: Lucas Pimentel Date: Fri, 24 Oct 2025 15:16:22 -0400 Subject: [PATCH 2/3] [docs] Add documentation for flaky Alpine profiler stack walking failures MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Document known intermittent profiler failures on Alpine/musl when walking exception stacks. These failures appear in smoke tests with error codes like E_FAIL (80004005) or CORPROF_E_STACKSNAPSHOT_UNSAFE. The errors are caused by race conditions during stack unwinding on musl-based platforms and can be resolved by retrying the failed job. 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../CI/TroubleshootingCIFailures.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/docs/development/CI/TroubleshootingCIFailures.md b/docs/development/CI/TroubleshootingCIFailures.md index 70193fcdb34f..a14345401f9e 100644 --- a/docs/development/CI/TroubleshootingCIFailures.md +++ b/docs/development/CI/TroubleshootingCIFailures.md @@ -281,6 +281,25 @@ Integration test failures may indicate: Check the specific integration test logs for details about which service or scenario failed. +#### Flaky Profiler Stack Walking Failures (Alpine/musl) + +**Symptom:** +``` +Failed to walk N stacks for sampled exception: E_FAIL (80004005) +``` +or +``` +Failed to walk N stacks for sampled exception: CORPROF_E_STACKSNAPSHOT_UNSAFE +``` + +**Appears in**: Smoke tests on Alpine Linux (musl libc), particularly `installer_smoke_tests` → `linux alpine_3_1-alpine3_14` + +**Cause**: Race condition in the profiler when unwinding call stacks while threads are running. This is a known limitation on Alpine/musl platforms and appears intermittently. + +**Solution**: Retry the failed job. The smoke test check `CheckSmokeTestsForErrors` has an allowlist for known patterns, but some error codes like `E_FAIL` may occasionally slip through. + +**Note**: The profiler only logs these warnings every 100 failures to avoid log spam, so seeing this message indicates multiple stack walking attempts have failed. + ### Build Failures Build failures typically show: From bc4c8df87498f085047589cd6b5b64f7890fad39 Mon Sep 17 00:00:00 2001 From: Lucas Pimentel Date: Fri, 24 Oct 2025 15:49:59 -0400 Subject: [PATCH 3/3] [docs] Improve CI troubleshooting documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Enhanced the CI troubleshooting guide with: - More detailed guidance on finding and analyzing failed jobs - Step-by-step instructions for downloading and examining logs - Common failure patterns and their solutions - Better structure and navigation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../CI/TroubleshootingCIFailures.md | 73 +++++++++++++++++++ 1 file changed, 73 insertions(+) diff --git a/docs/development/CI/TroubleshootingCIFailures.md b/docs/development/CI/TroubleshootingCIFailures.md index a14345401f9e..06e03137d2a8 100644 --- a/docs/development/CI/TroubleshootingCIFailures.md +++ b/docs/development/CI/TroubleshootingCIFailures.md @@ -54,6 +54,37 @@ The human-readable build URL format is: https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId= ``` +### Using Azure DevOps MCP (AI Assistant Integration) + +If you're using an AI assistant with the Azure DevOps MCP server, you can use these tools for cleaner queries: + +#### Get build information +Ask your assistant to use `mcp__azure-devops__pipelines_get_builds` with: +- `project: "dd-trace-dotnet"` +- `buildIds: []` + +This returns structured build data including status, result, queue time, and trigger information. + +#### Get build logs +Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log` with: +- `project: "dd-trace-dotnet"` +- `buildId: ` + +Note: Large builds may have very large logs that exceed token limits. In that case, fall back to curl/jq to target specific log IDs. 
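+
+For example, a minimal curl/jq sketch to list the individual log IDs for a build (it reuses the REST endpoint shown earlier in this guide; the `lineCount` field gives a rough idea of each log's size):
+
+```bash
+# List all logs for a build with their IDs and line counts,
+# then fetch a single log with the curl commands shown above
+curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<build-id>/logs?api-version=7.0" \
+  | jq -r '.value[] | "log \(.id): \(.lineCount // 0) lines"'
+```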
+ +#### Get specific log by ID +Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log_by_id` with: +- `project: "dd-trace-dotnet"` +- `buildId: ` +- `logId: ` +- Optional: `startLine` and `endLine` to limit output + +**Advantages of MCP approach:** +- Structured JSON responses (no manual parsing) +- Works naturally in conversation with AI assistants +- Handles authentication automatically +- Can combine multiple queries in a single request + ## Investigating Test Failures ### Find failed tasks in a build @@ -259,6 +290,34 @@ Connection reset by peer **Solution**: Retry the failed job. These are typically transient network issues. +#### Identifying Flaky Tests and Retry Attempts + +Azure DevOps automatically retries some failed stages. You can identify retried tasks in the build timeline: + +**Using curl/jq:** +```bash +curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds//timeline" \ + | jq -r '.records[] | select(.previousAttempts != null and (.previousAttempts | length) > 0) | "\(.name): attempt \(.attempt), previous attempts: \(.previousAttempts | length)"' +``` + +**Using Azure DevOps MCP:** +Ask your assistant to check the build timeline for tasks with `previousAttempts` or `attempt > 1`. + +**What this means:** +- `"attempt": 2` with `"result": "succeeded"` → The task failed initially but passed on retry (likely a flake) +- `"previousAttempts": [...]` → Contains IDs of previous failed attempts + +**When you see retried tasks:** +1. If a task succeeded on retry after an initial failure, it's likely a flaky/intermittent issue +2. The overall build result may still show as "failed" even if the retry succeeded, depending on pipeline configuration +3. Check if the failure pattern is known (see "Flaky Profiler Stack Walking Failures" below) + +**How to retry a failed job:** +1. Open the build in Azure DevOps: `https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=` +2. Find the failed stage/job +3. Click the "..." menu → "Retry failed stages" or "Retry stage" +4. Only failed stages will be retried; successful stages are not re-run + ### Unit Test Failures Failed unit tests typically appear in logs as: @@ -328,6 +387,20 @@ Common verification failures: ## Example Investigation Workflow +### Quick Investigation (AI Assistant with MCP) + +If you're using an AI assistant with Azure DevOps MCP: + +``` +"Why did Azure DevOps build fail?" +``` + +The assistant will: +1. Get build information using `mcp__azure-devops__pipelines_get_builds` +2. Identify the result and any failed stages +3. Check for retry attempts to identify flaky tests +4. Provide guidance on whether to retry or investigate further + ### Quick Investigation (GitHub CLI) ```bash