docs/development/CI/TroubleshootingCIFailures.md (264 additions, 1 deletion)
The human-readable build URL format is:

```
https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=<BUILD_ID>
```

### Using Azure DevOps MCP (AI Assistant Integration)

If you're using an AI assistant with the Azure DevOps MCP server, you can use these tools for cleaner queries:

#### Get build information
Ask your assistant to use `mcp__azure-devops__pipelines_get_builds` with:
- `project: "dd-trace-dotnet"`
- `buildIds: [<BUILD_ID>]`

This returns structured build data including status, result, queue time, and trigger information.

#### Get build logs
Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log` with:
- `project: "dd-trace-dotnet"`
- `buildId: <BUILD_ID>`

Note: Logs for large builds can exceed token limits. In that case, fall back to curl/jq and target specific log IDs.
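
If you do need to fall back, listing the build's logs first lets you target just the relevant one. A minimal sketch using the same REST endpoint family as the curl examples below:

```bash
# List the logs attached to a build, with their sizes, so you can fetch only the one you need
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs" \
  | jq -r '.value[] | "log \(.id): \(.lineCount) lines"'
```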

#### Get specific log by ID
Ask your assistant to use `mcp__azure-devops__pipelines_get_build_log_by_id` with:
- `project: "dd-trace-dotnet"`
- `buildId: <BUILD_ID>`
- `logId: <LOG_ID>`
- Optional: `startLine` and `endLine` to limit output

**Advantages of MCP approach:**
- Structured JSON responses (no manual parsing)
- Works naturally in conversation with AI assistants
- Handles authentication automatically
- Can combine multiple queries in a single request

## Investigating Test Failures

### Find failed tasks in a build
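
The build timeline lists every task with its result and log ID. A minimal query, mirroring the timeline filter used in the workflows later in this guide:

```bash
# List failed tasks and the log ID for each
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/timeline" \
  | jq -r '.records[] | select(.result == "failed") | "\(.name): log.id=\(.log.id)"'
```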

### Download and search logs

#### Option 1: Using curl (works everywhere)

```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
| grep -i "fail\|error"
```

#### Option 2: Using Azure CLI (recommended on Windows)

```bash
az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>?api-version=7.0" \
2>&1 | grep -i "fail\|error"
```

Note: You may see a warning about authentication - this is safe to ignore for public builds.

#### Option 3: Using GitHub CLI for quick overview

```bash
# Get quick summary of all checks for a PR
gh pr checks <PR_NUMBER>

# Get detailed PR status including links to Azure DevOps
gh pr view <PR_NUMBER> --json statusCheckRollup
```
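
The `statusCheckRollup` JSON can be filtered down to just the failing checks. A sketch (Azure DevOps checks surface as commit statuses and GitHub Actions as check runs, so both field shapes are handled with fallbacks):

```bash
# List failing checks with their links; field names differ between check runs and statuses
gh pr view <PR_NUMBER> --json statusCheckRollup \
  | jq -r '.statusCheckRollup[]
           | select((.conclusion // .state) == "FAILURE")
           | "\(.name // .context): \(.detailsUrl // .targetUrl)"'
```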

### Get detailed context around failures

Using curl:

```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
| grep -A 30 "TestName.That.Failed"
```

Or with Azure CLI:

```bash
az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>?api-version=7.0" \
2>&1 | grep -A 30 "TestName.That.Failed"
```

## Mapping Commits to Builds

Azure DevOps builds test **merge commits** (`refs/pull/<PR_NUMBER>/merge`), not branch commits directly.
To find which branch commit caused a failure, compare build queue times with commit push times: the build queued shortly after a commit was pushed is most likely testing that commit.
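
One way to line builds up with commits, assuming you know the PR number and branch name (a sketch reusing the `az pipelines runs list` pattern shown later in this guide):

```bash
# Builds that ran against the PR's merge ref, with queue times
az pipelines runs list \
  --organization https://dev.azure.com/datadoghq \
  --project dd-trace-dotnet \
  --branch "refs/pull/<PR_NUMBER>/merge" \
  --top 5 \
  --query "[].{id:id, queueTime:queueTime, result:result}" \
  --output table

# Commit times on your branch (a rough proxy for push times), to match against the queue times above
git log --format="%h %cI %s" -5 <BRANCH_NAME>
```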

## Determining If Failures Are Related to Your Changes

When tests fail on master after your PR is merged, determine if failures are new or pre-existing:

### Compare with previous build on master

```bash
# List recent builds on master
az pipelines runs list \
--organization https://dev.azure.com/datadoghq \
--project dd-trace-dotnet \
--branch master \
--top 10 \
--query "[].{id:id, result:result, sourceVersion:sourceVersion, finishTime:finishTime}" \
--output table

# Find the build for the commit before yours
git log --oneline -2  # Identify your commit and the previous one

# Compare failed tasks between builds
# Your build:
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<YOUR_BUILD_ID>/timeline" \
| jq -r '.records[] | select(.result == "failed") | .name'

# Previous build:
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<PREVIOUS_BUILD_ID>/timeline" \
| jq -r '.records[] | select(.result == "failed") | .name'
```

**New failures** only appear in your build → likely related to your changes.
**Same failures** appear in both → likely pre-existing/flaky tests.
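
To make the comparison mechanical, you can diff the two lists directly (a sketch built from the same timeline queries):

```bash
# Failures unique to your build show up prefixed with ">"
diff \
  <(curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<PREVIOUS_BUILD_ID>/timeline" \
      | jq -r '.records[] | select(.result == "failed") | .name' | sort) \
  <(curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<YOUR_BUILD_ID>/timeline" \
      | jq -r '.records[] | select(.result == "failed") | .name' | sort)
```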

### Master-only tests

Some tests (profiler integration tests, exploration tests) run only on the master branch, not on PRs. If you see failures on master that you didn't see on your PR:

1. This is expected - those tests don't run on PRs
2. Compare with the previous successful master build to confirm they're new
3. If they are new, the failures are likely related to your changes

## Understanding Test Infrastructure

When test failures don't make obvious sense, investigate the test infrastructure to understand how tests are configured.

### Finding test configuration

Tests may set up the environment differently from production code. For example:

```bash
# Find how a specific test sets up environment variables
grep -r "DD_DOTNET_TRACER_HOME\|DD_TRACE_ENABLED" profiler/test/

# Look for test helper classes
find . -name "*EnvironmentHelper*.cs" -o -name "*TestRunner*.cs"

# Check what environment variables a test actually sets
# Read the test code path from failing test name:
# Example: Datadog.Profiler.SmokeTests.WebsiteAspNetCore01Test.CheckSmoke
# Path: profiler/test/Datadog.Profiler.IntegrationTests/SmokeTests/WebsiteAspNetCore01Test.cs
```

**Common gotchas:**
- Profiler tests may disable the tracer (`DD_TRACE_ENABLED=0`)
- Different test suites (tracer vs profiler) have different configurations
- Test environment may not match production deployment

### Cross-cutting test failures

Changes in one component may affect tests for another component:

- **Managed tracer changes** may affect profiler tests (they share the managed loader)
- **Native changes** may affect managed tests (if they change initialization order)
- **Environment variable handling** may affect both tracer and profiler

**Investigation strategy:**
1. Identify which component the failing test is for (tracer, profiler, debugger, etc.)
2. Compare with your changes - do they touch shared infrastructure? (see the sketch after this list)
3. Check if test configuration differs from production (e.g., disabled features)
4. Trace through initialization code to find the interaction point
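
A quick way to answer step 2, assuming your branch is based on `origin/master` (a sketch using plain git):

```bash
# Summarize which areas of the repository your changes touch
git diff --name-only origin/master...HEAD \
  | cut -d/ -f1-2 | sort | uniq -c | sort -rn
```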

### Tracing error messages to source code

When you find an error message in logs, trace it back to source code:

```bash
# Search for the error message across the codebase
grep -r "One or multiple services failed to start" .

# Example output:
# profiler/src/ProfilerEngine/Datadog.Profiler.Native/CorProfilerCallback.cpp:710
# Log::Error("One or multiple services failed to start after a delay...");
```

This helps you understand:
- Which component is logging the error (native/managed, tracer/profiler)
- The context of the failure (initialization, shutdown, runtime)
- Related code that might be affected

## Common Test Failure Patterns

### Infrastructure Failures (Not Your Code)

Some failures are infrastructure-related and can be retried without code changes:

#### Docker Rate Limiting

```
toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
```

**Solution**: Retry the failed job in Azure DevOps. This is a transient Docker Hub rate limit issue.

#### Timeout/Network Issues

```
##[error]The job running on runner X has exceeded the maximum execution time
TLS handshake timeout
Connection reset by peer
```

**Solution**: Retry the failed job. These are typically transient network issues.

#### Identifying Flaky Tests and Retry Attempts

Azure DevOps automatically retries some failed stages. You can identify retried tasks in the build timeline:

**Using curl/jq:**
```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/timeline" \
| jq -r '.records[] | select(.previousAttempts != null and (.previousAttempts | length) > 0) | "\(.name): attempt \(.attempt), previous attempts: \(.previousAttempts | length)"'
```

**Using Azure DevOps MCP:**
Ask your assistant to check the build timeline for tasks with `previousAttempts` or `attempt > 1`.

**What this means:**
- `"attempt": 2` with `"result": "succeeded"` → The task failed initially but passed on retry (likely a flake)
- `"previousAttempts": [...]` → Contains IDs of previous failed attempts

**When you see retried tasks:**
1. If a task succeeded on retry after an initial failure, it's likely a flaky/intermittent issue
2. The overall build result may still show as "failed" even if the retry succeeded, depending on pipeline configuration
3. Check if the failure pattern is known (see "Flaky Profiler Stack Walking Failures" below)
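
To list every retried task with its attempt number and final result in one pass (a variation of the jq query above):

```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/timeline" \
  | jq -r '.records[] | select(.attempt > 1) | "\(.name): attempt \(.attempt) -> \(.result)"'
```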

**How to retry a failed job:**
1. Open the build in Azure DevOps: `https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=<BUILD_ID>`
2. Find the failed stage/job
3. Click the "..." menu → "Retry failed stages" or "Retry stage"
4. Only failed stages will be retried; successful stages are not re-run

### Unit Test Failures

Failed unit tests are reported in the test task logs; search for the failing test name and its assertion output.

### Integration Test Failures

Integration test failures may indicate an issue with the specific service or scenario being exercised; check the integration test logs for details about which service or scenario failed.

#### Flaky Profiler Stack Walking Failures (Alpine/musl)

**Symptom:**
```
Failed to walk N stacks for sampled exception: E_FAIL (80004005)
```
or
```
Failed to walk N stacks for sampled exception: CORPROF_E_STACKSNAPSHOT_UNSAFE
```

**Appears in**: Smoke tests on Alpine Linux (musl libc), particularly `installer_smoke_tests` → `linux alpine_3_1-alpine3_14`

**Cause**: Race condition in the profiler when unwinding call stacks while threads are running. This is a known limitation on Alpine/musl platforms and appears intermittently.

**Solution**: Retry the failed job. The smoke test check `CheckSmokeTestsForErrors` has an allowlist for known patterns, but some error codes like `E_FAIL` may occasionally slip through.

**Note**: The profiler only logs these warnings every 100 failures to avoid log spam, so seeing this message indicates multiple stack walking attempts have failed.
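
To gauge how often a given smoke-test log hit this pattern, a sketch against the same logs endpoint (adjust the pattern if the error text changes):

```bash
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/<BUILD_ID>/logs/<LOG_ID>" \
  | grep -c "Failed to walk .* stacks for sampled exception"
```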

### Build Failures

Build failures typically show:
Common verification failures:

## Example Investigation Workflow

### Quick Investigation (AI Assistant with MCP)

If you're using an AI assistant with Azure DevOps MCP:

```
"Why did Azure DevOps build <BUILD_ID> fail?"
```

The assistant will:
1. Get build information using `mcp__azure-devops__pipelines_get_builds`
2. Identify the result and any failed stages
3. Check for retry attempts to identify flaky tests
4. Provide guidance on whether to retry or investigate further

### Quick Investigation (GitHub CLI)

```bash
# 1. Get quick overview of all checks
gh pr checks <PR_NUMBER>

# 2. If Azure DevOps checks failed, check the logs directly
# Get the build ID from the Azure DevOps URL in the output above, then:
BUILD_ID=<build_id_from_checks>
az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/<LOG_ID>?api-version=7.0" \
2>&1 | grep -i "error\|fail\|toomanyrequests"
```

### Detailed Investigation

```bash
# 1. Find your PR number
gh pr list --head <BRANCH_NAME>
# 2. Get the build ID for your PR's build (from the checks output or the Azure DevOps URL)
BUILD_ID=<your_build_id>
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/timeline" \
| jq -r '.records[] | select(.result == "failed") | "\(.name): log.id=\(.log.id)"'

# 4. Download and examine logs
# 4. Download and examine logs (choose one method)
LOG_ID=<log_id_from_above>

# Using curl:
curl -s "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}" \
| grep -A 30 "FAIL"

# Or using Azure CLI (Windows):
az rest --url "https://dev.azure.com/datadoghq/a51c4863-3eb4-4c5d-878a-58b41a049e4e/_apis/build/builds/${BUILD_ID}/logs/${LOG_ID}?api-version=7.0" \
2>&1 | grep -A 30 "FAIL"

# 5. Open build in browser for full details
open "https://dev.azure.com/datadoghq/dd-trace-dotnet/_build/results?buildId=${BUILD_ID}"
```