Skip to content

Conversation

@felixge
Copy link
Member

@felixge felixge commented Dec 5, 2025

What does this PR do?

Demonstrated a map write race when using partial flush. Explanation (AI-generated) below:

In spancontext.go (lines 624, 627), the partial flush code path:

620:636:ddtrace/tracer/spancontext.go
	}
	telemetry.Distribution(telemetry.NamespaceTracers, "trace_partial_flush.spans_closed", nil).Submit(float64(len(finishedSpans)))
	telemetry.Distribution(telemetry.NamespaceTracers, "trace_partial_flush.spans_remaining", nil).Submit(float64(len(leftoverSpans)))
	finishedSpans[0].setMetric(keySamplingPriority, *t.priority)
	if s != t.spans[0] {
		// Make sure the first span in the chunk has the trace-level tags
		t.setTraceTags(finishedSpans[0])
	}
	if tr, ok := tr.(*tracer); ok {
		t.finishChunk(tr, &chunk{
			spans:    finishedSpans,
			willSend: decisionKeep == samplingDecision(atomic.LoadUint32((*uint32)(&t.samplingDecision))),
		})
	}
	t.spans = leftoverSpans
}

Here, finishedSpans[0].setMetric(...) and t.setTraceTags(finishedSpans[0]) call setMeta/setMetric on finishedSpans[0], which may be a different span than s. While s (the span passed to finishedOne) has its mutex held by the caller, finishedSpans[0] does not have its mutex held.

The flow is:

  1. Span A calls finish() → locks A.mu
  2. finishedOne(A) is called
  3. Partial flush triggers if multiple spans are finished
  4. finishedSpans[0] might be span B (an earlier finished span)
  5. setMetric and setTraceTags modify span B without holding B’s mutex
    Note that setMeta and setMetric do not check the finished flag - they directly modify the maps. So even though the spans are marked finished, this is still a potential data race if any reader accesses the span data concurrently.

Motivation

Internal incident, see JIRA ticket. But this seems to be another bug, unrelated to the incident.

Reviewer's Checklist

✨ claude-4.5-opus-high wrote this test case - should probably be cleaned up before we merge this.

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • New code is free of linting errors. You can check this by running ./scripts/lint.sh locally.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.

Unsure? Have a question? Request a review!

@pr-commenter
Copy link

pr-commenter bot commented Dec 5, 2025

Benchmarks

Benchmark execution time: 2025-12-05 15:49:09

Comparing candidate commit 20aec51 in PR branch PROF-13205/felix.geisendoerfer/partial-flush-race with baseline commit 1fe6e5e in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 0 unstable metrics.

@codecov
Copy link

codecov bot commented Dec 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.06%. Comparing base (1fe6e5e) to head (20aec51).

Additional details and impacted files

see 71 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@datadog-datadog-prod-us1
Copy link

datadog-datadog-prod-us1 bot commented Dec 5, 2025

⚠️ Tests

⚠️ Warnings

❄️ 9 New flaky tests detected

TestPartialFlushRaceCondition from github.com/DataDog/dd-trace-go/v2/ddtrace/tracer (Datadog)
Failed

=== RUN   TestPartialFlushRaceCondition
=== RUN   TestPartialFlushRaceCondition/#00
==================
Write at 0x00c0007c4cf0 by goroutine 7330:
  runtime.mapdelete_faststr()
      /opt/hostedtoolcache/go/1.25.5/x64/src/internal/runtime/maps/runtime_faststr_swiss.go:401 +0x0
  github.com/DataDog/dd-trace-go/v2/ddtrace/tracer.(*Span).setMetric()
      /home/runner/work/dd-trace-go/dd-trace-go/ddtrace/tracer/span.go:565 +0x13d
...
TestPartialFlushRaceCondition/#00 from github.com/DataDog/dd-trace-go/v2/ddtrace/tracer (Datadog)
Failed

=== RUN   TestPartialFlushRaceCondition/#00
==================
Write at 0x00c0007c4cf0 by goroutine 7330:
  runtime.mapdelete_faststr()
      /opt/hostedtoolcache/go/1.25.5/x64/src/internal/runtime/maps/runtime_faststr_swiss.go:401 +0x0
  github.com/DataDog/dd-trace-go/v2/ddtrace/tracer.(*Span).setMetric()
      /home/runner/work/dd-trace-go/dd-trace-go/ddtrace/tracer/span.go:565 +0x13d
  github.com/DataDog/dd-trace-go/v2/ddtrace/tracer.(*trace).finishedOne()
...
TestPartialFlushRaceCondition/#01 from github.com/DataDog/dd-trace-go/v2/ddtrace/tracer (Datadog)
Failed

=== RUN   TestPartialFlushRaceCondition/#01
View all

ℹ️ Info

🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 20aec51 | Docs | Datadog PR Page | Was this helpful? Give us feedback!

@felixge
Copy link
Member Author

felixge commented Dec 5, 2025

This PR is bad, closing this for now. Will reopen if I can fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants