Check duplicate issues.
Description
In the EIC reconstruction framework CI we encounter TSAN reports of data races (e.g. in https://github.com/eic/EICrecon/actions/runs/25331418512/job/74270527718 after the merge of eic/EICrecon#2469). These involve v1.0.1.0 RNTuple files generated by DD4hep simulations in the EDM4hep (podio data model) format, written with ROOT v6.38.00.
The issue appears to be in the reader, inside our concurrent EICrecon reconstruction framework built on JANA2, where multiple threads read events.
After encountering the issue, I asked Copilot to minimize it out of our full stack; it came up with the reproducer below (gratuitous comments included).
Reproducer
repro_rclusterpool_race.cpp
To be compiled and run as follows (at least in our environments):
g++ -fsanitize=thread -std=c++17 -O1 $(root-config --cflags --libs) -lROOTNTuple -o repro repro_rclusterpool_race.cpp
TSAN_OPTIONS="halt_on_error=0" setarch $(uname -m) -R ./repro
(thread sanitizer enabled, halt_on_error disabled so execution continues past the first report, and setarch -R to disable ASLR)
Output (only first data race report included):
$ TSAN_OPTIONS="halt_on_error=0" setarch $(uname -m) -R ./repro
Writing RNTuple (~16 clusters, 1 entry each)... done.
Reading back (this may trigger a TSAN report)...==================
WARNING: ThreadSanitizer: data race (pid=3371455)
Read of size 8 at 0x72040000af20 by main thread:
#0 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115 (libtsan.so.2+0x82708) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
#1 memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:107 (libtsan.so.2+0x82708)
#2 ROOT::Internal::RPageSource::UnsealPage(ROOT::Internal::RPageStorage::RSealedPage const&, ROOT::Internal::RColumnElementBase const&, ROOT::Internal::RPageAllocator&) <null> (libROOTNTuple.so.6.38+0x2029cf) (BuildId: 9d3fc48eb8a60aad7661d81e7b6cc979d8db5a8c)
#3 __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 (libc.so.6+0x29ca7) (BuildId: 58749c528985eab03e6700ebc1469fa50aa41219)
Previous write of size 8 at 0x72040000af20 by thread T1:
#0 pread64 ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1025 (libtsan.so.2+0x5a0ed) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
#1 ROOT::Internal::RRawFileUnix::ReadAtImpl(void*, unsigned long, unsigned long) <null> (libRIO.so.6.38+0x126ab4) (BuildId: 49074ce67f254ea5495c58bbaf8f9c6ab8f0ffe2)
Location is heap block of size 16 at 0x72040000af20 allocated by thread T1:
#0 operator new[](unsigned long) ../../../../src/libsanitizer/tsan/tsan_new_delete.cpp:70 (libtsan.so.2+0x9c5f6) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
#1 ROOT::Internal::RPageSourceFile::PrepareSingleCluster(ROOT::Internal::RCluster::RKey const&, std::vector<ROOT::Internal::RRawFile::RIOVec, std::allocator<ROOT::Internal::RRawFile::RIOVec> >&) <null> (libROOTNTuple.so.6.38+0x21a8c4) (BuildId: 9d3fc48eb8a60aad7661d81e7b6cc979d8db5a8c)
Thread T1 (tid=3371461, running) created by main thread at:
#0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1022 (libtsan.so.2+0x568a6) (BuildId: 99ef88596cb10a8ccf307fe1a9070f66a44b1624)
#1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) <null> (libstdc++.so.6+0xe12f8) (BuildId: 133b71e0013695cc7832680a74edb51008c4fc4c)
#2 __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 (libc.so.6+0x29ca7) (BuildId: 58749c528985eab03e6700ebc1469fa50aa41219)
SUMMARY: ThreadSanitizer: data race ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115 in memcpy
==================
ROOT version
------------------------------------------------------------------
| Welcome to ROOT 6.38.00 https://root.cern |
| (c) 1995-2025, The ROOT Team; conception: R. Brun, F. Rademakers |
| Built for linuxx8664gcc on Apr 26 2026, 19:35:33 |
| From tags/6-38-00@6-38-00 |
| With g++ (Debian 14.2.0-19) 14.2.0 std202002 |
| Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q' |
------------------------------------------------------------------
Installation method
Spack
Operating system
Linux
Additional context
Copilot session with all its details
Appendix: RClusterPool Data Race — Root-Cause Analysis and Proposed Fix
Related issues:
1. Summary
ROOT::Internal::RClusterPool has a data race between the background I/O
thread (T_io, started by RClusterPool::StartBackgroundThread()) and the
consumer thread (the thread that calls GetCluster()). The race is triggered
whenever an RNTuple is read sequentially and manifests as a
write/write race on allocator-managed memory detected by ThreadSanitizer.
The bug is present in ROOT 6.38.00 and in current master (despite the partial
mitigation in commit ecf6205ce54).
2. How the Bug Is Triggered in EICrecon
EICrecon added RNTuple output in PR
eic/EICrecon#2579
(merged 2026-04-27). The TSAN CI job did not fire on that PR because the
simulation-file cache was still valid (old TTree files were re-used). One
week later a containers update (10733ab8) changed the geometry hash, forcing
npsim to regenerate simulation files in RNTuple format. Reading those
RNTuple files in the TSAN CI job then triggered the race.
3. The Race — Step-by-Step
Scheduling
With kDefaultClusterBunchSize = 1, a single call to GetCluster(N) enqueues
two cluster bunches into the work queue:
| Bunch | Cluster ID | bunchId |
|-------|------------|---------|
| N     | N          | B       |
| N+1   | N+1        | B+1     |
Execution timeline
consumer: GetCluster(N)
→ enqueue [cluster_N/bunchB, cluster_N+1/bunchB+1] (lock held)
→ fPool.erase(cluster_N-1) (NO lock) ← eviction
→ WaitFor(N): blocks on future_N
T_io: wakes (one cond_var notify)
┌──────────────────────────────────────────────────────────────┐
│ inner while (!readItems.empty()) — NO lock between bunches │
│ │
│ Iteration 1 (bunchB): │
│ LoadClusters({cluster_N}) │
│ → RCluster::Adopt(pageMap) │
│ → fOnDiskPages.insert() → operator new(node_A) ←(1) │
│ future_N.set_value(cluster_N) ←(HB) │
│ │
│ Iteration 2 (bunchB+1): ← NO lock re-acquired here │
│ LoadClusters({cluster_N+1}) │
│ → RCluster::Adopt(pageMap) │
│ → fOnDiskPages.insert() → operator new(node_?) ←(2) │
└──────────────────────────────────────────────────────────────┘
consumer: WaitFor(N) unblocks (HB from T_io's set_value)
GetCluster(N+1)
→ fPool.erase(cluster_N) (NO lock)
→ ~RCluster() → ~unordered_map()
→ operator delete(node_A) ←(3)
The race
Steps (2) and (3) are concurrent with no synchronisation:
- T_io (2) writes to node address A (initialising the hash-map node for cluster_N+1's page map). This happens after the set_value HB boundary, so the consumer's clock does not observe it.
- consumer (3) writes to address A via operator delete (freeing cluster_N's hash-map nodes, which the allocator recycles back to A for T_io's next malloc).
ThreadSanitizer reports:
WARNING: ThreadSanitizer: data race
Write of size 8 by main thread:
#0 operator delete
#N ROOT::Internal::RCluster::Adopt(ROOT::Internal::RCluster&&)
#N ROOT::Internal::RClusterPool::WaitFor(...)
#N ROOT::Internal::RClusterPool::GetCluster(...)
Previous write of size 8 by thread T_io:
#0 operator new
#N std::_Hashtable::_M_insert_unique(...)
#N ROOT::Internal::RCluster::Adopt(std::unique_ptr<ROnDiskPageMap>)
#N ROOT::Internal::RClusterPool::ExecReadClusters()
4. Why the Existing Mitigation (ecf6205ce54) Is Insufficient
Commit ecf6205ce54 changed RClusterPool to start T_io lazily (on the
first GetCluster() call) rather than eagerly in the constructor.
This accidentally avoids the race for short-lived readers (e.g., EICrecon's
non-events categories read in Finish() with only a handful of clusters) because:
- T_io processes all clusters in a single LoadClusters batch before the consumer gets back to GetCluster().
- T_io goes idle.
- The consumer's eviction (fPool.erase) executes; the consumer then calls GetCluster() again, which wakes T_io via notify_one, establishing a happens-before edge.
However, the structural cause — ExecReadClusters looping across bunch
boundaries without re-acquiring fLockWorkQueue — remains. For any reader
with more than kDefaultClusterBunchSize clusters in flight, the race window
is still present.
5. Proposed Fix
The root cause is that GetCluster() schedules two bunches per call
(2 * fClusterBunchSize clusters), causing ExecReadClusters to process
bunch N+1 without synchronising with the consumer's eviction of bunch N.
Minimal fix: schedule only one bunch per GetCluster() call.
--- a/tree/ntuple/src/RClusterPool.cxx
+++ b/tree/ntuple/src/RClusterPool.cxx
@@ -207,8 +207,7 @@ ROOT::Internal::RCluster *ROOT::Internal::RClusterPool::GetCluster(
- for (ROOT::DescriptorId_t i = 0, next = clusterId; i < 2 * fClusterBunchSize; ++i) {
- if (i == fClusterBunchSize)
- provideInfo.fBunchId = ++fBunchId;
+ for (ROOT::DescriptorId_t i = 0, next = clusterId; i < fClusterBunchSize; ++i) {
With fClusterBunchSize = 1 (the default), this schedules exactly one cluster
per GetCluster() call. ExecReadClusters delivers it and goes idle. The
consumer's eviction runs, then notify_one wakes T_io for the next call —
establishing the required happens-before edge.
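To make the happens-before argument concrete, here is a minimal, ROOT-free sketch of the mechanism the fix relies on (hypothetical names, not ROOT code): the consumer evicts, then enqueues exactly one request under the work-queue mutex and notifies; because the worker dequeues under the same mutex, everything the worker subsequently does for that request, including its allocations, happens-after the eviction.
#include <condition_variable>
#include <deque>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// A "cluster" stands in for RCluster: just some heap allocations.
using Cluster = std::vector<std::unique_ptr<int>>;

struct Request {
   int clusterId = 0;
   std::promise<std::unique_ptr<Cluster>> delivery;
};

int main()
{
   std::mutex lockWorkQueue;
   std::condition_variable cvHasWork;
   std::deque<Request> workQueue;
   bool done = false;

   std::thread tIo([&] { // background I/O thread (T_io)
      for (;;) {
         Request req;
         {
            std::unique_lock lk(lockWorkQueue);
            cvHasWork.wait(lk, [&] { return done || !workQueue.empty(); });
            if (workQueue.empty())
               return;
            req = std::move(workQueue.front());
            workQueue.pop_front();
         }
         // These allocations happen-after everything the consumer did before
         // enqueueing this request, in particular its eviction of the previous
         // cluster, via the mutex release/acquire pair.
         auto cluster = std::make_unique<Cluster>();
         cluster->emplace_back(std::make_unique<int>(req.clusterId));
         req.delivery.set_value(std::move(cluster)); // deliver, then go idle
      }
   });

   std::unique_ptr<Cluster> current; // consumer-side pool holding one cluster
   for (int clusterId = 0; clusterId < 4; ++clusterId) {
      current.reset(); // eviction (operator delete), before the next enqueue
      Request req;
      req.clusterId = clusterId;
      auto future = req.delivery.get_future();
      {
         std::lock_guard lk(lockWorkQueue);
         workQueue.push_back(std::move(req)); // exactly one bunch per call
      }
      cvHasWork.notify_one();  // HB edge: eviction precedes the worker's work
      current = future.get();  // WaitFor(): delivery happens-before this point
   }

   {
      std::lock_guard lk(lockWorkQueue);
      done = true;
   }
   cvHasWork.notify_one();
   tIo.join();
   return 0;
}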
Correctness argument
| Property | Before fix | After fix |
|----------|------------|-----------|
| Bunches per GetCluster() | 2 | 1 |
| T_io idle between consumer calls | No (inner loop) | Yes |
| HB: consumer eviction → T_io next alloc | Missing | Via cond_var notify/wait |
| Lookahead depth (default settings) | 4 clusters | 2 clusters |
The lookahead depth halves (from 2 × bunchSize × nThreads to
bunchSize × nThreads), which is a small performance trade-off for
correctness. The fBunchId member and its increment can be removed as a
follow-up cleanup.
6. Reproducers
Two minimal reproducers are provided in
tree/ntuple/test/ of the ROOT master worktree.
6a. Standalone C++ reproducer (repro_rclusterpool_race.cpp)
Uses the public RNTupleWriter/RNTupleReader API. Writes 16 entries with a
per-entry cluster budget, then reads them back; each cluster boundary exercises
the race window.
Compile and run (ASLR must be disabled to avoid TSAN shadow-map conflicts):
g++ -fsanitize=thread -std=c++17 -O1 \
$(root-config --cflags --libs) -lROOTNTuple \
-o repro repro_rclusterpool_race.cpp
TSAN_OPTIONS="halt_on_error=0" setarch $(uname -m) -R ./repro
Key write options required to avoid a "page buffer memory budget too small" error:
// Page-buffer budget = 2 × ApproxZippedClusterSize; initial page size must fit.
opts.SetInitialUnzippedPageSize(sizeof(double)); // 8 bytes = 1 element/page
opts.SetApproxZippedClusterSize(sizeof(double)); // 8 bytes ≈ 1 entry/cluster
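For completeness, a minimal sketch of what such a reproducer might look like using only the public API (the actual repro_rclusterpool_race.cpp from the Copilot session may differ in details; field and file names here are illustrative):
// Sketch of repro_rclusterpool_race.cpp; assumes ROOT >= 6.36 (RNTuple in ROOT::).
#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleReader.hxx>
#include <ROOT/RNTupleWriteOptions.hxx>
#include <ROOT/RNTupleWriter.hxx>

#include <cstdio>
#include <utility>

int main()
{
   constexpr int kNEntries = 16;
   const char *kFile = "repro_rclusterpool_race.root";

   // Write: ~16 clusters, 1 entry each.
   {
      auto model = ROOT::RNTupleModel::Create();
      auto fldX = model->MakeField<double>("x");

      ROOT::RNTupleWriteOptions opts;
      // Page-buffer budget = 2 × ApproxZippedClusterSize; initial page size must fit.
      opts.SetInitialUnzippedPageSize(sizeof(double)); // 8 bytes = 1 element/page
      opts.SetApproxZippedClusterSize(sizeof(double)); // 8 bytes ≈ 1 entry/cluster

      auto writer = ROOT::RNTupleWriter::Recreate(std::move(model), "ntpl", kFile, opts);
      for (int i = 0; i < kNEntries; ++i) {
         *fldX = i;
         writer->Fill(); // tiny cluster budget: each entry ends up in its own cluster
      }
   } // writer destructor commits the dataset
   printf("Writing RNTuple (~16 clusters, 1 entry each)... done.\n");

   // Read back: each cluster boundary exercises the RClusterPool window.
   printf("Reading back (this may trigger a TSAN report)...");
   {
      auto reader = ROOT::RNTupleReader::Open("ntpl", kFile);
      auto viewX = reader->GetView<double>("x");
      double sum = 0;
      for (auto i : reader->GetEntryRange())
         sum += viewX(i);
      printf(" done (sum = %g).\n", sum);
   }
   return 0;
}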
6b. GTest regression test (ntuple_cluster_race.cxx)
Uses an internal mock (RPageSourceSlowMock) that sleeps 20 ms on every
LoadClusters() call after the first. The sleep holds T_io inside bunch N+1
long enough for the consumer to evict cluster N, opening the race window.
Register in tree/ntuple/test/CMakeLists.txt:
ROOT_ADD_GTEST(ntuple_cluster_race ntuple_cluster_race.cxx LIBRARIES ROOTNTuple)
Run with TSAN:
cmake -DCMAKE_CXX_FLAGS="-fsanitize=thread" ...
ctest -R ClusterPool_NoRaceBetweenEvictionAndPrefetch
Note on TSAN detection reliability: the race manifests when the system
allocator reuses freed node addresses across threads. With glibc's per-thread
tcache enabled (default, glibc ≥ 2.26), cross-thread reuse may be suppressed
for small allocations. For reliable triggering either build ROOT with TSAN
(so the full allocation path is instrumented) or use jemalloc with
MALLOC_CONF=tcache:false.
7. Short-Term Workaround for EICrecon
Until the ROOT fix lands in a container release, add a TSAN suppression to
EICrecon/.github/tsan.supp:
race:ROOT::Internal::RCluster::Adopt
This suppresses the race report without affecting normal test output.
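Note that the suppression file only takes effect if it is passed to ThreadSanitizer at run time via the suppressions option (illustrative invocation; the exact CI wiring may differ):
TSAN_OPTIONS="suppressions=$PWD/.github/tsan.supp halt_on_error=0" ./repro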