Context
Follow-up to #91. Increasing host_buffer_waves from 2 to 8 had no effect on throughput (73.3 sps before and after) because the worker fills waves serially: only one wave of IO is ever in flight at a time, regardless of how many host-buffer slots are available.
From your comment on #91:
increasing the number of waves in flight is the right direction, but I don't think I fill the waves in parallel right now.
Current behavior
kick_peel_into_free_slots() peels one batch into one free slab slot, submits IO for that slab, then returns. On the next tick (10µs later), if that IO completed, it peels the next batch. If IO is still pending, no new IO is submitted.
On fast local NVMe (~1ms per read), this serialization is invisible — IO completes before the next GPU wave finishes. On networked storage (~50-200ms per uncached read), the GPU frequently finishes its current wave and stalls waiting for the single in-flight IO to complete.
Proposed behavior
When kick_peel_into_free_slots() finds N free slab slots, peel and submit IO for all N in a single tick. This lets N waves of IO be in-flight simultaneously:
- With host_buffer_waves=8 and 2 GPU waves: up to 6 concurrent IO submissions
- Each submission goes to the same io_queue thread pool, which already supports parallel execution across n_io_threads workers
- No change to the GPU-side wave binding — waves still bind one at a time as they complete
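The difference between the two fill strategies can be sketched with a toy model. This is not damacy's code: the `Worker` class, `submit_io`, `kick_serial`, and `kick_all_free` are hypothetical names used only for illustration, and IO submission is reduced to recording which batch went into which slot.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Toy stand-in for the worker; all names here are hypothetical."""
    free_slots: list        # indices of free host-buffer slab slots
    pending_batches: list   # batches not yet peeled
    in_flight: dict = field(default_factory=dict)  # slot -> batch with IO pending

    def submit_io(self, slot, batch):
        # Stand-in for enqueueing the slab's reads on the io_queue thread pool.
        self.in_flight[slot] = batch

    def kick_serial(self):
        """Current behavior as described: one peel per tick, none while IO is pending."""
        if self.in_flight or not self.free_slots or not self.pending_batches:
            return 0
        self.submit_io(self.free_slots.pop(0), self.pending_batches.pop(0))
        return 1

    def kick_all_free(self):
        """Proposed behavior: peel and submit IO for every free slot in one tick."""
        submitted = 0
        while self.free_slots and self.pending_batches:
            self.submit_io(self.free_slots.pop(0), self.pending_batches.pop(0))
            submitted += 1
        return submitted


# Six free slots (host_buffer_waves=8 minus 2 GPU waves), ten batches waiting.
serial = Worker(free_slots=list(range(6)), pending_batches=list(range(10)))
serial.kick_serial()
serial.kick_serial()   # second tick is a no-op: the first IO is still pending

batched = Worker(free_slots=list(range(6)), pending_batches=list(range(10)))
batched.kick_all_free()

print(len(serial.in_flight), len(batched.in_flight))  # prints "1 6"
```

In the real worker the loop would presumably also need to stop when peeling fails or the queue pushes back; the sketch ignores both.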
Expected impact
On networked storage with 50-200ms per read:
- Current: 1 wave in flight, so the GPU starves after each wave completes
- Proposed: up to 6 waves in flight, so the GPU should usually have a ready wave to bind
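A rough back-of-the-envelope model of why depth matters, under stated assumptions: reads for different waves are independent, the io_queue has enough free threads to run them concurrently, and storage bandwidth is not the limit. The function names and the 25 ms GPU wave time are hypothetical placeholders, not measurements from #91.

```python
def wave_period_ms(gpu_wave_ms, io_latency_ms, io_depth):
    """Steady-state time per wave when io_depth reads overlap (simplified model)."""
    if io_depth < 1:
        raise ValueError("io_depth must be at least 1")
    # Either the GPU or the amortized IO latency sets the pace.
    return max(gpu_wave_ms, io_latency_ms / io_depth)


def gpu_busy_fraction(gpu_wave_ms, io_latency_ms, io_depth):
    """Fraction of wall time the GPU spends computing under the same model."""
    return gpu_wave_ms / wave_period_ms(gpu_wave_ms, io_latency_ms, io_depth)


# Hypothetical numbers: 25 ms of GPU work per wave, 100 ms per uncached read.
for depth in (1, 2, 6):
    print(depth, round(gpu_busy_fraction(25, 100, depth), 2))
```

Under these placeholder numbers the GPU is IO-bound at depth 1 and compute-bound at depth 6; real behavior will depend on n_io_threads and on how variable the read latency is.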
This would make host_buffer_waves actually function as a prefetch depth control, matching the intent described in damacy_limits.h:
bumping higher lets IO for upcoming waves prefill before a wave struct frees, useful for slow / variable-latency IO backends.
Benchmarks
From #91:
- damacy (hbw=8, serial fill): 73.3 device_sps
- standard DataLoader (8 workers × pf=4 = 32 batch lookahead): 111.1 device_sps
- damacy with warm page cache: 135.7 device_sps
The warm-cache run shows damacy's GPU pipeline is inherently faster — the bottleneck is purely IO latency hiding.