Parallel wave filling: submit IO for multiple host-buffer waves concurrently #104

## Context

Follow-up to #91. Increasing `host_buffer_waves` from 2 to 8 had no effect on throughput (it stayed at 73.3 sps) because the worker fills waves serially: only one wave of IO is ever in flight at a time, regardless of how many host-buffer slots are available.

From your comment on #91:
> increasing the number of waves in flight is the right direction, but I don't think I fill the waves in parallel right now.

## Current behavior

`kick_peel_into_free_slots()` peels one batch into one free slab slot, submits IO for that slab, then returns. On the next tick (10µs later), it peels the next batch only if that IO has completed; while the IO is still pending, no new IO is submitted.

On fast local NVMe (~1ms per read), this serialization is invisible — IO completes before the next GPU wave finishes. On networked storage (~50-200ms per uncached read), the GPU frequently finishes its current wave and stalls waiting for the single in-flight IO to complete.

## Proposed behavior

When `kick_peel_into_free_slots()` finds N free slab slots, peel and submit IO for **all N** in a single tick. This lets N waves of IO be in-flight simultaneously:

- With `host_buffer_waves=8` and 2 GPU waves: up to 6 concurrent IO submissions (8 host-buffer slots minus the 2 held by GPU waves)
- Each submission goes to the same `io_queue` thread pool, which already supports parallel execution across `n_io_threads` workers
- No change to the GPU-side wave binding — waves still bind one at a time as they complete
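
A minimal sketch of the proposed loop, to make the change concrete. Everything here is hypothetical except the idea itself: `struct worker`, `enum slot_state`, `peel_and_submit()`, and the `_sketch` suffix are illustrative stand-ins, not the real worker's types or signatures, and the submission counter stands in for an actual enqueue on `io_queue`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slot states; the real slab bookkeeping may differ. */
enum slot_state { SLOT_FREE, SLOT_IO_PENDING, SLOT_READY, SLOT_BOUND };

#define MAX_SLOTS 16

/* Hypothetical, stripped-down worker state. */
struct worker {
    enum slot_state slots[MAX_SLOTS];
    size_t n_slots;       /* plays the role of host_buffer_waves */
    size_t batches_left;  /* batches not yet peeled */
    size_t io_submitted;  /* stand-in for submissions to io_queue */
};

/* Stand-in for "peel one batch into this slot and submit its IO". */
static bool peel_and_submit(struct worker *w, size_t slot)
{
    if (w->batches_left == 0)
        return false;                 /* nothing left to peel */
    assert(w->slots[slot] == SLOT_FREE);
    w->batches_left--;
    w->slots[slot] = SLOT_IO_PENDING;
    w->io_submitted++;
    return true;
}

/* Proposed behavior: visit every free slot in one tick instead of
 * returning after the first submission. Returns the number of waves
 * whose IO was submitted during this tick. */
static size_t kick_peel_into_free_slots_sketch(struct worker *w)
{
    size_t submitted = 0;
    for (size_t i = 0; i < w->n_slots; i++) {
        if (w->slots[i] != SLOT_FREE)
            continue;                 /* bound, pending, or ready */
        if (!peel_and_submit(w, i))
            break;                    /* out of batches */
        submitted++;
    }
    return submitted;
}
```

With 8 slots of which 2 are bound to GPU waves, one call submits IO for the remaining 6; a second call in the same state submits nothing because no slot is free. The current serial behavior corresponds to returning right after the first successful `peel_and_submit()`.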

## Expected impact

On networked storage with 50-200ms per read:
- Current: 1 wave in-flight = GPU starves after each wave completes
- Proposed: 6 waves in-flight = GPU always has a ready wave to bind

This would make `host_buffer_waves` actually function as a prefetch depth control, matching the intent described in `damacy_limits.h`:
> bumping higher lets IO for upcoming waves prefill before a wave struct frees, useful for slow / variable-latency IO backends.
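
As a rough back-of-the-envelope model (an assumption, not a measurement): let $t_{\text{gpu}}$ be the GPU time per wave, $t_{\text{io}}$ the IO latency to fill one wave, and $k$ the number of waves with IO in flight. If the `io_queue` pool and the storage backend can sustain $k$ concurrent reads without latency growing, the steady-state time per wave is approximately

$$
T_{\text{wave}} \approx \max\!\left(t_{\text{gpu}},\ \frac{t_{\text{io}}}{k}\right)
$$

so the GPU stops stalling once $k \ge t_{\text{io}} / t_{\text{gpu}}$. Serial fill pins $k = 1$, giving $T_{\text{wave}} \approx \max(t_{\text{gpu}}, t_{\text{io}})$. With $k = 6$ and $t_{\text{io}} = 200\,\text{ms}$, the IO term drops to about $33\,\text{ms}$ per wave.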

## Benchmarks

From #91:
- damacy (`host_buffer_waves=8`, serial fill): 73.3 device_sps
- standard DataLoader (8 workers × prefetch factor 4 = 32-batch lookahead): 111.1 device_sps
- damacy with warm page cache: 135.7 device_sps

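The warm-cache run (135.7 vs. 111.1 device_sps) shows that damacy's GPU pipeline is faster than the DataLoader baseline once IO is out of the way, so the cold-cache gap comes from IO latency that the serial fill fails to hide.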
