
Parallel wave filling: submit IO for multiple host-buffer waves concurrently #104

@ritvikvasan

Context

Follow-up to #91. Increasing host_buffer_waves from 2 to 8 had no effect on throughput (it stayed at 73.3 sps) because the worker fills waves serially: only one wave of IO is ever in flight at a time, regardless of how many host-buffer slots are available.

From your comment on #91:

increasing the number of waves in flight is the right direction, but I don't think I fill the waves in parallel right now.

Current behavior

kick_peel_into_free_slots() peels one batch into one free slab slot, submits IO for that slab, then returns. On the next tick (10µs later), if that IO completed, it peels the next batch. If IO is still pending, no new IO is submitted.
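To make the serial behavior concrete, here is a minimal tick-level model in Python. It is a sketch, not damacy code: `run_serial_ticks` and its parameters are hypothetical, GPU-side consumption of ready slots is left out, and `io_ticks=5000` stands for a 50 ms read at the 10µs tick mentioned above.

```python
def run_serial_ticks(n_slots, n_batches, io_ticks, n_ticks):
    """Model the serial fill: at most one IO is submitted while none is pending.

    Hypothetical model, not damacy's implementation.
    Returns (max IOs ever in flight, number of IOs submitted).
    """
    free_slots = n_slots
    remaining = n_batches
    pending_done_at = None  # completion tick of the single in-flight IO
    max_in_flight = 0
    submitted = 0
    for tick in range(n_ticks):
        # Retire the in-flight IO once its completion tick is reached.
        if pending_done_at is not None and tick >= pending_done_at:
            pending_done_at = None
        # Submit one new IO only if nothing is pending and a slot is free.
        if pending_done_at is None and free_slots > 0 and remaining > 0:
            free_slots -= 1
            remaining -= 1
            submitted += 1
            pending_done_at = tick + io_ticks
        in_flight = 1 if pending_done_at is not None else 0
        max_in_flight = max(max_in_flight, in_flight)
    return max_in_flight, submitted


# 200 ms of ticks with a 50 ms read: extra slots never raise IO concurrency.
two_slots = run_serial_ticks(n_slots=2, n_batches=100, io_ticks=5000, n_ticks=20000)
eight_slots = run_serial_ticks(n_slots=8, n_batches=100, io_ticks=5000, n_ticks=20000)
```

In this model, going from 2 to 8 slots lets more reads be issued over the window, but they are still issued one after another, so the in-flight count stays at 1.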

On fast local NVMe (~1ms per read), this serialization is invisible — IO completes before the next GPU wave finishes. On networked storage (~50-200ms per uncached read), the GPU frequently finishes its current wave and stalls waiting for the single in-flight IO to complete.
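A rough back-of-the-envelope model of this effect is sketched below. It assumes IO and GPU work overlap perfectly, that every in-flight read is served in parallel by the IO pool, and a GPU wave time of 20 ms; that last number is an assumption chosen for illustration and does not come from #91.

```python
def wave_period_ms(io_latency_ms, gpu_wave_ms, waves_in_flight):
    """Steady-state time per wave when IO overlaps with GPU work.

    With k reads in flight, one read completes every io_latency / k on average.
    """
    effective_io_ms = io_latency_ms / waves_in_flight
    return max(effective_io_ms, gpu_wave_ms)


def gpu_stall_fraction(io_latency_ms, gpu_wave_ms, waves_in_flight):
    """Fraction of each wave period the GPU spends waiting on IO."""
    period = wave_period_ms(io_latency_ms, gpu_wave_ms, waves_in_flight)
    return 1.0 - gpu_wave_ms / period


GPU_WAVE_MS = 20.0  # assumed value, for illustration only

nvme_serial = gpu_stall_fraction(1.0, GPU_WAVE_MS, waves_in_flight=1)
network_serial = gpu_stall_fraction(100.0, GPU_WAVE_MS, waves_in_flight=1)
network_parallel = gpu_stall_fraction(100.0, GPU_WAVE_MS, waves_in_flight=6)
```

Under these assumptions the GPU never waits on 1 ms NVMe reads, waits about 80% of the time on a 100 ms read with one IO in flight, and stops waiting once six reads overlap.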

Proposed behavior

When kick_peel_into_free_slots() finds N free slab slots, peel and submit IO for all N in a single tick. This lets N waves of IO be in-flight simultaneously:

  • With host_buffer_waves=8 and 2 GPU waves: up to 6 concurrent IO submissions
  • Each submission goes to the same io_queue thread pool, which already supports parallel execution across n_io_threads workers
  • No change to the GPU-side wave binding — waves still bind one at a time as they complete
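The proposed tick could look roughly like the Python sketch below. It is a stand-in under stated assumptions, not the actual implementation: the function signature, `fake_read`, and the use of `ThreadPoolExecutor` in place of the io_queue thread pool are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def kick_peel_into_free_slots(free_slots, batches, io_pool, read_batch):
    """Peel one batch per free slot and submit all of the IO in this tick.

    Hypothetical signature; the name mirrors the issue text only.
    Returns a dict mapping slot id -> Future for the submitted read.
    """
    submitted = {}
    while free_slots and batches:
        slot = free_slots.pop()
        batch = batches.pop(0)
        submitted[slot] = io_pool.submit(read_batch, batch)
    return submitted


def fake_read(batch_id):
    # Placeholder for a real storage read.
    return f"batch-{batch_id}"


host_buffer_waves = 8
gpu_waves = 2  # slots currently bound to GPU waves
free_slots = list(range(host_buffer_waves - gpu_waves))
batches = list(range(10))

# max_workers plays the role of n_io_threads in this sketch.
with ThreadPoolExecutor(max_workers=4) as io_pool:
    in_flight = kick_peel_into_free_slots(free_slots, batches, io_pool, fake_read)
    results = {slot: fut.result() for slot, fut in in_flight.items()}
```

With 8 host-buffer slots and 2 bound to GPU waves, a single call submits 6 reads and leaves the remaining batches for later ticks. How many of those reads actually run at once is still capped by the pool size.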

Expected impact

On networked storage with 50-200ms per read:

  • Current: 1 wave in-flight = GPU starves after each wave completes
  • Proposed: 6 waves in-flight = GPU always has a ready wave to bind

This would make host_buffer_waves actually function as a prefetch depth control, matching the intent described in damacy_limits.h:

bumping higher lets IO for upcoming waves prefill before a wave struct frees, useful for slow / variable-latency IO backends.

Benchmarks

From #91:

  • damacy (hbw=8, serial fill): 73.3 device_sps
  • standard DataLoader (8 workers × pf=4 = 32 batch lookahead): 111.1 device_sps
  • damacy with warm page cache: 135.7 device_sps

The warm-cache run beats both other runs, which suggests damacy's GPU pipeline is not the limit here; the remaining bottleneck is IO latency hiding.
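For a quick sense of scale, the numbers above can be normalized against the warm-cache run. The dictionary keys are labels invented for this snippet; the values are the device_sps figures quoted from #91.

```python
# device_sps figures quoted from #91; the key names are made up for this snippet.
device_sps = {
    "damacy_serial_hbw8": 73.3,
    "dataloader_8x4": 111.1,
    "damacy_warm_cache": 135.7,
}

warm = device_sps["damacy_warm_cache"]
relative_to_warm = {name: sps / warm for name, sps in device_sps.items()}

# Lookahead of the standard DataLoader run: 8 workers x prefetch factor 4.
dataloader_lookahead_batches = 8 * 4
```

The serial-fill run reaches roughly 54% of the warm-cache throughput, while the DataLoader with its 32-batch lookahead reaches roughly 82%.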
