[XPU] WeightAsyncStreamManager: stream priority and sync strategy need XPU-specific branch #961

@Tyr0727

Environment

  • Hardware: Intel Arc 140V (Lunar Lake iGPU, unified LPDDR5X memory)
  • OS: Windows 11
  • torch version: 2.9.0+xpu
  • Platform: lightx2v_platform with AI_DEVICE = "xpu"

Problem

When running block/phase offload on Intel XPU, WeightAsyncStreamManager uses the
same stream priority and synchronization configuration as CUDA, which causes three distinct problems:

1. HIGH-priority stream crashes on compute kernels

The original code assigns compute_stream = Stream(priority=-1) (HIGH priority).
On Intel Arc XPU, HIGH-priority streams are only safe for copy_() operations.
Using them for heavy compute kernels (oneDNN matmul, attention) causes a hard crash:
RuntimeError: [oneDNN] ... stream priority conflict / illegal use of high-priority stream

2. priority=+1 (LOW) is not supported

torch.xpu.Stream(priority=1) raises an error on Arc 140V — only -1 (HIGH) and
0 (DEFAULT) are accepted.
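
One way to guard against an unsupported priority is to try the requested value and fall back to DEFAULT (0) if the backend rejects it. This is a minimal sketch, not the project's code: `create_stream` and `FakeXpuStream` are hypothetical names, and the exception types caught are an assumption, since the report only says that an error is raised. The fake class stands in for `torch.xpu.Stream` so the sketch runs without XPU hardware.

```python
def create_stream(stream_cls, priority, fallback=0):
    """Create a stream with `priority`, falling back if the backend rejects it.

    Returns (stream, priority_actually_used).
    """
    try:
        return stream_cls(priority=priority), priority
    except (RuntimeError, ValueError):
        # Assumption: the backend signals an unsupported priority with one
        # of these exception types. Adjust to whatever is actually raised.
        return stream_cls(priority=fallback), fallback


class FakeXpuStream:
    """Hypothetical stand-in mimicking the reported Arc 140V behavior."""

    def __init__(self, priority=0):
        if priority not in (-1, 0):
            raise RuntimeError("unsupported stream priority")
        self.priority = priority


# Requesting LOW (+1) falls back to DEFAULT (0); HIGH (-1) is accepted.
low_stream, low_used = create_stream(FakeXpuStream, priority=1)
high_stream, high_used = create_stream(FakeXpuStream, priority=-1)
```

On a real device, `torch.xpu.Stream` would be passed in place of `FakeXpuStream`.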

3. Cross-stream memory visibility requires device-wide sync

After cuda_load_stream.synchronize() on XPU, tensors written in that stream are
not guaranteed to be visible to the compute stream. A per-stream sync is insufficient;
torch.xpu.synchronize() (device-wide) is required to ensure correct H2D prefetch
visibility before compute.
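
The sync strategy described above could be isolated in a small helper, sketched here under assumptions: `wait_for_prefetch` and its `device_synchronize` parameter are hypothetical names, and the device string follows the `AI_DEVICE = "xpu"` setting from the environment section. `torch` is imported lazily so the helper can be exercised with stand-in objects.

```python
def wait_for_prefetch(load_stream, ai_device, device_synchronize=None):
    """Block until prefetched weights should be visible to the compute stream.

    Returns "device" or "stream" to indicate which kind of sync was used.
    """
    if ai_device == "xpu":
        if device_synchronize is None:
            import torch  # lazy import: only needed on the real XPU path

            device_synchronize = torch.xpu.synchronize
        # Per the report, a per-stream sync is not enough on XPU.
        device_synchronize()
        return "device"
    # Other backends (e.g. CUDA): per-stream sync, as in the existing code.
    load_stream.synchronize()
    return "stream"
```

A device-wide sync stalls all queued work, so it trades some copy/compute overlap for correctness; whether a lighter event-based wait would suffice on XPU is not established by this report.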

Root Cause

Intel XPU stream semantics differ from CUDA:

| Behavior | CUDA | Intel XPU (Arc 140V) |
|---|---|---|
| HIGH stream (priority=-1) | safe for all ops | copy-only, crashes on compute |
| LOW stream (priority=+1) | supported | not supported |
| Cross-stream visibility | per-stream sync sufficient | device-wide sync required |
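
The differences in the comparison could be captured as a per-device plan that the stream manager consults at construction time. This is only a sketch of one possible XPU-specific branch: `StreamPlan` and `plan_for_device` are hypothetical names, and the CUDA load-stream priority shown is a placeholder, since the report only states the CUDA compute-stream value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamPlan:
    compute_priority: int   # priority for the compute stream
    load_priority: int      # priority for the weight-copy (H2D) stream
    device_wide_sync: bool  # True: device-wide sync after prefetch


def plan_for_device(ai_device):
    """Return stream settings for the given device string."""
    if ai_device == "xpu":
        # Compute on DEFAULT (0); HIGH (-1) is reserved for copy_() work;
        # LOW (+1) is never requested; prefetch needs a device-wide sync.
        return StreamPlan(compute_priority=0, load_priority=-1, device_wide_sync=True)
    # CUDA: compute stays HIGH (-1) as in the original code.
    # load_priority=0 is a placeholder assumption, not taken from the report.
    return StreamPlan(compute_priority=-1, load_priority=0, device_wide_sync=False)


xpu_plan = plan_for_device("xpu")
cuda_plan = plan_for_device("cuda")
```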
