Skip to content

add magi-compiler support to qwen_image#1089

Open
STwangyingrui wants to merge 6 commits into
mainfrom
yr/magi_compiler
Open

add magi-compiler support to qwen_image#1089
STwangyingrui wants to merge 6 commits into
mainfrom
yr/magi_compiler

Conversation

@STwangyingrui
Copy link
Copy Markdown
Contributor

  1. optimize step1 time(magi compile time): use block graph instead of model graph
  2. add magi-compiler support to qwen_image

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request integrates magi_compiler into the neopp and qwen_image inference pipelines to enable subgraph-based compilation. It introduces a global switch for custom ops, registers several Triton and RoPE kernels as magi custom ops, and refactors transformer decoder layers to support compilation boundaries. Feedback includes concerns regarding the overhead of creating tensors inside hot loops without specifying devices, the global side effects of modifying torch._dynamo.config at the module level, and potential performance regressions from unnecessary tensor cloning in the standard execution path.

Comment on lines +274 to +275
cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32),
cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Creating cu_seqlens tensors inside the _compute_attn method (which is called for every layer) introduces unnecessary overhead. Additionally, these tensors are created on the CPU by default, which may cause device synchronization or host-to-device copies if the attention implementation expects them on the GPU.

Consider creating these tensors once outside the layer loop or explicitly specifying the device.

Suggested change
cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32),
cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32),
cu_seqlens_q=torch.tensor([0, seq_len_q], device=query_states.device, dtype=torch.int32),
cu_seqlens_kv=torch.tensor([0, seq_len_k], device=query_states.device, dtype=torch.int32),

Comment on lines +493 to +495
_dynamo.config.capture_scalar_outputs = False
_dynamo.config.specialize_int = False
_dynamo.config.automatic_dynamic_shapes = True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Modifying torch._dynamo.config at the class definition level has global side effects that occur as soon as the module is imported. This can lead to unexpected behavior in other parts of the application that rely on default Dynamo settings.

It is better to set these configurations within the __init__ method or a specific setup function, or use a context manager if supported by the environment.

Comment on lines +29 to +30
query = xq.reshape(L, H * D).contiguous().clone()
key = xk.reshape(L, H * D).contiguous().clone()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The addition of .clone() here introduces extra memory allocation and copy operations. While this might be necessary to avoid input mutation issues when using magi_compiler custom ops (which often require functional behavior for subgraph boundaries), it results in a performance regression for the standard eager mode execution path.

If mutation is safe for the caller, consider making the clone conditional or using a registration method that supports in-place operations without excessive dispatch overhead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant