add magi-compiler support to qwen_image#1089
Conversation
STwangyingrui
commented
May 22, 2026
- optimize step1 time(magi compile time): use block graph instead of model graph
- add magi-compiler support to qwen_image
There was a problem hiding this comment.
Code Review
This pull request integrates magi_compiler into the neopp and qwen_image inference pipelines to enable subgraph-based compilation. It introduces a global switch for custom ops, registers several Triton and RoPE kernels as magi custom ops, and refactors transformer decoder layers to support compilation boundaries. Feedback includes concerns regarding the overhead of creating tensors inside hot loops without specifying devices, the global side effects of modifying torch._dynamo.config at the module level, and potential performance regressions from unnecessary tensor cloning in the standard execution path.
| cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32), | ||
| cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32), |
There was a problem hiding this comment.
Creating cu_seqlens tensors inside the _compute_attn method (which is called for every layer) introduces unnecessary overhead. Additionally, these tensors are created on the CPU by default, which may cause device synchronization or host-to-device copies if the attention implementation expects them on the GPU.
Consider creating these tensors once outside the layer loop or explicitly specifying the device.
| cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32), | |
| cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32), | |
| cu_seqlens_q=torch.tensor([0, seq_len_q], device=query_states.device, dtype=torch.int32), | |
| cu_seqlens_kv=torch.tensor([0, seq_len_k], device=query_states.device, dtype=torch.int32), |
| _dynamo.config.capture_scalar_outputs = False | ||
| _dynamo.config.specialize_int = False | ||
| _dynamo.config.automatic_dynamic_shapes = True |
There was a problem hiding this comment.
Modifying torch._dynamo.config at the class definition level has global side effects that occur as soon as the module is imported. This can lead to unexpected behavior in other parts of the application that rely on default Dynamo settings.
It is better to set these configurations within the __init__ method or a specific setup function, or use a context manager if supported by the environment.
| query = xq.reshape(L, H * D).contiguous().clone() | ||
| key = xk.reshape(L, H * D).contiguous().clone() |
There was a problem hiding this comment.
The addition of .clone() here introduces extra memory allocation and copy operations. While this might be necessary to avoid input mutation issues when using magi_compiler custom ops (which often require functional behavior for subgraph boundaries), it results in a performance regression for the standard eager mode execution path.
If mutation is safe for the caller, consider making the clone conditional or using a registration method that supports in-place operations without excessive dispatch overhead.