[PerfXLab] optimize mul op #2180

Open
bin913 wants to merge 1 commit into flagos-ai:master from bin913:mul
Conversation

Contributor

@bin913 bin913 commented Mar 30, 2026

PR Category

[Operator]

Type of Change

[Performance Optimization]

Description

Optimize the mul op for complex64 inputs and add Hopper performance support for mul.
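As background for the complex64 path: a pointwise complex multiply is typically lowered to real arithmetic, since each complex64 element is stored as an interleaved (real, imag) pair of float32 values. This is a minimal NumPy sketch of that decomposition (the names and the NumPy framing are illustrative assumptions, not the actual Triton kernel in this PR):

```python
import numpy as np

def mul_complex64_via_float32(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Multiply two complex64 arrays using only float32 arithmetic.

    Illustrative sketch: reinterprets each complex64 element as an
    interleaved (real, imag) float32 pair, the same memory layout a
    pointwise GPU kernel would operate on.
    """
    # Reinterpret complex64 storage as float32 pairs: shape (..., 2).
    xf = x.view(np.float32).reshape(*x.shape, 2)
    yf = y.view(np.float32).reshape(*y.shape, 2)
    ar, ai = xf[..., 0], xf[..., 1]
    br, bi = yf[..., 0], yf[..., 1]
    out = np.empty_like(xf)
    # (ar + ai*i) * (br + bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i
    out[..., 0] = ar * br - ai * bi
    out[..., 1] = ar * bi + ai * br
    # Reinterpret the float32 pairs back as complex64.
    return out.view(np.complex64).reshape(x.shape)
```

Each output element costs four real multiplies and two real adds, which is why the complex64 rows below run slower in absolute terms than the real-dtype rows of the same element count.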

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by unit tests.

Performance

test_binary_pointwise_perf.py::test_general_binary_pointwise_perf[mul-mul-dtypes2]
Operator: mul  Performance Test (dtype=torch.float16, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               2.073632            2.075920               0.999               1.034          [torch.Size([1073741824]), torch.Size([1073741824])]
SUCCESS               0.005600            0.005344               1.048               0.002          [torch.Size([64, 64]), torch.Size([64, 64])]
SUCCESS               0.040064            0.039600               1.012               0.847          [torch.Size([4096, 4096]), torch.Size([4096, 4096])]
SUCCESS               0.039904            0.039744               1.004               0.844          [torch.Size([64, 512, 512]), torch.Size([64, 512, 512])]
SUCCESS               2.074560            2.076496               0.999               1.034          [torch.Size([1024, 1024, 1024]), torch.Size([1024, 1024, 1024])]


Operator: mul  Performance Test (dtype=torch.float32, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               4.158848            4.154160               1.001               0.517          [torch.Size([1073741824]), torch.Size([1073741824])]
SUCCESS               0.005632            0.005568               1.011               0.001          [torch.Size([64, 64]), torch.Size([64, 64])]
SUCCESS               0.072576            0.072224               1.005               0.465          [torch.Size([4096, 4096]), torch.Size([4096, 4096])]
SUCCESS               0.072672            0.072608               1.001               0.462          [torch.Size([64, 512, 512]), torch.Size([64, 512, 512])]
SUCCESS               4.159648            4.154832               1.001               0.517          [torch.Size([1024, 1024, 1024]), torch.Size([1024, 1024, 1024])]


Operator: mul  Performance Test (dtype=torch.bfloat16, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               2.075008            2.076528               0.999               1.034          [torch.Size([1073741824]), torch.Size([1073741824])]
SUCCESS               0.005440            0.005376               1.012               0.002          [torch.Size([64, 64]), torch.Size([64, 64])]
SUCCESS               0.039872            0.039840               1.001               0.842          [torch.Size([4096, 4096]), torch.Size([4096, 4096])]
SUCCESS               0.040096            0.039744               1.009               0.844          [torch.Size([64, 512, 512]), torch.Size([64, 512, 512])]
SUCCESS               2.074704            2.075904               0.999               1.034          [torch.Size([1024, 1024, 1024]), torch.Size([1024, 1024, 1024])]


Operator: mul  Performance Test (dtype=torch.complex64, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               8.351456            9.053088               0.922               0.237          [torch.Size([1073741824]), torch.Size([1073741824])]
SUCCESS               0.005728            0.005920               0.968               0.001          [torch.Size([64, 64]), torch.Size([64, 64])]
SUCCESS               0.137344            0.146240               0.939               0.229          [torch.Size([4096, 4096]), torch.Size([4096, 4096])]
SUCCESS               0.137408            0.142976               0.961               0.235          [torch.Size([64, 512, 512]), torch.Size([64, 512, 512])]
SUCCESS               8.349664            8.872928               0.941               0.242          [torch.Size([1024, 1024, 1024]), torch.Size([1024, 1024, 1024])]
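The derived columns in these tables are consistent with Speedup = Torch latency / Gems latency, and with a FLOP count of 2 per output element independent of dtype (which is why the complex64 rows report lower TFLOPS despite doing more real arithmetic per element). A minimal sketch under those assumed conventions:

```python
def speedup(torch_ms: float, gems_ms: float) -> float:
    """Speedup of Gems over Torch: >1 means Gems is faster."""
    return torch_ms / gems_ms

def tflops(numel: int, latency_ms: float, flops_per_elem: int = 2) -> float:
    """Throughput in TFLOPS, assuming a fixed FLOP count per element."""
    return flops_per_elem * numel / (latency_ms * 1e-3) / 1e12

# Example: the float32 row for 1073741824 elements reproduces the
# table's 1.001 speedup and 0.517 TFLOPS.
```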

