[PerfXLab] optimize mul op #2180

Open
bin913 wants to merge 1 commit into flagos-ai:master from bin913:mul
Conversation

Contributor

@bin913 bin913 commented Mar 30, 2026

PR Category

[Operator]

Type of Change

[Performance Optimization]

Description

Optimize the mul op for complex64 inputs and add Hopper performance support for mul.
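As background for the complex64 path: a pointwise complex multiply is typically lowered to real arithmetic, since each complex64 element is stored as an interleaved (real, imag) pair of float32 values. This is a minimal NumPy sketch of that decomposition (the names and the NumPy framing are illustrative assumptions, not the actual Triton kernel in this PR):

```python
import numpy as np

def mul_complex64_via_float32(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Multiply two complex64 arrays using only float32 arithmetic.

    Illustrative sketch: reinterprets each complex64 element as an
    interleaved (real, imag) float32 pair, the same memory layout a
    pointwise GPU kernel would operate on.
    """
    # Reinterpret complex64 storage as float32 pairs: shape (..., 2).
    xf = x.view(np.float32).reshape(*x.shape, 2)
    yf = y.view(np.float32).reshape(*y.shape, 2)
    ar, ai = xf[..., 0], xf[..., 1]
    br, bi = yf[..., 0], yf[..., 1]
    out = np.empty_like(xf)
    # (ar + ai*i) * (br + bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i
    out[..., 0] = ar * br - ai * bi
    out[..., 1] = ar * bi + ai * br
    # Reinterpret the float32 pairs back as complex64.
    return out.view(np.complex64).reshape(x.shape)
```

Each output element costs four real multiplies and two real adds, which is why the complex64 rows below run slower in absolute terms than the real-dtype rows of the same element count.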

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by unit tests.

Performance

test_binary_pointwise_perf.py::test_general_binary_pointwise_perf[mul-mul-dtypes2]
Operator: mul  Performance Test (dtype=torch.float16, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               2.073632            2.075920               0.999               1.034          [torch.Size([1073741824]), torch.Size([1073741824])]
SUCCESS               0.005600            0.005344               1.048               0.002          [torch.Size([64, 64]), torch.Size([64, 64])]
SUCCESS               0.040064            0.039600               1.012               0.847          [torch.Size([4096, 4096]), torch.Size([4096, 4096])]
SUCCESS               0.039904            0.039744               1.004               0.844          [torch.Size([64, 512, 512]), torch.Size([64, 512, 512])]
SUCCESS               2.074560            2.076496               0.999               1.034          [torch.Size([1024, 1024, 1024]), torch.Size([1024, 1024, 1024])]


Operator: mul  Performance Test (dtype=torch.float32, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               4.158848            4.154160               1.001               0.517          [torch.Size([1073741824]), torch.Size([1073741824])]
SUCCESS               0.005632            0.005568               1.011               0.001          [torch.Size([64, 64]), torch.Size([64, 64])]
SUCCESS               0.072576            0.072224               1.005               0.465          [torch.Size([4096, 4096]), torch.Size([4096, 4096])]
SUCCESS               0.072672            0.072608               1.001               0.462          [torch.Size([64, 512, 512]), torch.Size([64, 512, 512])]
SUCCESS               4.159648            4.154832               1.001               0.517          [torch.Size([1024, 1024, 1024]), torch.Size([1024, 1024, 1024])]


Operator: mul  Performance Test (dtype=torch.bfloat16, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               2.075008            2.076528               0.999               1.034          [torch.Size([1073741824]), torch.Size([1073741824])]
SUCCESS               0.005440            0.005376               1.012               0.002          [torch.Size([64, 64]), torch.Size([64, 64])]
SUCCESS               0.039872            0.039840               1.001               0.842          [torch.Size([4096, 4096]), torch.Size([4096, 4096])]
SUCCESS               0.040096            0.039744               1.009               0.844          [torch.Size([64, 512, 512]), torch.Size([64, 512, 512])]
SUCCESS               2.074704            2.075904               0.999               1.034          [torch.Size([1024, 1024, 1024]), torch.Size([1024, 1024, 1024])]


Operator: mul  Performance Test (dtype=torch.complex64, mode=kernel,level=core)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup               TFLOPS          Size Detail
--------------------------------------------------------------------------------------------------------------------
SUCCESS               8.351456            9.053088               0.922               0.237          [torch.Size([1073741824]), torch.Size([1073741824])]
SUCCESS               0.005728            0.005920               0.968               0.001          [torch.Size([64, 64]), torch.Size([64, 64])]
SUCCESS               0.137344            0.146240               0.939               0.229          [torch.Size([4096, 4096]), torch.Size([4096, 4096])]
SUCCESS               0.137408            0.142976               0.961               0.235          [torch.Size([64, 512, 512]), torch.Size([64, 512, 512])]
SUCCESS               8.349664            8.872928               0.941               0.242          [torch.Size([1024, 1024, 1024]), torch.Size([1024, 1024, 1024])]
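The derived columns in these tables are consistent with Speedup = Torch latency / Gems latency, and with a FLOP count of 2 per output element independent of dtype (which is why the complex64 rows report lower TFLOPS despite doing more real arithmetic per element). A minimal sketch under those assumed conventions:

```python
def speedup(torch_ms: float, gems_ms: float) -> float:
    """Speedup of Gems over Torch: >1 means Gems is faster."""
    return torch_ms / gems_ms

def tflops(numel: int, latency_ms: float, flops_per_elem: int = 2) -> float:
    """Throughput in TFLOPS, assuming a fixed FLOP count per element."""
    return flops_per_elem * numel / (latency_ms * 1e-3) / 1e12

# Example: the float32 row for 1073741824 elements reproduces the
# table's 1.001 speedup and 0.517 TFLOPS.
```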

