Performance problem of gemm_a16w8 #12

@xiaonans

I tested the performance of the gemm_a16w8 kernel on AMD MI200 and found that, when M is large, it is slower than both PyTorch (rocBLAS) and Triton's GEMM example (https://github.com/xiaonans/triton-gemm-benchmark/blob/main/03-matrix-multiplication.py).

My performance test results are attached below:
[image: performance test results]
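
When comparing kernels across different values of M, it can help to convert measured latency into achieved throughput so that problem sizes are comparable. The helper below is only a sketch: the function name `gemm_tflops` is hypothetical, and it assumes the usual 2·M·N·K operation count for a dense GEMM. It is not taken from FLASHNN or from the benchmark script.

```python
def gemm_tflops(m, n, k, latency_ms):
    """Achieved throughput of an (M x K) @ (K x N) GEMM in TFLOP/s.

    Assumes the conventional dense-GEMM cost of 2*M*N*K floating-point
    operations (one multiply and one add per inner-product term).
    """
    flops = 2 * m * n * k
    seconds = latency_ms * 1e-3
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(f"{gemm_tflops(4096, 4096, 4096, 1.0):.2f} TFLOP/s")
```

Reporting both latency and throughput for each kernel makes it easier to see whether the gap grows with M or stays proportional.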

For this test, I added some code so that autotuning runs on the first invocation and later benchmark runs reuse the saved best_config. My changes are in main...xiaonans:FLASHNN:main. I ran the test with python tests/quant_gemm/test_gemm_weight_only.py.
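
The "tune once, then benchmark with the saved config" workflow can be sketched roughly as follows. This is a minimal standard-library illustration, not the code in the linked branch: `pick_best_config`, `load_or_tune`, the JSON cache file, and the config keys are all hypothetical names chosen for the example.

```python
import json
import time
from pathlib import Path


def pick_best_config(run_kernel, configs, warmup=2, iters=10):
    """Time every candidate config and return the fastest one.

    Note: for a real GPU kernel the device must be synchronized before
    each timestamp is read (e.g. torch.cuda.synchronize()), otherwise
    only the launch overhead is measured.
    """
    best_cfg, best_ms = None, float("inf")
    for cfg in configs:
        for _ in range(warmup):
            run_kernel(**cfg)
        start = time.perf_counter()
        for _ in range(iters):
            run_kernel(**cfg)
        ms = (time.perf_counter() - start) * 1e3 / iters
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg


def load_or_tune(cache_path, key, run_kernel, configs):
    """Return the cached best config for `key`, tuning only on a cache miss."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if key not in cache:
        cache[key] = pick_best_config(run_kernel, configs)
        path.write_text(json.dumps(cache, indent=2))
    return cache[key]
```

Keying the cache by problem shape (for example a string built from M, N, and K) matters here: a config tuned for small M is not necessarily the best one for large M, so reusing a single saved config across shapes could by itself explain part of a slowdown.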

Are these performance results expected, or is there something I missed?
