
Conversation


@jasonlizhengjian jasonlizhengjian commented Oct 1, 2025

Purpose

Change the heuristic so that the FlashInfer TRTLLM attention kernel is used more often for prefill. Previously it was only used for <= 256 tokens, despite being faster (benchmarks below) in every case tested.
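The gist of the change can be sketched as follows (a minimal sketch; the function and argument names here are hypothetical illustrations, not vLLM's actual API):

```python
# Hypothetical sketch of the heuristic described above. Previously the
# TRTLLM kernel was only selected when num_tokens <= 256 regardless of
# phase; this PR lifts that cap for prefill while keeping it for decode.

def use_trtllm_attention(num_tokens: int, is_prefill: bool,
                         decode_token_limit: int = 256) -> bool:
    if is_prefill:
        # Benchmarks show TRTLLM is faster for every tested prefill
        # shape, so no token cap is applied for prefill any more.
        return True
    # Decode batches keep the existing 256-token limit.
    return num_tokens <= decode_token_limit
```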

Test Plan

Benchmarked with `benchmarks/kernels/benchmark_trtllm_prefill_attention.py`; data points that caused OOM were left out.

Test Result

Benchmark results are below. `speedup_%` is positive in every case, meaning TRTLLM attention is always faster for prefill.
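Reading the tables: `speedup_%` appears to be the fractional latency reduction of TRTLLM relative to the baseline kernel, which can be checked against any row (the helper below is an illustration, not part of the benchmark script):

```python
def speedup_pct(trtllm_mean: float, baseline_mean: float) -> float:
    """Fractional latency reduction of TRTLLM vs. the baseline kernel."""
    return 1.0 - trtllm_mean / baseline_mean

# Row (batch_size=256, max_seq_len=512): 5.707 ms vs 11.320 ms baseline
print(round(speedup_pct(5.707, 11.320), 3))  # matches the table's 0.496
```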

Running benchmark for q_dtype = torch.bfloat16, kv_cache_dtype: torch.bfloat16, output_dtype: torch.bfloat16
batch_size   max_seq_len  trtllm_mean  trtllm_std   baseline_mean  baseline_std  speedup_% 
1            512          0.085        0.004        0.225          0.007         0.624     
4            512          0.151        0.004        0.385          0.007         0.607     
8            512          0.260        0.005        0.648          0.010         0.598     
16           512          0.370        0.005        0.871          0.010         0.576     
32           512          0.718        0.004        1.683          0.007         0.573     
64           512          1.391        0.026        3.249          0.006         0.572     
128          512          2.708        0.146        5.844          0.008         0.537     
256          512          5.707        0.198        11.320         0.008         0.496     
1            1024         0.106        0.010        0.251          0.011         0.580     
1            1024         0.085        0.005        0.225          0.007         0.620     
4            1024         0.171        0.008        0.408          0.013         0.581     
4            1024         0.151        0.004        0.385          0.006         0.608     
8            1024         0.262        0.005        0.644          0.009         0.594     
8            1024         0.261        0.004        0.643          0.011         0.594     
16           1024         0.371        0.005        0.872          0.007         0.575     
16           1024         0.371        0.004        0.875          0.009         0.577     
32           1024         0.720        0.005        1.684          0.008         0.572     
32           1024         0.720        0.004        1.685          0.009         0.573     
64           1024         1.423        0.051        3.254          0.013         0.563     
64           1024         1.413        0.040        3.252          0.006         0.565     
128          1024         2.832        0.124        5.843          0.009         0.515     
128          1024         2.824        0.101        5.843          0.009         0.517     
256          1024         5.691        0.099        11.318         0.007         0.497     
256          1024         5.720        0.059        11.316         0.009         0.495     
1            2048         0.230        0.005        0.757          0.009         0.696     
4            2048         0.344        0.004        1.121          0.020         0.693     
8            2048         0.707        0.004        2.094          0.010         0.663     
16           2048         1.688        0.085        4.852          0.081         0.652     
32           2048         2.829        0.158        7.477          0.010         0.622     
64           2048         5.790        0.046        14.610         0.010         0.604     
128          2048         9.043        0.064        22.186         0.011         0.592     
256          2048         18.523       0.224        44.758         0.010         0.586     
1            4096         0.680        0.005        2.610          0.055         0.739     
4            4096         1.769        0.072        5.857          0.029         0.698     
8            4096         2.119        0.110        6.534          0.024         0.676     
16           4096         3.332        0.143        9.605          0.013         0.653     
32           4096         7.574        0.128        21.123         0.013         0.641     
64           4096         14.192       0.096        39.681         0.014         0.642     
128          4096         30.590       0.274        83.658         0.016         0.634     
256          4096         61.134       0.313        167.600        0.018         0.635     
1            8192         2.681        0.153        8.959          0.017         0.701     
4            8192         6.259        0.059        19.451         0.017         0.678     
8            8192         8.759        0.069        27.158         0.014         0.677     
16           8192         11.714       0.144        35.414         0.013         0.669     
32           8192         28.710       0.281        86.261         0.011         0.667     
64           8192         52.886       0.198        157.488        0.017         0.664     
128          8192         105.127      1.045        311.325        0.035         0.662     
256          8192         225.979      0.295        674.096        0.066         0.665     
1            16384        10.551       0.117        34.815         0.015         0.697     
4            16384        16.335       0.259        52.433         0.054         0.688     
8            16384        25.001       0.294        80.182         0.019         0.688     
16           16384        33.308       0.310        104.168        0.018         0.680     
32           16384        79.811       0.474        248.932        0.022         0.679     
64           16384        184.203      0.214        573.282        0.043         0.679     
128          16384        393.760      0.540        1232.027       0.055         0.680     
1            32768        42.018       0.389        137.572        0.029         0.695     
4            32768        82.313       0.399        267.357        0.018         0.692     
8            32768        121.479      0.587        391.364        0.021         0.690     
16           32768        208.660      0.862        672.257        0.030         0.690     
32           32768        429.713      0.577        1388.193       0.104         0.690     
64           32768        786.759      1.228        2536.631       0.034         0.690     
1            65536        165.439      0.401        547.283        0.049         0.698     
4            65536        308.944      0.689        1016.233       0.789         0.696     
8            65536        508.776      0.606        1667.827       0.090         0.695     
16           65536        974.115      1.487        3200.457       0.106         0.696     
32           65536        1809.411     0.564        5929.095       2.048         0.695     
1            131072       666.344      0.679        2183.591       0.146         0.695     
4            131072       1230.011     1.359        4031.369       0.090         0.695     
8            131072       2139.787     1.615        7026.548       0.105         0.695     

Running benchmark for q_dtype = torch.float8_e4m3fn, kv_cache_dtype: torch.float8_e4m3fn, output_dtype: torch.bfloat16
batch_size   max_seq_len  trtllm_mean  trtllm_std   baseline_mean  baseline_std  speedup_% 
1            1024         0.183        0.009        0.239          0.009         0.235     
4            1024         0.327        0.009        0.414          0.012         0.208     
8            1024         0.564        0.004        0.648          0.008         0.129     
16           1024         0.774        0.004        0.872          0.007         0.113     
32           1024         1.539        0.004        1.683          0.008         0.086     
64           1024         3.016        0.004        3.249          0.007         0.072     
128          1024         5.515        0.004        5.841          0.009         0.056     
256          1024         10.668       0.004        11.317         0.008         0.057     
1            2048         0.501        0.004        0.686          0.008         0.270     
4            2048         0.802        0.004        1.077          0.015         0.256     
8            2048         1.661        0.004        2.094          0.011         0.207     
16           2048         3.815        0.004        4.832          0.009         0.211     
32           2048         6.069        0.004        7.478          0.008         0.188     
64           2048         11.779       0.004        14.606         0.011         0.194     
128          2048         18.229       0.005        22.184         0.010         0.178     
256          2048         36.680       0.004        44.756         0.006         0.180     
1            4096         1.691        0.004        2.415          0.008         0.300     
4            4096         4.198        0.004        5.836          0.011         0.281     
8            4096         4.819        0.004        6.517          0.012         0.261     
16           4096         7.206        0.004        9.616          0.065         0.251     
32           4096         15.930       0.004        21.123         0.011         0.246     
64           4096         29.901       0.005        39.681         0.016         0.246     
128          4096         63.282       0.006        83.665         0.017         0.244     
256          4096         126.430      0.009        167.562        0.110         0.245     
1            8192         6.232        0.004        8.938          0.019         0.303     
4            8192         13.811       0.005        19.430         0.034         0.289     
8            8192         19.398       0.005        27.146         0.029         0.285     
16           8192         25.451       0.005        35.427         0.015         0.282     
32           8192         61.390       0.006        86.271         0.015         0.288     
64           8192         112.926      0.008        157.502        0.019         0.283     
128          8192         223.928      0.283        311.317        0.030         0.281     
256          8192         482.216      0.025        674.070        0.046         0.285     
1            16384        23.895       0.005        34.816         0.014         0.314     
4            16384        36.135       0.007        52.407         0.016         0.310     
8            16384        55.230       0.006        80.173         0.017         0.311     
16           16384        72.715       0.007        104.157        0.013         0.302     
32           16384        173.534      0.010        248.921        0.020         0.303     
64           16384        400.088      0.019        573.297        0.079         0.302     
128          16384        857.811      0.039        1231.998       0.064         0.304     
1            32768        93.554       0.010        137.564        0.024         0.320     
4            32768        182.752      0.012        267.346        0.022         0.316     
8            32768        268.636      0.013        391.353        0.020         0.314     
16           32768        460.093      0.017        672.222        0.042         0.316     
32           32768        950.321      0.060        1388.119       0.037         0.315     
64           32768        1736.041     0.040        2536.647       0.099         0.316     
1            65536        369.510      0.016        547.252        0.049         0.325     
4            65536        687.200      0.023        1016.028       0.048         0.324     
8            65536        1130.718     0.036        1667.773       0.051         0.322     
16           65536        2169.437     0.047        3200.491       0.080         0.322     
32           65536        4020.369     0.114        5928.139       0.074         0.322     
1            131072       1468.802     0.073        2183.414       0.122         0.327     
4            131072       2717.163     0.092        4031.236       0.160         0.326     
8            131072       4741.062     0.458        7026.366       0.140         0.325     
16           131072       9190.310     0.236        13626.361      0.214         0.326     

Running benchmark for q_dtype = torch.float8_e4m3fn, kv_cache_dtype: torch.float8_e4m3fn, output_dtype: torch.float8_e4m3fn
batch_size   max_seq_len  trtllm_mean  trtllm_std   baseline_mean  baseline_std  speedup_% 
1            1024         0.203        0.014        0.224          0.006         0.090     
4            1024         0.343        0.014        0.439          0.020         0.218     
8            1024         0.553        0.004        0.666          0.012         0.169     
16           1024         0.756        0.003        0.874          0.006         0.135     
32           1024         1.506        0.004        1.680          0.009         0.104     
64           1024         2.956        0.004        3.250          0.009         0.090     
128          1024         5.401        0.004        5.845          0.007         0.076     
256          1024         10.445       0.004        11.319         0.011         0.077     
1            2048         0.496        0.004        0.691          0.010         0.282     
4            2048         0.792        0.004        1.076          0.009         0.263     
8            2048         1.641        0.004        2.100          0.010         0.219     
16           2048         3.784        0.007        4.829          0.010         0.216     
32           2048         5.999        0.006        7.486          0.019         0.199     
64           2048         11.642       0.006        14.629         0.019         0.204     
128          2048         17.980       0.004        22.220         0.018         0.191     
256          2048         36.190       0.004        44.846         0.010         0.193     
1            4096         1.684        0.005        2.415          0.010         0.303     
4            4096         4.180        0.007        5.845          0.012         0.285     
8            4096         4.792        0.006        6.518          0.013         0.265     
16           4096         7.157        0.008        9.625          0.014         0.256     
32           4096         15.818       0.005        21.162         0.011         0.253     
64           4096         29.672       0.031        39.760         0.009         0.254     
128          4096         62.813       0.005        83.832         0.025         0.251     
256          4096         125.496      0.011        167.590        0.019         0.251     
1            8192         6.236        0.005        8.951          0.011         0.303     
4            8192         13.785       0.018        19.453         0.012         0.291     
8            8192         19.379       0.018        27.153         0.015         0.286     
16           8192         25.352       0.023        35.415         0.018         0.284     
32           8192         61.165       0.004        86.269         0.015         0.291     
64           8192         112.484      0.008        157.491        0.015         0.286     
128          8192         222.514      0.014        311.309        0.020         0.285     
256          8192         480.541      0.518        674.053        0.027         0.287     
1            16384        23.868       0.006        34.811         0.016         0.314     
4            16384        36.067       0.005        52.410         0.018         0.312     
8            16384        55.148       0.005        80.176         0.013         0.312     
16           16384        72.556       0.007        104.160        0.015         0.303     
32           16384        173.144      0.010        248.913        0.019         0.304     
64           16384        399.187      0.015        573.269        0.045         0.304     
128          16384        856.077      0.020        1231.988       0.071         0.305     
1            32768        93.505       0.010        137.556        0.023         0.320     
4            32768        182.645      0.010        267.334        0.019         0.317     
8            32768        268.389      0.013        391.347        0.022         0.314     
16           32768        459.644      0.024        672.214        0.028         0.316     
32           32768        949.388      0.029        1388.101       0.027         0.316     
64           32768        1734.357     0.036        2536.578       0.043         0.316     
1            65536        369.469      0.017        547.286        0.084         0.325     
4            65536        687.046      0.028        1016.018       0.062         0.324     
8            65536        1130.251     0.037        1667.711       0.044         0.322     
16           65536        2168.585     0.049        3200.411       0.073         0.322     
32           65536        4019.123     0.102        5928.143       0.086         0.322     
1            131072       1469.114     0.080        2183.480       0.148         0.327     
4            131072       2717.344     0.087        4031.260       0.122         0.326     
8            131072       4741.420     0.868        7026.723       0.830         0.325     
16           131072       9191.084     2.467        13628.266      3.945         0.326     

Running benchmark for q_dtype = torch.float8_e4m3fn, kv_cache_dtype: torch.float8_e4m3fn, output_dtype: torch.uint8
batch_size   max_seq_len  trtllm_mean  trtllm_std   baseline_mean  baseline_std  speedup_% 
1            1024         0.168        0.005        0.224          0.006         0.249     
4            1024         0.307        0.005        0.385          0.009         0.203     
8            1024         0.560        0.006        0.647          0.008         0.134     
16           1024         0.763        0.005        0.872          0.008         0.125     
32           1024         1.512        0.005        1.683          0.012         0.102     
64           1024         2.961        0.006        3.249          0.006         0.089     
128          1024         5.406        0.006        5.842          0.008         0.075     
256          1024         10.449       0.006        11.316         0.009         0.077     
1            2048         0.503        0.006        0.687          0.011         0.268     
4            2048         0.798        0.006        1.077          0.011         0.259     
8            2048         1.647        0.005        2.102          0.008         0.216     
16           2048         3.784        0.005        4.829          0.010         0.216     
32           2048         6.004        0.005        7.478          0.016         0.197     
64           2048         11.650       0.006        14.624         0.020         0.203     
128          2048         17.988       0.005        22.226         0.017         0.191     
256          2048         36.198       0.006        44.848         0.007         0.193     
1            4096         1.689        0.006        2.415          0.008         0.301     
4            4096         4.183        0.006        5.837          0.009         0.283     
8            4096         4.803        0.009        6.520          0.013         0.263     
16           4096         7.164        0.006        9.610          0.012         0.255     
32           4096         15.832       0.008        21.160         0.016         0.252     
64           4096         29.673       0.006        39.758         0.015         0.254     
128          4096         62.841       0.006        83.833         0.016         0.250     
256          4096         125.547      0.008        167.594        0.015         0.251     
1            8192         6.226        0.005        8.942          0.015         0.304     
4            8192         13.784       0.005        19.450         0.016         0.291     
8            8192         19.345       0.005        27.160         0.013         0.288     
16           8192         25.368       0.026        35.416         0.012         0.284     
32           8192         61.197       0.008        86.266         0.013         0.291     
64           8192         112.525      0.006        157.496        0.015         0.286     
128          8192         222.802      0.612        311.322        0.022         0.284     
256          8192         480.518      0.066        674.062        0.023         0.287     
1            16384        23.878       0.007        34.817         0.014         0.314     
4            16384        36.085       0.007        52.403         0.010         0.311     
8            16384        55.174       0.008        80.175         0.015         0.312     
16           16384        72.583       0.009        104.163        0.016         0.303     
32           16384        173.203      0.010        248.918        0.022         0.304     
64           16384        399.331      0.013        573.256        0.014         0.303     
128          16384        856.358      0.031        1231.987       0.020         0.305     
1            32768        93.533       0.007        137.565        0.024         0.320     
4            32768        182.690      0.009        267.349        0.019         0.317     
8            32768        268.479      0.013        391.353        0.019         0.314     
16           32768        459.813      0.014        672.232        0.025         0.316     
32           32768        949.696      0.025        1388.138       0.069         0.316     
64           32768        1734.845     0.026        2536.532       0.066         0.316     
1            65536        369.572      0.018        547.255        0.040         0.325     
4            65536        687.206      0.033        1016.012       0.030         0.324     
8            65536        1130.639     0.071        1667.740       0.049         0.322     
16           65536        2169.175     0.076        3200.392       0.100         0.322     
32           65536        4020.166     0.118        5928.197       0.111         0.322     
1            131072       1469.347     0.077        2183.452       0.105         0.327     
4            131072       2717.722     0.095        4031.260       0.120         0.326     
8            131072       4742.195     0.100        7026.247       0.076         0.325     
16           131072       9192.495     0.339        13626.595      0.174         0.325     



Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

jasonlizhengjian and others added 2 commits September 24, 2025 15:10
…ction

- Remove inappropriate 256 token limit for prefill sequences
- Keep existing 256 token limit for decode batches
- Add context-specific logging to distinguish prefill vs decode
- Fixes issue where prefill sequences > 256 tokens incorrectly fell back to FlashInfer
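The prefill-vs-decode distinction these commits rely on can be sketched like this (a toy sketch; `is_pure_decode` and its signature are assumptions, not vLLM's actual helpers):

```python
def is_pure_decode(query_lens: list[int]) -> bool:
    """A batch is pure decode when every request contributes exactly one
    new query token; anything longer goes down the prefill path, where
    the 256-token cap no longer applies."""
    return all(q == 1 for q in query_lens)
```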

Signed-off-by: jasonlizhengjian <[email protected]>

github-actions bot commented Oct 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the heuristic for using the FlashInfer TRTLLM attention kernel during prefill, removing the previous token limit of 256. This change is well-supported by the provided benchmarks, which show significant performance improvements. My review focuses on refining this heuristic to prevent potential out-of-memory errors. I've suggested adding a new token limit based on the successful benchmarked configurations to ensure stability while retaining the performance benefits.

Member

@mgoin mgoin left a comment


Thank you for the benchmarks, that seems pretty clear. Have you also looked at the decode case? We are likely to see large decode batches on B200, given the default max_num_seqs is 1024.

@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Oct 1, 2025
Contributor

@elvischenv elvischenv left a comment


You are going to enable the prefill kernel for BS > 256, but the benchmark only shows BS <= 256. I believe the perf should be good, but can you show it?

@jasonlizhengjian
Author

You are going to enable the prefill kernel for BS > 256, but the benchmark only shows BS <= 256. I believe the perf should be good, but can you show it?

Here are the results with batch sizes up to 2048:


Running benchmark for q_dtype = torch.bfloat16, kv_cache_dtype: torch.bfloat16, output_dtype: torch.bfloat16
batch_size   max_seq_len  trtllm_mean  trtllm_std   baseline_mean  baseline_std  speedup_% 
1            1024         0.106        0.010        0.251          0.011         0.580     
1            1024         0.085        0.005        0.225          0.007         0.620     
4            1024         0.171        0.008        0.408          0.013         0.581     
4            1024         0.151        0.004        0.385          0.006         0.608     
8            1024         0.262        0.005        0.644          0.009         0.594     
8            1024         0.261        0.004        0.643          0.011         0.594     
16           1024         0.371        0.005        0.872          0.007         0.575     
16           1024         0.371        0.004        0.875          0.009         0.577     
32           1024         0.720        0.005        1.684          0.008         0.572     
32           1024         0.720        0.004        1.685          0.009         0.573     
64           1024         1.423        0.051        3.254          0.013         0.563     
64           1024         1.413        0.040        3.252          0.006         0.565     
128          1024         2.832        0.124        5.843          0.009         0.515     
128          1024         2.824        0.101        5.843          0.009         0.517     
256          1024         5.691        0.099        11.318         0.007         0.497     
256          1024         5.720        0.059        11.316         0.009         0.495     
512          1024         11.504       0.102        22.663         0.010         0.492     
1024         1024         23.698       0.127        46.171         0.016         0.487     
2048         1024         46.607       0.096        90.482         0.010         0.485     
1            2048         0.230        0.005        0.757          0.009         0.696     
4            2048         0.344        0.004        1.121          0.020         0.693     
8            2048         0.707        0.004        2.094          0.010         0.663     
16           2048         1.688        0.085        4.852          0.081         0.652     
32           2048         2.829        0.158        7.477          0.010         0.622     
64           2048         5.790        0.046        14.610         0.010         0.604     
128          2048         9.043        0.064        22.186         0.011         0.592     
256          2048         18.523       0.224        44.758         0.010         0.586     
512          2048         35.542       0.264        84.167         0.069         0.578     
1024         2048         70.682       0.171        168.802        0.077         0.581     
1            4096         0.680        0.005        2.610          0.055         0.739     
4            4096         1.769        0.072        5.857          0.029         0.698     
8            4096         2.119        0.110        6.534          0.024         0.676     
16           4096         3.332        0.143        9.605          0.013         0.653     
32           4096         7.574        0.128        21.123         0.013         0.641     
64           4096         14.192       0.096        39.681         0.014         0.642     
128          4096         30.590       0.274        83.658         0.016         0.634     
256          4096         61.134       0.313        167.600        0.018         0.635     
512          4096         123.325      0.314        335.092        0.278         0.632     
1            8192         2.681        0.153        8.959          0.017         0.701     
4            8192         6.259        0.059        19.451         0.017         0.678     
8            8192         8.759        0.069        27.158         0.014         0.677     
16           8192         11.714       0.144        35.414         0.013         0.669     
32           8192         28.710       0.281        86.261         0.011         0.667     
64           8192         52.886       0.198        157.488        0.017         0.664     
128          8192         105.127      1.045        311.325        0.035         0.662     
256          8192         225.979      0.295        674.096        0.066         0.665     
1            16384        10.551       0.117        34.815         0.015         0.697     
4            16384        16.335       0.259        52.433         0.054         0.688     
8            16384        25.001       0.294        80.182         0.019         0.688     
16           16384        33.308       0.310        104.168        0.018         0.680     
32           16384        79.811       0.474        248.932        0.022         0.679     
64           16384        184.203      0.214        573.282        0.043         0.679     
128          16384        393.760      0.540        1232.027       0.055         0.680     
1            32768        42.018       0.389        137.572        0.029         0.695     
4            32768        82.313       0.399        267.357        0.018         0.692     
8            32768        121.479      0.587        391.364        0.021         0.690     
16           32768        208.660      0.862        672.257        0.030         0.690     
32           32768        429.713      0.577        1388.193       0.104         0.690     
64           32768        786.759      1.228        2536.631       0.034         0.690     
1            65536        165.439      0.401        547.283        0.049         0.698     
4            65536        308.944      0.689        1016.233       0.789         0.696     
8            65536        508.776      0.606        1667.827       0.090         0.695     
16           65536        974.115      1.487        3200.457       0.106         0.696     
32           65536        1809.411     0.564        5929.095       2.048         0.695     
1            131072       666.344      0.679        2183.591       0.146         0.695     
4            131072       1230.011     1.359        4031.369       0.090         0.695     
8            131072       2139.787     1.615        7026.548       0.105         0.695     
16           131072       4180.238     2.160        13621.528      2.077         0.693     

Running benchmark for q_dtype = torch.float8_e4m3fn, kv_cache_dtype: torch.float8_e4m3fn, output_dtype: torch.bfloat16
batch_size   max_seq_len  trtllm_mean  trtllm_std   baseline_mean  baseline_std  speedup_% 
1            1024         0.183        0.009        0.239          0.009         0.235     
4            1024         0.327        0.009        0.414          0.012         0.208     
8            1024         0.564        0.004        0.648          0.008         0.129     
16           1024         0.774        0.004        0.872          0.007         0.113     
32           1024         1.539        0.004        1.683          0.008         0.086     
64           1024         3.016        0.004        3.249          0.007         0.072     
128          1024         5.515        0.004        5.841          0.009         0.056     
256          1024         10.668       0.004        11.317         0.008         0.057     
512          1024         21.396       0.033        22.655         0.009         0.056     
1024         1024         43.156       0.005        46.166         0.010         0.065     
2048         1024         84.889       0.010        90.471         0.008         0.062     
1            2048         0.501        0.004        0.686          0.008         0.270     
4            2048         0.802        0.004        1.077          0.015         0.256     
8            2048         1.661        0.004        2.094          0.011         0.207     
16           2048         3.815        0.004        4.832          0.009         0.211     
32           2048         6.069        0.004        7.478          0.008         0.188     
64           2048         11.779       0.004        14.606         0.011         0.194     
128          2048         18.229       0.005        22.184         0.010         0.178     
256          2048         36.680       0.004        44.756         0.006         0.180     
512          2048         69.404       0.008        84.215         0.013         0.176     
1024         2048         138.821      0.009        168.571        0.151         0.176     
1            4096         1.691        0.004        2.415          0.008         0.300     
4            4096         4.198        0.004        5.836          0.011         0.281     
8            4096         4.819        0.004        6.517          0.012         0.261     
16           4096         7.206        0.004        9.616          0.065         0.251     
32           4096         15.930       0.004        21.123         0.011         0.246     
64           4096         29.901       0.005        39.681         0.016         0.246     
128          4096         63.282       0.006        83.665         0.017         0.244     
256          4096         126.430      0.009        167.562        0.110         0.245     
512          4096         253.506      0.008        334.898        0.019         0.243     
1            8192         6.232        0.004        8.938          0.019         0.303     
4            8192         13.811       0.005        19.430         0.034         0.289     
8            8192         19.398       0.005        27.146         0.029         0.285     
16           8192         25.451       0.005        35.427         0.015         0.282     
32           8192         61.390       0.006        86.271         0.015         0.288     
64           8192         112.926      0.008        157.502        0.019         0.283     
128          8192         223.928      0.283        311.317        0.030         0.281     
256          8192         482.216      0.025        674.070        0.046         0.285     
1            16384        23.895       0.005        34.816         0.014         0.314     
4            16384        36.135       0.007        52.407         0.016         0.310     
8            16384        55.230       0.006        80.173         0.017         0.311     
16           16384        72.715       0.007        104.157        0.013         0.302     
32           16384        173.534      0.010        248.921        0.020         0.303     
64           16384        400.088      0.019        573.297        0.079         0.302     
128          16384        857.811      0.039        1231.998       0.064         0.304     
1            32768        93.554       0.010        137.564        0.024         0.320     
4            32768        182.752      0.012        267.346        0.022         0.316     
8            32768        268.636      0.013        391.353        0.020         0.314     
16           32768        460.093      0.017        672.222        0.042         0.316     
32           32768        950.321      0.060        1388.119       0.037         0.315     
64           32768        1736.041     0.040        2536.647       0.099         0.316     
1            65536        369.510      0.016        547.252        0.049         0.325     
4            65536        687.200      0.023        1016.028       0.048         0.324     
8            65536        1130.718     0.036        1667.773       0.051         0.322     
16           65536        2169.437     0.047        3200.491       0.080         0.322     
32           65536        4020.369     0.114        5928.139       0.074         0.322     
1            131072       1468.802     0.073        2183.414       0.122         0.327     
4            131072       2717.163     0.092        4031.236       0.160         0.326     
8            131072       4741.062     0.458        7026.366       0.140         0.325     
16           131072       9190.310     0.236        13626.361      0.214         0.326     

Running benchmark for q_dtype = torch.float8_e4m3fn, kv_cache_dtype: torch.float8_e4m3fn, output_dtype: torch.float8_e4m3fn
batch_size   max_seq_len  trtllm_mean  trtllm_std   baseline_mean  baseline_std  speedup_% 
1            1024         0.203        0.014        0.224          0.006         0.090     
4            1024         0.343        0.014        0.439          0.020         0.218     
8            1024         0.553        0.004        0.666          0.012         0.169     
16           1024         0.756        0.003        0.874          0.006         0.135     
32           1024         1.506        0.004        1.680          0.009         0.104     
64           1024         2.956        0.004        3.250          0.009         0.090     
128          1024         5.401        0.004        5.845          0.007         0.076     
256          1024         10.445       0.004        11.319         0.011         0.077     
512          1024         20.914       0.019        22.619         0.009         0.075     
1024         1024         42.225       0.006        46.085         0.009         0.084     
2048         1024         83.056       0.009        90.314         0.015         0.080     
1            2048         0.496        0.004        0.691          0.010         0.282     
4            2048         0.792        0.004        1.076          0.009         0.263     
8            2048         1.641        0.004        2.100          0.010         0.219     
16           2048         3.784        0.007        4.829          0.010         0.216     
32           2048         5.999        0.006        7.486          0.019         0.199     
64           2048         11.642       0.006        14.629         0.019         0.204     
128          2048         17.980       0.004        22.220         0.018         0.191     
256          2048         36.190       0.004        44.846         0.010         0.193     
512          2048         68.448       0.006        84.066         0.012         0.186     
1024         2048         136.711      0.014        168.460        0.019         0.188     
1            4096         1.684        0.005        2.415          0.010         0.303     
4            4096         4.180        0.007        5.845          0.012         0.285     
8            4096         4.792        0.006        6.518          0.013         0.265     
16           4096         7.157        0.008        9.625          0.014         0.256     
32           4096         15.818       0.005        21.162         0.011         0.253     
64           4096         29.672       0.031        39.760         0.009         0.254     
128          4096         62.813       0.005        83.832         0.025         0.251     
256          4096         125.496      0.011        167.590        0.019         0.251     
512          4096         251.148      0.015        334.905        0.018         0.250     
1            8192         6.236        0.005        8.951          0.011         0.303     
4            8192         13.785       0.018        19.453         0.012         0.291     
8            8192         19.379       0.018        27.153         0.015         0.286     
16           8192         25.352       0.023        35.415         0.018         0.284     
32           8192         61.165       0.004        86.269         0.015         0.291     
64           8192         112.484      0.008        157.491        0.015         0.286     
128          8192         222.514      0.014        311.309        0.020         0.285     
256          8192         480.541      0.518        674.053        0.027         0.287     
1            16384        23.868       0.006        34.811         0.016         0.314     
4            16384        36.067       0.005        52.410         0.018         0.312     
8            16384        55.148       0.005        80.176         0.013         0.312     
16           16384        72.556       0.007        104.160        0.015         0.303     
32           16384        173.144      0.010        248.913        0.019         0.304     
64           16384        399.187      0.015        573.269        0.045         0.304     
128          16384        856.077      0.020        1231.988       0.071         0.305     
1            32768        93.505       0.010        137.556        0.023         0.320     
4            32768        182.645      0.010        267.334        0.019         0.317     
8            32768        268.389      0.013        391.347        0.022         0.314     
16           32768        459.644      0.024        672.214        0.028         0.316     
32           32768        949.388      0.029        1388.101       0.027         0.316     
64           32768        1734.357     0.036        2536.578       0.043         0.316     
1            65536        369.469      0.017        547.286        0.084         0.325     
4            65536        687.046      0.028        1016.018       0.062         0.324     
8            65536        1130.251     0.037        1667.711       0.044         0.322     
16           65536        2168.585     0.049        3200.411       0.073         0.322     
32           65536        4019.123     0.102        5928.143       0.086         0.322     
1            131072       1469.114     0.080        2183.480       0.148         0.327     
4            131072       2717.344     0.087        4031.260       0.122         0.326     
8            131072       4741.420     0.868        7026.723       0.830         0.325     
16           131072       9191.084     2.467        13628.266      3.945         0.326     

Running benchmark for q_dtype = torch.float8_e4m3fn, kv_cache_dtype: torch.float8_e4m3fn, output_dtype: torch.uint8
batch_size   max_seq_len  trtllm_mean  trtllm_std   baseline_mean  baseline_std  speedup_% 
1            1024         0.168        0.005        0.224          0.006         0.249     
4            1024         0.307        0.005        0.385          0.009         0.203     
8            1024         0.560        0.006        0.647          0.008         0.134     
16           1024         0.763        0.005        0.872          0.008         0.125     
32           1024         1.512        0.005        1.683          0.012         0.102     
64           1024         2.961        0.006        3.249          0.006         0.089     
128          1024         5.406        0.006        5.842          0.008         0.075     
256          1024         10.449       0.006        11.316         0.009         0.077     
512          1024         20.895       0.006        22.620         0.008         0.076     
1024         1024         42.153       0.006        46.088         0.010         0.085     
2048         1024         82.905       0.009        90.316         0.012         0.082     
1            2048         0.503        0.006        0.687          0.011         0.268     
4            2048         0.798        0.006        1.077          0.011         0.259     
8            2048         1.647        0.005        2.102          0.008         0.216     
16           2048         3.784        0.005        4.829          0.010         0.216     
32           2048         6.004        0.005        7.478          0.016         0.197     
64           2048         11.650       0.006        14.624         0.020         0.203     
128          2048         17.988       0.005        22.226         0.017         0.191     
256          2048         36.198       0.006        44.848         0.007         0.193     
512          2048         68.456       0.033        84.063         0.010         0.186     
1024         2048         136.732      0.011        168.462        0.017         0.188     
1            4096         1.689        0.006        2.415          0.008         0.301     
4            4096         4.183        0.006        5.837          0.009         0.283     
8            4096         4.803        0.009        6.520          0.013         0.263     
16           4096         7.164        0.006        9.610          0.012         0.255     
32           4096         15.832       0.008        21.160         0.016         0.252     
64           4096         29.673       0.006        39.758         0.015         0.254     
128          4096         62.841       0.006        83.833         0.016         0.250     
256          4096         125.547      0.008        167.594        0.015         0.251     
512          4096         251.252      0.094        334.901        0.017         0.250     
1            8192         6.226        0.005        8.942          0.015         0.304     
4            8192         13.784       0.005        19.450         0.016         0.291     
8            8192         19.345       0.005        27.160         0.013         0.288     
16           8192         25.368       0.026        35.416         0.012         0.284     
32           8192         61.197       0.008        86.266         0.013         0.291     
64           8192         112.525      0.006        157.496        0.015         0.286     
128          8192         222.802      0.612        311.322        0.022         0.284     
256          8192         480.518      0.066        674.062        0.023         0.287     
1            16384        23.878       0.007        34.817         0.014         0.314     
4            16384        36.085       0.007        52.403         0.010         0.311     
8            16384        55.174       0.008        80.175         0.015         0.312     
16           16384        72.583       0.009        104.163        0.016         0.303     
32           16384        173.203      0.010        248.918        0.022         0.304     
64           16384        399.331      0.013        573.256        0.014         0.303     
128          16384        856.358      0.031        1231.987       0.020         0.305     
1            32768        93.533       0.007        137.565        0.024         0.320     
4            32768        182.690      0.009        267.349        0.019         0.317     
8            32768        268.479      0.013        391.353        0.019         0.314     
16           32768        459.813      0.014        672.232        0.025         0.316     
32           32768        949.696      0.025        1388.138       0.069         0.316     
64           32768        1734.845     0.026        2536.532       0.066         0.316     
1            65536        369.572      0.018        547.255        0.040         0.325     
4            65536        687.206      0.033        1016.012       0.030         0.324     
8            65536        1130.639     0.071        1667.740       0.049         0.322     
16           65536        2169.175     0.076        3200.392       0.100         0.322     
32           65536        4020.166     0.118        5928.197       0.111         0.322     
1            131072       1469.347     0.077        2183.452       0.105         0.327     
4            131072       2717.722     0.095        4031.260       0.120         0.326     
8            131072       4742.195     0.100        7026.247       0.076         0.325     
16           131072       9192.495     0.339        13626.595      0.174         0.325     
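Note on reading the tables: despite the `_%` suffix, the `speedup_%` column matches the fractional reduction in mean latency, `1 - trtllm_mean / baseline_mean`, rather than a percentage or a ratio. This is a reading inferred from the numbers themselves, not confirmed against the benchmark script; the helper name below is hypothetical:

```python
# Interpretation of the speedup_% column (assumed, not taken from the
# benchmark script): fractional latency reduction of TRTLLM vs. baseline.
def speedup_fraction(trtllm_mean_ms: float, baseline_mean_ms: float) -> float:
    """1 - trtllm/baseline; positive means TRTLLM is faster."""
    return 1.0 - trtllm_mean_ms / baseline_mean_ms

# Spot-check against the bfloat16 row batch_size=8, max_seq_len=8192:
# trtllm_mean=8.759 ms, baseline_mean=27.158 ms, table speedup_%=0.677
print(round(speedup_fraction(8.759, 27.158), 3))  # -> 0.677
```

So a value of 0.677 means TRTLLM takes about one third of the baseline time, and every positive value in the tables means TRTLLM prefill was faster in that configuration.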
