

@DajanaV DajanaV commented Nov 6, 2025

Mirrored from ggml-org/llama.cpp#17051

Extended MMF_ROWS_PER_BLOCK in mmf to support values larger than warp_size; the default stays at the old value since I have not done the performance tuning.

Tested with MMF_ROWS_PER_BLOCK = 64 on my 3080; there is not enough shared memory when MMF_ROWS_PER_BLOCK = 128.

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of PR #104 reveals a context mismatch between the identified performance changes and actual code modifications. The performance analysis flagged llama_supports_rpc with the highest Response Time change (+0.08%), but this function remains unmodified. The PR exclusively targets CUDA matrix multiplication optimizations in ggml/src/ggml-cuda/mmf.cuh.

Key Findings

Performance Metrics:

  • Highest Response Time Change: llama_supports_rpc (+0.08%, +0.08 ns absolute)
  • Highest Throughput Change: llama_supports_rpc (+0.11%, +0.02 ns absolute)
  • Core Function Impact: None - no changes to inference-critical functions (llama_decode, llama_encode, llama_tokenize)

Tokens Per Second Impact:
No impact on inference performance. The observed changes in llama_supports_rpc (a capability check function) do not affect tokenization or inference pathways that would influence tokens per second throughput.

Power Consumption Analysis:
Negligible changes across all binaries:

  • build.bin.libllama.so: -0.0003% (-0.81 nJ)
  • build.bin.llama-run: -0.0002% (-0.58 nJ)
  • build.bin.llama-tts: -0.0001% (-0.39 nJ)
  • build.bin.llama-cvector-generator: +0.0003% (+0.95 nJ)

Technical Analysis:

  • Flame Graph: Shows simple two-level execution (29 ns total, 76% self-time, 24% PLT call overhead)
  • CFG Comparison: Identical control flow and assembly code between versions
  • Root Cause: Performance variance attributed to compiler optimization differences rather than functional changes

Code Review Insights:
PR #104 implements well-structured CUDA kernel optimizations extending MMF_ROWS_PER_BLOCK beyond warp size limitations. The changes enable processing 64 rows per thread block (vs 32) with proper bounds checking and static assertions.

Conclusion:
The observed performance changes represent measurement variance within normal compiler optimization bounds. No actionable performance regressions identified. The CUDA optimizations should improve matrix multiplication performance on capable hardware without affecting CPU inference paths.

@DajanaV DajanaV force-pushed the main branch 24 times, most recently from 6d2349e to 9248736 Compare November 10, 2025 11:08
@DajanaV DajanaV force-pushed the main branch 30 times, most recently from e97d4a6 to 29827de Compare November 15, 2025 10:08
