Skip to content

Conversation

@DajanaV
Copy link
Contributor

@DajanaV DajanaV commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17077

Add RDNA4 tensor core support for MMF, honestly the performance is lower than expectation. The model is at https://huggingface.co/Mungert/DeepSeek-R1-0528-Qwen3-8B-GGUF

Model Microbatch size Test t/s master t/s 672492fc Speedup
qwen3 8B Q8_0 1 pp512 46.48 54.61 1.18
qwen3 8B Q8_0 2 pp512 89.96 85.92 0.96
qwen3 8B Q8_0 3 pp512 132.92 126.23 0.95
qwen3 8B Q8_0 4 pp512 176.06 166.12 0.94
qwen3 8B Q8_0 5 pp512 212.00 197.77 0.93
qwen3 8B Q8_0 6 pp512 252.54 233.83 0.93
qwen3 8B Q8_0 7 pp512 289.87 266.58 0.92
qwen3 8B Q8_0 8 pp512 318.56 290.63 0.91
qwen3 8B Q8_0 9 pp512 344.41 314.93 0.91
qwen3 8B Q8_0 10 pp512 377.97 342.75 0.91
qwen3 8B Q8_0 11 pp512 416.42 373.85 0.90
qwen3 8B Q8_0 12 pp512 447.61 398.83 0.89
qwen3 8B Q8_0 13 pp512 486.83 429.74 0.88
qwen3 8B Q8_0 14 pp512 525.24 458.88 0.87
qwen3 8B Q8_0 15 pp512 555.91 482.08 0.87
qwen3 8B Q8_0 16 pp512 580.07 512.47 0.88

@loci-agentic-ai
Copy link

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of project_id 2621b8c0-b5ce-11f0-b333-453f42058aa1 comparing version 2805f4ce-7f2f-4355-ab87-b572e76e81a6 against baseline 0797ab8c-9bfc-4911-8c5b-22da73432e86 reveals minimal performance variations with no impact on core inference functions.

Key Findings

Performance Metrics:

  • Highest Response Time change: _ZNSt7__cxx1112regex_traitsIcE10_RegexMaskC1Eth in build.bin.llama-run with -0.08% improvement (0.018 ns)
  • Highest Throughput degradation: _ZNSt14_Optional_baseIN22common_chat_msg_parser17find_regex_resultELb0ELb0EEC1IJS1_ELb0EEESt10in_place_tDpOT_ in build.bin.llama-tts with +0.17% increase (0.040 ns)

Core Function Impact:
No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize). The modified functions are C++ standard library components unrelated to LLM inference pipelines, therefore no impact on tokens per second performance.

Power Consumption Analysis:
Minimal power consumption changes across all binaries (≤0.001%). Largest change in build.bin.libllama.so with -0.0003% reduction (-0.91 nJ). Changes fall within measurement noise levels, indicating stable energy characteristics.

Flame Graph and CFG Analysis:
The _ZNSt7__cxx1112regex_traitsIcE10_RegexMaskC1Eth function shows identical assembly code between versions with a flat execution profile (single 22 ns stack frame). The 0.01 ns timing difference stems from micro-architectural variations rather than code-level optimizations, confirming the improvement is within statistical noise.

GitHub Code Review:
PR #118 introduces RDNA4 tensor core support for AMD GPUs. The performance changes in standard library functions are indirect effects of compilation changes from new template instantiations and conditional compilation paths. No regressions identified in the RDNA4 implementation.

Conclusion:
The analysis reveals stable performance with negligible variations in non-critical functions. Core inference capabilities remain unaffected, with no actionable performance optimizations required for the current changes.

@DajanaV DajanaV force-pushed the main branch 12 times, most recently from 6b50572 to 733e776 Compare November 8, 2025 21:07
@DajanaV DajanaV force-pushed the main branch 10 times, most recently from 6d2349e to 9248736 Compare November 10, 2025 11:08
@DajanaV DajanaV force-pushed the main branch 8 times, most recently from 1a27925 to 98e1e20 Compare November 10, 2025 23:08
@DajanaV DajanaV force-pushed the main branch 8 times, most recently from 20900e4 to 2e1e7c4 Compare November 12, 2025 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants