UPSTREAM PR #17077: HIP: RDNA4 tensor core support for MMF #118

DajanaV · 2025-11-07T13:39:19Z

Add RDNA4 tensor core support for MMF, honestly the performance is lower than expectation. The model is at https://huggingface.co/Mungert/DeepSeek-R1-0528-Qwen3-8B-GGUF

Model	Microbatch size	Test	t/s master	t/s 672492fc	Speedup
qwen3 8B Q8_0	1	pp512	46.48	54.61	1.18
qwen3 8B Q8_0	2	pp512	89.96	85.92	0.96
qwen3 8B Q8_0	3	pp512	132.92	126.23	0.95
qwen3 8B Q8_0	4	pp512	176.06	166.12	0.94
qwen3 8B Q8_0	5	pp512	212.00	197.77	0.93
qwen3 8B Q8_0	6	pp512	252.54	233.83	0.93
qwen3 8B Q8_0	7	pp512	289.87	266.58	0.92
qwen3 8B Q8_0	8	pp512	318.56	290.63	0.91
qwen3 8B Q8_0	9	pp512	344.41	314.93	0.91
qwen3 8B Q8_0	10	pp512	377.97	342.75	0.91
qwen3 8B Q8_0	11	pp512	416.42	373.85	0.90
qwen3 8B Q8_0	12	pp512	447.61	398.83	0.89
qwen3 8B Q8_0	13	pp512	486.83	429.74	0.88
qwen3 8B Q8_0	14	pp512	525.24	458.88	0.87
qwen3 8B Q8_0	15	pp512	555.91	482.08	0.87
qwen3 8B Q8_0	16	pp512	580.07	512.47	0.88

loci-agentic-ai · 2025-11-07T14:16:25Z

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of project_id 2621b8c0-b5ce-11f0-b333-453f42058aa1 comparing version 2805f4ce-7f2f-4355-ab87-b572e76e81a6 against baseline 0797ab8c-9bfc-4911-8c5b-22da73432e86 reveals minimal performance variations with no impact on core inference functions.

Key Findings

Performance Metrics:

Highest Response Time change: _ZNSt7__cxx1112regex_traitsIcE10_RegexMaskC1Eth in build.bin.llama-run with -0.08% improvement (0.018 ns)
Highest Throughput degradation: _ZNSt14_Optional_baseIN22common_chat_msg_parser17find_regex_resultELb0ELb0EEC1IJS1_ELb0EEESt10in_place_tDpOT_ in build.bin.llama-tts with +0.17% increase (0.040 ns)

Core Function Impact:
No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize). The modified functions are C++ standard library components unrelated to LLM inference pipelines, therefore no impact on tokens per second performance.

Power Consumption Analysis:
Minimal power consumption changes across all binaries (≤0.001%). Largest change in build.bin.libllama.so with -0.0003% reduction (-0.91 nJ). Changes fall within measurement noise levels, indicating stable energy characteristics.

Flame Graph and CFG Analysis:
The _ZNSt7__cxx1112regex_traitsIcE10_RegexMaskC1Eth function shows identical assembly code between versions with a flat execution profile (single 22 ns stack frame). The 0.01 ns timing difference stems from micro-architectural variations rather than code-level optimizations, confirming the improvement is within statistical noise.

GitHub Code Review:
PR #118 introduces RDNA4 tensor core support for AMD GPUs. The performance changes in standard library functions are indirect effects of compilation changes from new template instantiations and conditional compilation paths. No regressions identified in the RDNA4 implementation.

Conclusion:
The analysis reveals stable performance with negligible variations in non-critical functions. Core inference capabilities remain unaffected, with no actionable performance optimizations required for the current changes.

zhang hui added 2 commits November 7, 2025 21:22

mmf for rdna4

2f7cfcf

align the padding for rdna4

d564a35

DajanaV temporarily deployed to PROD__AL_DEMO November 7, 2025 13:39 — with GitHub Actions Inactive

DajanaV force-pushed the main branch 12 times, most recently from 6b50572 to 733e776 Compare November 8, 2025 21:07

Merge branch 'ggml-org:master' into mmf_wmma_rdna4

0ec241d

DajanaV had a problem deploying to PROD__AL_DEMO November 9, 2025 05:35 — with GitHub Actions Failure

DajanaV force-pushed the main branch from 733e776 to 2c7fec2 Compare November 9, 2025 07:08

forbit mul_mat_f for rdna4

bbee5fe

DajanaV force-pushed the main branch 10 times, most recently from 6d2349e to 9248736 Compare November 10, 2025 11:08

DajanaV force-pushed the main branch 8 times, most recently from 1a27925 to 98e1e20 Compare November 10, 2025 23:08

Merge branch 'ggml-org:master' into mmf_wmma_rdna4

6b8ceeb

DajanaV had a problem deploying to PROD__AL_DEMO November 11, 2025 06:43 — with GitHub Actions Failure

DajanaV force-pushed the main branch 8 times, most recently from 20900e4 to 2e1e7c4 Compare November 12, 2025 08:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UPSTREAM PR #17077: HIP: RDNA4 tensor core support for MMF #118

UPSTREAM PR #17077: HIP: RDNA4 tensor core support for MMF #118

Uh oh!

DajanaV commented Nov 7, 2025

Uh oh!

loci-agentic-ai bot commented Nov 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

UPSTREAM PR #17077: HIP: RDNA4 tensor core support for MMF #118

Are you sure you want to change the base?

UPSTREAM PR #17077: HIP: RDNA4 tensor core support for MMF #118

Uh oh!

Conversation

DajanaV commented Nov 7, 2025

Uh oh!

loci-agentic-ai bot commented Nov 7, 2025

Performance Analysis Summary

Overview

Key Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants