
Conversation

@DajanaV (Contributor) commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17069

(alternative to #17064, cc @ngxson)

This adds support for a few formats in the compressed-tensors quant method.

I've also re-tested plain fp8 with https://huggingface.co/Qwen/Qwen3-4B-FP8 to make sure I didn't break it.

I also found and fixed a problem in the lazy tensors: metadata changes were skipped for binary operators, so the broadcast shift used when unpacking did not end up with the correct final shape.
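For context, here is a minimal sketch of the kind of broadcast-shift unpacking involved, assuming 4-bit values packed into int32 words with per-group scales; the function name, packing layout, and the offset-by-8 signedness are illustrative assumptions, not the PR's exact code.

```python
# Minimal sketch: unpack 4-bit weights packed into int32 words with a broadcast
# shift, then dequantize with a per-group scale. The layout assumed here is
# 8 nibbles per int32, least-significant nibble first.
import numpy as np

def unpack_int4(packed: np.ndarray, scale: np.ndarray, group_size: int = 128) -> np.ndarray:
    # packed: (rows, cols // 8) int32; scale: (rows, cols // group_size) float
    shifts = np.arange(0, 32, 4, dtype=np.int32)           # one shift per nibble
    # Broadcast shift: (rows, cols // 8, 1) >> (8,) -> (rows, cols // 8, 8).
    # This is the step whose final shape the lazy-tensor metadata must track.
    nibbles = (packed[..., None] >> shifts) & 0xF
    q = nibbles.reshape(packed.shape[0], -1).astype(np.int8) - 8  # signed 4-bit range
    # Expand each group scale across its group and dequantize.
    s = np.repeat(scale, group_size, axis=-1)
    return q.astype(np.float32) * s
```

For a `(rows, cols // 8)` packed matrix this yields a `(rows, cols)` float matrix; whether nibbles are offset by 8 or stored in another signed encoding depends on the concrete compressed-tensors format and is assumed here.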


Make sure to read the contributing guidelines before submitting a PR

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 17eb8e97 compared to baseline 52cd5469 reveals minimal performance variations across the llama.cpp codebase. The changes are primarily related to Python conversion scripts for compressed-tensors quantization support, with no direct modifications to core C++ inference functions.

Key Findings

Performance Metrics:

  • Highest response-time change: std::vector<llm_bigram_spm>::pop_back() improved by 0.10% (-0.067 ns absolute; 67 ns → 67 ns after rounding)
  • Highest throughput change: std::make_unique<llm_graph_input_pos_bucket>() degraded by 0.12% (+0.122 ns absolute; 104 ns → 104 ns after rounding)

Core Function Impact:
No core inference functions (llama_decode, llama_encode, llama_tokenize) show measurable performance changes. The observed variations occur in STL utility functions used during tokenization preprocessing, not in the primary inference pipeline. Tokens per second performance remains unaffected as no critical path functions experienced meaningful response time or throughput changes.

Power Consumption Analysis:
All binaries show negligible power consumption changes (<0.001%):

  • libllama.so: -0.0009 nJ
  • llama-run: -0.0012 nJ
  • llama-cvector-generator: +0.0037 nJ
  • llama-tts: -0.0001 nJ

Energy efficiency remains stable across all components.

Flame Graph and CFG Analysis:
The pop_back() function exhibits a simple single-frame execution profile, and its assembly is identical between versions. The 0.067 ns improvement therefore represents measurement variance rather than an algorithmic change, with no structural differences in control flow.

GitHub Code Review Insights:
The PR extends compressed-tensors quantization support in the Python conversion scripts without affecting C++ runtime performance. Changes include new dequantization methods and lazy tensor operator fixes that improve model conversion robustness but don't impact inference execution.
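As a rough illustration of the lazy-tensor issue mentioned above, the sketch below shows why a lazy wrapper has to recompute the broadcast shape for binary operators instead of reusing the left operand's metadata; the LazyTensor class and its methods are hypothetical, not gguf-py's actual API.

```python
# Hypothetical lazy wrapper: records shape/dtype now, computes data later.
import numpy as np

class LazyTensor:
    def __init__(self, shape, dtype, fn):
        self.shape, self.dtype, self._fn = tuple(shape), dtype, fn

    def materialize(self) -> np.ndarray:
        return self._fn()

    def __rshift__(self, other: np.ndarray) -> "LazyTensor":
        # The bug class: reusing self.shape here would ignore broadcasting
        # against `other`; np.broadcast_shapes keeps the metadata correct.
        new_shape = np.broadcast_shapes(self.shape, other.shape)
        return LazyTensor(new_shape, self.dtype, lambda: self.materialize() >> other)

# A (4, 16, 1) >> (8,) shift must report shape (4, 16, 8) before materialization.
lazy = LazyTensor((4, 16, 1), np.int32, lambda: np.zeros((4, 16, 1), dtype=np.int32))
shifted = lazy >> np.arange(0, 32, 4, dtype=np.int32)
assert shifted.shape == (4, 16, 8)
assert shifted.materialize().shape == shifted.shape
```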

Conclusion:
The analysis reveals stable performance characteristics with variations within measurement noise. No actionable performance optimizations are required as the changes maintain inference efficiency while expanding quantization format support.

@DajanaV force-pushed the main branch 22 times, most recently from 0ad40ce to 0fa8f01 on November 10, 2025 at 09:10
@DajanaV force-pushed the main branch 18 times, most recently from 20900e4 to 2e1e7c4 on November 12, 2025 at 08:11