BF16 Trellis implementation #484
Conversation
With that, we get PP-512 = 234 t/s, so prompt processing is now in the low range of row-interleaved quants.
With that, we get PP-512 = 233 t/s.
With that, we get PP-512 = 240 t/s.
Just did some a/b testing with llama-sweep-bench on my home rig using that new Qwen3-8B dense model distillation of R1-0528.
👈 Details and Logs

Test Quants

llama-sweep-bench

Full GPU Offload

$ git checkout main
$ git rev-parse --short HEAD
7a8abe29
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
#model=/mnt/astrodata/llm/models/ubergarm/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-IQ3_K.gguf
model=/mnt/astrodata/llm/models/ubergarm/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-IQ3_KT.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-c 32768 \
-ngl 99 \
--threads 1 \
--warmup-batch

CPU Only

# main test case
$ git checkout main
$ git rev-parse --short HEAD
7a8abe29
# PR484 ik/trellis_bf16 test case
$ git checkout ik/trellis_bf16
$ git rev-parse --short HEAD
061d064b
cmake -B build -DGGML_CUDA=OFF -DGGML_BLAS=OFF
cmake --build build --config Release -j $(nproc)
# with and without -rtr test cases
./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-c 8704 \
--threads 16 \
--warmup-batch

Full Crash Logs

EDIT: without rebooting, it ran clean twice, then the third time it blew up again with:

CPU Only 7965WX

It took just under 8 hours to slow cook.

./build/bin/llama-sweep-bench \
--model ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ2_KT.gguf \
--ctx-size 4608 \
-mla 3 -fa \
-amb 512 \
-fmoe \
--threads 24 \
--warmup-batch \
--no-mmap
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q5_0: 61 tensors attn_k_b (crashes if u try to quantize to iq4_kt)
llama_model_loader: - type iq2_kt: 116 tensors ffn_(up|gate)_exps
llama_model_loader: - type iq3_kt: 58 tensors ffn_down_exps
llama_model_loader: - type iq4_kt: 551 tensors attn/shexp/token_embd

Happy to try out anything to reproduce and hope it isn't a Heisenbug... Also, I was considering cooking a hybrid iq4_kt attn/shexp with iq3_k/iq2_k down/(up|gate) R1-0528, but with this speed-up to CPU inferencing I'll go all in with iq3_kt/iq2_kt down/(gate|up) just to see what happens. Gonna take a while to cook though! Thanks!
I'm fairly certain that means there is a NaN somewhere in the calculations.
Thanks for testing. Yes, this assert is always associated with a NaN somewhere else. I ran into NaNs with the

Looking at the low GPU TG performance, my guess is that you need to explicitly enable
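One way to narrow down where a NaN first appears is to scan intermediate buffers after each suspect operation. A minimal sketch of such a check (hypothetical helper, not part of ik_llama.cpp):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Hypothetical debug helper: report the first non-finite value in a buffer.
// Calling this on activations after suspect ops helps locate where NaNs start.
static bool check_finite(const float * data, int64_t n, const char * tag) {
    for (int64_t i = 0; i < n; ++i) {
        if (!std::isfinite(data[i])) {
            std::fprintf(stderr, "%s: non-finite value at index %lld\n", tag, (long long) i);
            return false;
        }
    }
    return true;
}
```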
I didn't run into that assert in limited testing of mixes of iqN_kt with DeepSeek-R1-0528 on two remote systems, fwiw. This PR did speed up CPU-only compiled inferencing, but I couldn't test CUDA offload as described. Accidentally updated my above comment before realizing you'd already commented. It's past my bed time, hah.
That did the trick for the

Thanks!
I hadn't tested this PR with a DeepSeek model. Testing now, I see DeepSeek-Lite breaks with
Something goes wrong on CUDA too with DeepSeek-Lite. So, it seems, trellis quants are not quite ready for prime time yet.
Closing in favor of #529




This PR adds a `bf16` CPU implementation for the trellis quants `IQ2_KT`, `IQ3_KT` and `IQ4_KT` for CPUs with native `bf16` support. We get massive gains in prompt processing speeds, and a ~5-10% gain in TG performance. On my Ryzen-7950X CPU that supports `bf16`, all 3 types now have PP-512 in the range of 230-240 t/s for 8B LLaMA-3. This makes them comparable to row-interleaved quants (where PP-512 performance on this CPU is in the 240-300 t/s range).

TG-128 performance for 8B LLaMA-3 on the Ryzen-7950X changes as follows

PP-512 performance for 8B LLaMA-3 on the Ryzen-7950X changes as follows

A similar optimization can be done for CPUs with native `fp16` support, but as I don't have access to one of those, this is not implemented for now.