sync : ggml #3428
Open
ggerganov wants to merge 179 commits into master from sync-ggml-25-09-20
Conversation
…ama/15385)

* Added VSX intrinsics for Power9+ systems
* Manual unrolling for minor perf improvement
* Update ggml/src/ggml-cpu/arch/powerpc/quants.c

---------

Signed-off-by: mgiessing <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
…lama/15413)

Signed-off-by: Xiaodong Ye <[email protected]>
* optimize rope ops
* amendment
* delete trailing whitespace
* change the variable name
* musa: fix build warnings
* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]

---------

Signed-off-by: Xiaodong Ye <[email protected]>
These detailed strings were causing increased build time on gcc.
Signed-off-by: Xiaodong Ye <[email protected]>
* vulkan: Reuse conversion results in prealloc_y

  Cache the pipeline and tensor that were most recently used to fill prealloc_y, and skip the conversion if the current pipeline/tensor match.

* don't use shared pointer for prealloc_y_last_pipeline_used
Co-authored-by: aeseulgi <[email protected]>
…pt processing (llama/15488)
* [CANN] Optimize RMS_NORM using cache
* fix typo
* fix review comment
* codestyle adjustment

---------

Signed-off-by: noemotiovon <[email protected]>
* ggml-cpu: initial q5_0 impl for s390x
* ggml-cpu: updated q5_0 code for better performance
* ggml-cpu: use optimised hsum for better performance
* ggml-cpu: introduce q5_1 simd + refactor q5_0
* ggml-cpu: fix incorrect return type vec_hsum
* ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation
* ggml-cpu: refactor q5_1
* ggml-cpu: q5_1 update loop unroll to 4
* ggml-cpu: update q5_0 unroll to 4
* ggml-cpu: update build-s390x docs
* ggml-cpu: update unused variables q5_0
* docs: update the last update date

---------

Signed-off-by: Aaron Teo <[email protected]>
* Add Pad Reflect 1D CUDA support
* Update ggml/src/ggml-cuda/pad_reflect_1d.cu

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* add conv3d
* bump GGML_OP_COUNT
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Work on templating for different types in shaders
* Work on shader type generation
* Working q4_0 mul_mat and some templating for different types
* Add q4_0_f16 matmul and fix device init
* Add matmul support for basic quantization types
* Add q2_k and q3_k quantization
* Add rest of k-quants
* Get first i-quant working
* Closer to supporting all i-quants
* Support rest of i-quants
* Cleanup code
* Fix python formatting
* debug
* Bugfix for memset
* Add padding to end of buffers on creation
* Simplify bit-shifting
* Update usage of StringView
…/15427)

- Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
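As an aside on the second point, here is a minimal sketch (hypothetical helper names, not the shader code from this commit) of why a power-of-two divisor is worth specializing: the division and modulo collapse into a shift and a mask.

```c
#include <stdint.h>

// A power of two d has exactly one bit set, so n / d == n >> log2(d)
// and n % d == n & (d - 1). The shift/mask forms cost a single cheap
// instruction, whereas general integer division is one of the slowest
// ALU operations on both CPUs and GPUs.
static inline int is_pow2(uint32_t d) {
    return d != 0 && (d & (d - 1)) == 0;
}

static inline uint32_t div_pow2(uint32_t n, uint32_t d) {
    return n >> __builtin_ctz(d); // __builtin_ctz(d) == log2(d) for powers of two
}

static inline uint32_t mod_pow2(uint32_t n, uint32_t d) {
    return n & (d - 1);
}
```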
* vulkan : support ggml_mean
* vulkan : support sum, sum_rows and mean with non-contiguous tensors
* vulkan : fix subbuffer size not accounting for misalign offset
* tests : add backend-op tests for non-contiguous sum_rows
* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support
* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
…llama/15489)

Track a list of nodes that need synchronization, and only sync if the new node depends on them (or overwrites them). This allows some overlap which can improve performance, and centralizes a big chunk of the synchronization logic.

The remaining synchronization logic involves writes to memory other than the nodes, e.g. for dequantization or split_k. Each of these allocations has a bool indicating whether they were in use and need to be synced. This should be checked before they are written to, and set to true after they are done being consumed.
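A hedged sketch of the tracking idea in C (all names here are hypothetical, not the actual ggml-vulkan structures): keep the set of nodes with unsynced writes, and only emit a barrier when a hazard is detected.

```c
// Track nodes whose writes have not yet been synchronized; a barrier is
// emitted only when a new node reads from (or overwrites) one of them.
typedef struct {
    const void * pending[64]; // nodes with unsynced writes
    int          n_pending;
} sync_tracker;

static void tracker_flush(sync_tracker * t) {
    // emit the actual pipeline barrier here, then forget the pending writes
    t->n_pending = 0;
}

static int tracker_has_hazard(const sync_tracker * t,
                              const void * const * reads, int n_reads,
                              const void * write) {
    for (int i = 0; i < t->n_pending; i++) {
        if (t->pending[i] == write) {
            return 1; // write-after-write hazard
        }
        for (int j = 0; j < n_reads; j++) {
            if (t->pending[i] == reads[j]) {
                return 1; // read-after-write hazard
            }
        }
    }
    return 0;
}

static void tracker_on_node(sync_tracker * t,
                            const void * const * reads, int n_reads,
                            const void * write) {
    if (tracker_has_hazard(t, reads, n_reads, write)) {
        tracker_flush(t); // sync only when the node depends on pending work
    }
    if (t->n_pending < 64) {
        t->pending[t->n_pending++] = write; // this node's output is now pending
    }
}
```

Independent nodes fall through without a barrier, which is where the overlap (and the speedup) comes from.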
…le SMs (llama/15281)

* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs

  There are really two parts to this change:

  (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations.

  (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups; it just becomes a simple per-element multiply.

  The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums.

* Change add+rms_norm optimization to write out an array of partial sums rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up.
* complete rebase against fused adds - multi_add shader can also compute partial sums
* fix validation errors
* disable add_rms_fusion for Intel due to possible driver bug
* resolve against #15489, sync after clearing partial sums
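The reason the fused add can hand rms_norm a precomputed sum is that the whole normalizer is a single scalar per row. A sketch of the standard formulation (epsilon and the exact reduction layout follow ggml's rms_norm):

```latex
% Each output element is the input scaled by one shared factor, so once
% \sum_j x_j^2 is known, rms_norm reduces to a per-element multiply.
\[
y_i = \frac{x_i}{\sqrt{\tfrac{1}{n}\sum_{j=1}^{n} x_j^2 + \varepsilon}},
\qquad
\sum_{j=1}^{n} x_j^2 = \sum_{k=1}^{W} s_k,
\quad s_k = \sum_{j \in \text{workgroup}\,k} x_j^2 .
\]
```

Writing the per-workgroup partial sums s_k to an array (instead of atomically accumulating) keeps the reduction order fixed, which is what makes the result deterministic.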
* vulkan: workaround MoltenVK compile failure in multi_add
* Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp

Co-authored-by: 0cc4m <[email protected]>
The scalar FA shader already handled multiples of 8. The coopmat1 FA shader assumed 16x16x16 and the shared memory allocations need the HSK dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation requires multiples of 16 for N and K, and needs the matrix dimensions padded and loads clamped. Store the FA pipelines in a map, indexed by the pipeline state.
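The padding arithmetic itself is the usual round-up-to-tile bit trick; a minimal sketch with assumed names:

```c
#include <stdint.h>

// Round x up to the next multiple of a power-of-two tile size (e.g. 16).
// Coopmat paths working in 16x16x16 tiles need HSK/N/K padded this way,
// with out-of-range loads clamped (or the padding zero-filled).
static inline uint32_t round_up_pow2(uint32_t x, uint32_t tile) {
    return (x + tile - 1) & ~(tile - 1);
}
// round_up_pow2(72, 16) == 80
```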
… (llama/15524)

* vulkan: use subgroup function for mul_mat_id shader even without coopmat
* vulkan: fix compile warnings
* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id
* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
* SYCL: Add COUNT_EQUAL operator support (rebased on master)
* SYCL: remove duplicate op_count_equal definition
* tests: remove test_count_equal_typed and use test_count_equal for all cases
* tests: keep only I32 case for COUNT_EQUAL as suggested
* tests: keep only I32 case for COUNT_EQUAL as requested
…5956)

* fix im2col_3d to respect non-contiguous inputs (views)

  The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides. This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.

* use ggml_element_size() for src strides

---------

Co-authored-by: Johannes Gäßler <[email protected]>
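For illustration, the stride-based addressing the fix switches to looks roughly like this (hypothetical helper; in ggml, nb[] holds per-dimension byte strides with nb[0] being the element size, so views and permutations are handled transparently):

```c
#include <stddef.h>
#include <stdint.h>

// Address of element (i0, i1, i2, i3) of a possibly non-contiguous tensor.
// Using the recorded strides nb[] is correct for views; multiplying out
// the dimension sizes ne[] (the buggy path) only works when the tensor
// is compact in memory.
static inline const float * tensor_elem(const char * data, const size_t nb[4],
                                        int64_t i0, int64_t i1,
                                        int64_t i2, int64_t i3) {
    return (const float *)(data + i0*nb[0] + i1*nb[1] + i2*nb[2] + i3*nb[3]);
}
```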
* ggml : remove adding extra dim timestep embedding

  This commit updates the ggml_timestep_embedding function to no longer add an extra dimension when the specified dimension is odd. The extra dimension was unnecessary, and the kernels did not expect it, which resulted in uninitialized memory in the second-to-last dimension.

* ggml-cuda : fix padding in timestep embedding kernel

  This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension.

* ggml-metal : fix padding in timestep embedding kernel
* ggml-opencl : fix padding in timestep embedding kernel
* ggml-sycl : fix padding in timestep embedding kernel
* ggml-vulkan : fix padding in timestep embedding kernel

  These commits fix the zero padding for odd dimensions in the respective timestep embedding kernels.

* ggml-cpu : fix padding in timestep embedding function

  This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension.
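For reference, a sketch of the sinusoidal timestep embedding these kernels compute (the usual diffusion-model formulation; exact frequency conventions in ggml may differ in minor details). With d = floor(dim / 2), the output is cosines followed by sines, and an odd dim leaves one trailing element that must be explicitly zeroed — the padding all of these fixes are about:

```latex
% d = floor(dim / 2), t = timestep, M = max_period (typically 10000)
\[
f_i = \exp\!\Big(\!-\frac{i \,\ln M}{d}\Big), \qquad
e(t) = \big[\cos(t f_0), \dots, \cos(t f_{d-1}),\ \sin(t f_0), \dots, \sin(t f_{d-1})\big],
\]
\[
\text{and if } \mathrm{dim} \text{ is odd: } e(t)_{\mathrm{dim}-1} = 0 .
\]
```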
* CANN: Fix ggml_cann_set_device to avoid redundant device switches

  - Added a check to skip aclrtSetDevice if the current device is already set.
  - Prevents unnecessary context switches while keeping thread/device consistency.

* CANN: add device default id
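The check itself is a small early-out; a hedged sketch (aclrtSetDevice is the CANN runtime call; the thread-local cache here is illustrative, not the actual ggml-cann code):

```c
#include <acl/acl.h>  // CANN runtime, provides aclrtSetDevice

// Cache the device bound on this thread; -1 means "not yet known".
static __thread int32_t g_current_device = -1;

static void cann_set_device(int32_t device) {
    if (g_current_device == device) {
        return; // already bound on this thread: skip the context switch
    }
    // illustrative: real code should check the returned aclError
    aclrtSetDevice(device);
    g_current_device = device;
}
```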
* remove unsupported vulkan devices
* make this happen during selection instead
* pass by reference
…a/16018)

* Add parameter buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant"

  This reverts commit 77f8b96515f7e640ae4b0e44f066321fbc4a6166.

* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests

---------

Co-authored-by: Neha Abbas <[email protected]>
Signed-off-by: noemotiovon <[email protected]>
* metal : improve F32, F16 and BF16 mat-vec multiplication

  ggml-ci

* metal : make the NSG a function constant in mul_mv kernels

  ggml-ci
* metal : use function constants for mul_mv_ext kernels

  ggml-ci

* metal : remove NW template argument

  ggml-ci

* metal : adjust constants

  ggml-ci
* CUDA: Optimize PAD_REFLECT_1D

  feat: add more test cases for PAD_REFLECT_1D

* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
* Apply suggestion from JohannesGaessler
* optimize
* use a concise expression to further speedup the cuda kernel

---------

Co-authored-by: Johannes Gäßler <[email protected]>
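fast_div-style helpers replace a per-thread division by a runtime-constant divisor with a multiply and shift. A hedged host-side sketch of the classic magic-number scheme (the Granlund–Montgomery round-up variant; the kernel's actual helper may differ):

```c
#include <stdint.h>

typedef struct { uint32_t mp; uint32_t l; } fastdiv_t;

// Precompute once for divisor d (1 <= d < 2^31):
// l = ceil(log2(d)), mp = floor(2^32 * (2^l - d) / d) + 1, after which
// n / d == (umulhi(n, mp) + n) >> l holds for every 32-bit n.
static fastdiv_t fastdiv_init(uint32_t d) {
    uint32_t l = 0;
    while ((1u << l) < d) {
        l++;
    }
    fastdiv_t fd;
    fd.mp = (uint32_t)((((uint64_t)1 << 32) * (((uint64_t)1 << l) - d)) / d + 1);
    fd.l  = l;
    return fd;
}

static inline uint32_t fastdiv(uint32_t n, fastdiv_t fd) {
    uint64_t hi = ((uint64_t)n * fd.mp) >> 32; // high half of n * mp (umulhi)
    return (uint32_t)((hi + n) >> fd.l);       // 64-bit add avoids overflow
}
// fastdiv(9, fastdiv_init(3)) == 3
```

In a CUDA kernel the same shape would use `__umulhi`; the win is that a multiply plus shift pipelines far better than the hardware integer divide.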
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations

---------

Co-authored-by: Li He <[email protected]>
When compiling with GGML_STATIC=ON, the build process would produce a binary that was still dynamically linked to OpenMP. This defeats the purpose of a static build:

    $ cmake -B build \
        -DBUILD_SHARED_LIBS=OFF \
        -DLLAMA_CURL=OFF \
        -DGGML_CCACHE=OFF \
        -DGGML_NATIVE=OFF \
        -DGGML_STATIC=ON

    $ ldd llama-server
        linux-vdso.so.1 (0x0000e1a434e3b000)
        libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
        libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
        /lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)

This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES` to prioritize `.a` files, forcing CMake to link the static version of the library.

Signed-off-by: Adrien Gallouët <[email protected]>
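A sketch of the CMake idiom the commit describes (the exact placement within ggml's build scripts is an assumption here):

```cmake
# When a fully static build is requested, search for .a archives first so
# that find_package(OpenMP) resolves libgomp.a instead of libgomp.so.
if (GGML_STATIC)
    list(PREPEND CMAKE_FIND_LIBRARY_SUFFIXES ".a")
endif()
```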
Generalize Linux check to `__linux__` to support non-glibc systems (like musl). Also, return `false` on unknown/untested OS. Without this commit, the code compiles (with warnings) but fails:

    register_backend: registered backend CPU (1 devices)
    register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
    build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
    system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
    ....
    print_info: n_ctx_orig_yarn = 262144
    print_info: rope_finetuned = unknown
    print_info: model type = 4B
    Illegal instruction (core dumped)

Signed-off-by: Adrien Gallouët <[email protected]>
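A minimal sketch of the guard shape being described (illustrative function, not the actual ggml-cpu code): the compiler defines `__linux__` for every Linux target regardless of libc, whereas glibc-specific macros miss musl.

```c
#include <stdbool.h>

// Gate a Linux-only code path (e.g. reading a procfs/sysfs feature).
static bool os_feature_supported(void) {
#if defined(__linux__)
    return true;   // any Linux libc: glibc, musl, ...
#elif defined(__APPLE__) || defined(_WIN32)
    return false;  // known platforms where this path does not apply
#else
    return false;  // unknown/untested OS: fail safe rather than crash
#endif
}
```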
* ggml : refactor forward_dup for cpu backend
* clean up a bit
* add quant/dequant perf test
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
* use fma instead of dot to fix Nvidia and Apple performance issues
d94f214 to 36778bd

danbev approved these changes Sep 21, 2025
Huge sync incoming.
It would be a miracle if everything works correctly after this 🤞