Layers for MNIST example by ypapadop-amd · Pull Request #177 · ypapadop-amd/ggml

ypapadop-amd · 2026-03-09T20:11:44Z

No description provided.

* CMake support for HSA backend * Stubs for initializing HSA * Stubs for HSA buffers * Stubs for HSA and host buffers * Using new backend CMake declaration * Additional stubs for HSA backend * Formatting * Adding function to track unimplemented APIs

* Identifying memory pools * Support for buffer type alignment and max size * Cache memory properties * Comments * Using fixed-width integrals * Buffer allocation support

* Adding HSA backend to examples/simple-backend * Adding HSA backend to backend registration * Adding Eigen as a temporary matrix mulmat implementation * Support for device type reporting * Support for free and total memory reporting * Properly reporting which kernels are supported

* More function implementations, cleanups * Remove redundant information, catch exception

* Correcting comment * Add description from agent name * Implementation of backend get_device_description * Comments * Adding cpu backend as fallback for all ops * Marking which functions can be improved and correct guid * Remove needs-implementation marker on more functions * Hide cpu backend internally * Remove extra header

* HSA backend in test-mul-mat * HSA backend in gpt-2-backend * Offloading to CPU backend if operation not implemented

* Creating HSA queue * Zero-init all members * Adding signal support

* Add HSA backend to GPT-2 example * Remove CPU backend from HSA * Returning that it is host buffer for NPU memory * Adding CPY kernels and factoring out kernel code * Formatting, comments * Temporary storage for cpy * Extracting supports_op conditions * Renaming function

* Add option for CPU fallback in CMakeLists * Adding fallback to CPU backend if operation is not supported

* Add operation example * Using tensor count variable * Count source tensors and copy name * Detect if execution failed * Switching test to int32_t * add kernel using XRT * Aligning example size with kernel * Adding dev heap pool * Using HSA in add kernel * Using relaxed write to queue * Remove XRT dependency * Size independent test * Correct elements for kernel * Moving load functions to common.h * Using simplified AIE packet * Moving loading to a kernel registry * Adding constructor * Add kernel * Refactoring add script * Single name for PDI and instr.txt * Refactoring * Generalizing add.py * Adding dims * Comments and error checking * Modularize python script * CMake kernel generation * Remove magic numbers and use GGML data type naming * Adding a structure for NPU kernels and free function * Accepting only contiguous tensors for now * Stub for keeping loaded kernel in context * Passing device info as parameter, renaming contexts for easy filtering * Renaming variables * Reworking example * Using registry of kernels * Using HSA agent name for kernels * Using dladdr to get the kernel directory

* Using static instead of anonymous namespace * Handling exceptions

* clang-format configuration loosely based on ggml-sycl * Formatting

* Comments, disabling copy/move when not allowed * Replacing high / low bits macros * Factoring out dispatch functionality * Free all finished packets

* Missing checks in example * Adding init_tensor support

* Vector add for floats * Handling higher dims upon load * Move tensor testing in the operation supports function * mul_mat kernel compilation * Smaller gemm * Renaming args * Smaller gemm * Copy instead of moving PDI * Unify cmake kernel generation functions * Using latest CMake Python integration * Missing CMake HSA integration * Install kernels * Adding missing dependencies * Updating test to use HSA conditionally. HSA-specific mul_mat test * Encoding all dims in kernel filename

* Using new compilation process in CMake * Loading insts as binary

* Renaming dispatch function * Track allocated memory for packets

* Avoid warnings * Adding extra data to HSA backend tensors * Caching kernel in tensor extra metadata

* Renaming pending data functions. Refactor packet dispatch * Guard all CPU fallback cases * Internal nodes do not init extra until after graph allocation * Assert cleanups * Comments * Separate CMake support

* Add expected find_package definitions * Expose both C and C++ Peano compilers * Remove unused property * Relocating kernels * Output kernels for a device in a directory * Explicit names for Peano compilation

* Python script fixes * Separating kernel discovery to its own header

* Remove conservative asserts * Removing cpy kernel. Delegating to the CPU device for supports_op * Extracting types in example * Create completely independent fallback graph * Correct source tensor iteration * Better messages * Caching emulated tensors

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- Format C++ code with clang-format - Format Python code with black Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR refactors the HSA backend AIE kernels to consolidate shared math utilities (scalar_exp, scalar_log, pow2, floor_log2, compute_alibi_slope, and vec_exp) into a single shared header (aie_kernel_math.h), and removes the alignment restriction for the cross-entropy loss kernel by replacing the strict "must be divisible by KERN_VEC_SIZE" check with scalar tail loops that handle the remaining elements.

Changes:

Shared math utility functions moved into aie_kernel_math.h with #pragma once guard, replacing per-file inline definitions and the old #ifndef _AIE_KERNEL_MATH_ guard.
Cross-entropy loss kernel updated to support non-aligned row sizes via scalar tail loops in all three passes (find max, accumulate sum_exp, accumulate loss).
Python orchestration for cross-entropy loss simplified: alignment validation and the COMPILE_GGML_OP_CROSS_ENTROPY_LOSS compile flag removed.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File	Description
`src/ggml-hsa/kernels/iron/aie_kernel_math.h`	Converts to `#pragma once`, adds `scalar_exp`/`scalar_log`, and consolidates `pow2`, `floor_log2`, `compute_alibi_slope` previously defined in `softmax.cc`
`src/ggml-hsa/kernels/iron/softmax.cc`	Removes the now-shared `pow2`, `floor_log2`, and `compute_alibi_slope` functions (moved to `aie_kernel_math.h`)
`src/ggml-hsa/kernels/iron/cross_entropy_loss.cc`	Removes `scalar_log` (moved to `aie_kernel_math.h`) and adds scalar tail loops to handle non-aligned sizes; removes the `#ifdef COMPILE_GGML_OP_CROSS_ENTROPY_LOSS` guard
`src/ggml-hsa/kernels/iron/cross_entropy_loss.py`	Removes alignment check and the `COMPILE_GGML_OP_CROSS_ENTROPY_LOSS` compile flag; `tile_size` is now simply `row_length`

You can also share your feedback on Copilot code review. Take the survey.

Copilot · 2026-03-09T20:17:30Z

+    constexpr int NUM_POW2_COEFFS = sizeof(pow2_coeffs) / sizeof(pow2_coeffs[0]);
+
+    // compute 2^f using Horner's method for Taylor series of exp(f * ln(2))
+    float exp_f = pow2_coeffs[0];
+
+#pragma unroll
+    for (int j = 1; j < NUM_POW2_COEFFS; ++j) {


In the pow2 function, NUM_POW2_COEFFS is declared as constexpr int (line 237) and the loop variable is int (line 243). The rest of this PR consistently uses int32_t for local integer variables (e.g., NUM_EXP_COEFFS on line 168 and loop variables in cross_entropy_loss.cc). These should be changed to int32_t for consistency with the surrounding new code.

Copilot · 2026-03-09T20:17:30Z

+    float m1 = pow2(-(max_bias / 2.0f) / n_head_log2);
+
+    float slope;
+    if (head_idx < n_head_log2) {


In compute_alibi_slope, line 304 compares head_idx (type int32_t) with n_head_log2 (type uint32_t), which is a signed/unsigned comparison. If head_idx could ever be negative (it's passed as a int32_t), the comparison would produce incorrect results due to implicit conversion. Consider casting head_idx to uint32_t before the comparison, or document that a non-negative value is required and add an assertion.

Copilot · 2026-03-09T20:17:30Z

 #include "aie_kernel_math.h"
 #include "ggml-aie.hpp"


In cross_entropy_loss.cc, #include "ggml-aie.hpp" (line 3) is now redundant because aie_kernel_math.h (line 4) already includes it via its own #include "ggml-aie.hpp". While harmless due to #pragma once, the redundant include could be removed to keep the include list clean.

Copilot · 2026-03-09T20:17:31Z

+    # Tile size equals row length; the kernel handles non-aligned sizes
+    # via scalar tail loops after vectorized processing.
+    tile_size = row_length


The PR title is "Layers for MNIST example", but the actual code changes are all in HSA kernel infrastructure files (src/ggml-hsa/kernels/iron/): refactoring shared math utilities into aie_kernel_math.h and enabling non-aligned row sizes in the cross-entropy loss kernel. There is no visible connection to the MNIST example in the diff. If the PR description is correct about the purpose, there may be missing changes (e.g., actual MNIST layer additions), or the PR title may be inaccurate.

Copilot · 2026-03-09T20:17:31Z

+    int32_t bits = reinterpret_cast<int32_t &>(x);
+    int32_t e_int = ((bits >> 23) & 0xFF) - 127;
+    float e = static_cast<float>(e_int);
+
+    int32_t m_bits = (bits & 0x007FFFFF) | 0x3F800000;
+    float m = reinterpret_cast<float &>(m_bits);


In scalar_log (lines 71, 76) and pow2 (line 252), type punning is performed using reinterpret_cast<T&>, which is undefined behavior in standard C++. By contrast, the newly added scalar_exp function in the same file correctly uses std::memcpy (line 50) for the equivalent operation. All three type-punning sites should use std::memcpy for consistency and correctness.

Copilot · 2026-03-09T20:17:31Z

+    // create the integer representation of 2^i
+    int32_t bits = (127 + i) << 23;
+    // cast the bits directly to float
+    float scale = reinterpret_cast<float &>(bits);


In pow2, the reinterpret_cast<float &>(bits) at line 252 is undefined behavior in standard C++. The newly added scalar_exp function in the same header uses std::memcpy for the equivalent bit-reinterpretation at line 50. This type-punning site should also be replaced with std::memcpy for safety and consistency.

- Format C++ code with clang-format - Format Python code with black Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

ypapadop-amd added 30 commits February 27, 2026 14:43

HSA backend initialization (#3)

cbd41af

* CMake support for HSA backend * Stubs for initializing HSA * Stubs for HSA buffers * Stubs for HSA and host buffers * Using new backend CMake declaration * Additional stubs for HSA backend * Formatting * Adding function to track unimplemented APIs

Adding agent information to internal datastructures (#4)

e39d0d2

Memory pool information and buffer allocation support (#5)

d6e92d7

* Identifying memory pools * Support for buffer type alignment and max size * Cache memory properties * Comments * Using fixed-width integrals * Buffer allocation support

HSA tensor data functions (#6)

d3fb964

Fix for correctly choosing memories and reporting size (#8)

0020fa1

HSA backend function implementations (#9)

78ed6e9

* More function implementations, cleanups * Remove redundant information, catch exception

HSA operations fallback to CPU (#11)

69afc42

* HSA backend in test-mul-mat * HSA backend in gpt-2-backend * Offloading to CPU backend if operation not implemented

HSA queue support (#12)

7fb5f13

* Creating HSA queue * Zero-init all members * Adding signal support

Use GGML_ABORT instead of abort() (#13)

9eec22f

CPU backend fallback (#15)

7045076

* Add option for CPU fallback in CMakeLists * Adding fallback to CPU backend if operation is not supported

Replacing mutex and bool with call_once (#17)

a17ed35

Exception and error handling (#25)

d6759cf

* Using static instead of anonymous namespace * Handling exceptions

Clang format support (#26)

e7c41d0

* clang-format configuration loosely based on ggml-sycl * Formatting

Free packet memory after synchronization (#28)

0e3ba5b

* Comments, disabling copy/move when not allowed * Replacing high / low bits macros * Factoring out dispatch functionality * Free all finished packets

ggml_backend_buffer_init_tensor support (#29)

54fcc2d

* Missing checks in example * Adding init_tensor support

Using alias for std::filesystem (#30)

4137c2e

Binary instructions format loading (#32)

5103ca8

* Using new compilation process in CMake * Loading insts as binary

Release memory allocated for packets (#33)

d1dbf00

* Renaming dispatch function * Track allocated memory for packets

Directly create PDI instead through xclbin (#34)

80261cf

HSA tensor extra metadata (#37)

5d44e05

* Avoid warnings * Adding extra data to HSA backend tensors * Caching kernel in tensor extra metadata

Various fixes (#38)

6be1343

* Renaming pending data functions. Refactor packet dispatch * Guard all CPU fallback cases * Internal nodes do not init extra until after graph allocation * Assert cleanups * Comments * Separate CMake support

CMake Refactor (#39)

0ec6070

* Add expected find_package definitions * Expose both C and C++ Peano compilers * Remove unused property * Relocating kernels * Output kernels for a device in a directory * Explicit names for Peano compilation

Stand-alone kernel discovery header (#40)

fd4670b

* Python script fixes * Separating kernel discovery to its own header

Renaming kernel to the IRON equivalent (#41)

e7c9269

Fix CPU emulation ops (#42)

5cd2849

* Remove conservative asserts * Removing cpy kernel. Delegating to the CPU device for supports_op * Extracting types in example * Create completely independent fallback graph * Correct source tensor iteration * Better messages * Caching emulated tensors

ypapadop-amd and others added 10 commits February 27, 2026 13:21

Fix circular buffer overrun (#166)

81d5e73

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Deferred compilation (#167)

823c8fc

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

ci: add auto-format workflow for ggml-hsa directory (#168)

8805b6c

Fallback to data memory if kernarg is not available (#169)

97678cc

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Add scalar ARGMAX kernel for HSA backend (#171)

10b9e33

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>

feat(ggml-hsa): add COUNT_EQUAL operation kernel (#173)

3f6a774

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Implement CROSS_ENTROPY_LOSS, and improve vec_exp in SOFT_MAX

71a20bd

Auto-format code in src/ggml-hsa

56a41cb

- Format C++ code with clang-format - Format Python code with black Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Vector op broadcast (#174)

aa3df92

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>

Code documentation (#176)

946dc10

ypapadop-amd self-assigned this Mar 9, 2026

Copilot AI review requested due to automatic review settings March 9, 2026 20:11

Copilot started reviewing on behalf of ypapadop-amd March 9, 2026 20:12 View session

ypapadop-amd linked an issue Mar 9, 2026 that may be closed by this pull request

MNIST inference on NPU #119

Open

Copilot AI reviewed Mar 9, 2026

View reviewed changes

feat(hsa): add null pointer assertions in AIE kernel dispatch (#178)

9737fac

ypapadop-amd force-pushed the ypapadop-amd/mnist-layers branch from d6ba8d1 to 8dc17cb Compare March 10, 2026 16:05

ypapadop-amd and others added 5 commits March 10, 2026 12:10

refactor(hsa): move scalar math functions to shared header

ece5579

feat(hsa): move pow2, floor_log2 and alibi_slope math functions

b448427

Auto-format code in src/ggml-hsa

69f9da3

- Format C++ code with clang-format - Format Python code with black Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

feat(ggml-hsa): add scalar f32 matmul support for AIE2/AIE2p

ed6e5d3

f32 support in MUL_MAT

8dc17cb

ypapadop-amd marked this pull request as draft March 12, 2026 20:00

ypapadop-amd force-pushed the hsa-backend branch from 982bd89 to 43e18d6 Compare March 18, 2026 19:33

ypapadop-amd force-pushed the hsa-backend branch 2 times, most recently from b68364f to ec7af98 Compare April 4, 2026 14:30

ypapadop-amd force-pushed the hsa-backend branch from ec7af98 to 9e4abe6 Compare April 16, 2026 12:42

ypapadop-amd force-pushed the hsa-backend branch 2 times, most recently from 0e56e13 to 2ef36f6 Compare May 19, 2026 16:24

ypapadop-amd force-pushed the hsa-backend branch from 723bd26 to 95d9a71 Compare June 26, 2026 18:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Layers for MNIST example#177

Layers for MNIST example#177
ypapadop-amd wants to merge 141 commits into
hsa-backendfrom
ypapadop-amd/mnist-layers

ypapadop-amd commented Mar 9, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 9, 2026

Uh oh!

Copilot AI Mar 9, 2026

Uh oh!

Copilot AI Mar 9, 2026

Uh oh!

Copilot AI Mar 9, 2026

Uh oh!

Copilot AI Mar 9, 2026

Uh oh!

Copilot AI Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

ypapadop-amd commented Mar 9, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants