Layers for MNIST example#177
Conversation
* CMake support for HSA backend * Stubs for initializing HSA * Stubs for HSA buffers * Stubs for HSA and host buffers * Using new backend CMake declaration * Additional stubs for HSA backend * Formatting * Adding function to track unimplemented APIs
* Identifying memory pools * Support for buffer type alignment and max size * Cache memory properties * Comments * Using fixed-width integrals * Buffer allocation support
* Adding HSA backend to examples/simple-backend * Adding HSA backend to backend registration * Adding Eigen as a temporary matrix mulmat implementation * Support for device type reporting * Support for free and total memory reporting * Properly reporting which kernels are supported
* More function implementations, cleanups * Remove redundant information, catch exception
* Correcting comment * Add description from agent name * Implementation of backend get_device_description * Comments * Adding cpu backend as fallback for all ops * Marking which functions can be improved and correct guid * Remove needs-implementation marker on more functions * Hide cpu backend internally * Remove extra header
* HSA backend in test-mul-mat * HSA backend in gpt-2-backend * Offloading to CPU backend if operation not implemented
* Creating HSA queue * Zero-init all members * Adding signal support
* Add HSA backend to GPT-2 example * Remove CPU backend from HSA * Returning that it is host buffer for NPU memory * Adding CPY kernels and factoring out kernel code * Formatting, comments * Temporary storage for cpy * Extracting supports_op conditions * Renaming function
* Add option for CPU fallback in CMakeLists * Adding fallback to CPU backend if operation is not supported
* Add operation example * Using tensor count variable * Count source tensors and copy name * Detect if execution failed * Switching test to int32_t * add kernel using XRT * Aligning example size with kernel * Adding dev heap pool * Using HSA in add kernel * Using relaxed write to queue * Remove XRT dependency * Size independent test * Correct elements for kernel * Moving load functions to common.h * Using simplified AIE packet * Moving loading to a kernel registry * Adding constructor * Add kernel * Refactoring add script * Single name for PDI and instr.txt * Refactoring * Generalizing add.py * Adding dims * Comments and error checking * Modularize python script * CMake kernel generation * Remove magic numbers and use GGML data type naming * Adding a structure for NPU kernels and free function * Accepting only contiguous tensors for now * Stub for keeping loaded kernel in context * Passing device info as parameter, renaming contexts for easy filtering * Renaming variables * Reworking example * Using registry of kernels * Using HSA agent name for kernels * Using dladdr to get the kernel directory
* Using static instead of anonymous namespace * Handling exceptions
* clang-format configuration loosely based on ggml-sycl * Formatting
* Comments, disabling copy/move when not allowed * Replacing high / low bits macros * Factoring out dispatch functionality * Free all finished packets
* Missing checks in example * Adding init_tensor support
* Vector add for floats * Handling higher dims upon load * Move tensor testing in the operation supports function * mul_mat kernel compilation * Smaller gemm * Renaming args * Smaller gemm * Copy instead of moving PDI * Unify cmake kernel generation functions * Using latest CMake Python integration * Missing CMake HSA integration * Install kernels * Adding missing dependencies * Updating test to use HSA conditionally. HSA-specific mul_mat test * Encoding all dims in kernel filename
* Using new compilation process in CMake * Loading insts as binary
* Renaming dispatch function * Track allocated memory for packets
* Avoid warnings * Adding extra data to HSA backend tensors * Caching kernel in tensor extra metadata
* Renaming pending data functions. Refactor packet dispatch * Guard all CPU fallback cases * Internal nodes do not init extra until after graph allocation * Assert cleanups * Comments * Separate CMake support
* Add expected find_package definitions * Expose both C and C++ Peano compilers * Remove unused property * Relocating kernels * Output kernels for a device in a directory * Explicit names for Peano compilation
* Python script fixes * Separating kernel discovery to its own header
* Remove conservative asserts * Removing cpy kernel. Delegating to the CPU device for supports_op * Extracting types in example * Create completely independent fallback graph * Correct source tensor iteration * Better messages * Caching emulated tensors
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Format C++ code with clang-format - Format Python code with black Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR refactors the HSA backend AIE kernels to consolidate shared math utilities (scalar_exp, scalar_log, pow2, floor_log2, compute_alibi_slope, and vec_exp) into a single shared header (aie_kernel_math.h), and removes the alignment restriction for the cross-entropy loss kernel by replacing the strict "must be divisible by KERN_VEC_SIZE" check with scalar tail loops that handle the remaining elements.
Changes:
- Shared math utility functions moved into
aie_kernel_math.hwith#pragma onceguard, replacing per-file inline definitions and the old#ifndef _AIE_KERNEL_MATH_guard. - Cross-entropy loss kernel updated to support non-aligned row sizes via scalar tail loops in all three passes (find max, accumulate sum_exp, accumulate loss).
- Python orchestration for cross-entropy loss simplified: alignment validation and the
COMPILE_GGML_OP_CROSS_ENTROPY_LOSScompile flag removed.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
src/ggml-hsa/kernels/iron/aie_kernel_math.h |
Converts to #pragma once, adds scalar_exp/scalar_log, and consolidates pow2, floor_log2, compute_alibi_slope previously defined in softmax.cc |
src/ggml-hsa/kernels/iron/softmax.cc |
Removes the now-shared pow2, floor_log2, and compute_alibi_slope functions (moved to aie_kernel_math.h) |
src/ggml-hsa/kernels/iron/cross_entropy_loss.cc |
Removes scalar_log (moved to aie_kernel_math.h) and adds scalar tail loops to handle non-aligned sizes; removes the #ifdef COMPILE_GGML_OP_CROSS_ENTROPY_LOSS guard |
src/ggml-hsa/kernels/iron/cross_entropy_loss.py |
Removes alignment check and the COMPILE_GGML_OP_CROSS_ENTROPY_LOSS compile flag; tile_size is now simply row_length |
You can also share your feedback on Copilot code review. Take the survey.
| constexpr int NUM_POW2_COEFFS = sizeof(pow2_coeffs) / sizeof(pow2_coeffs[0]); | ||
|
|
||
| // compute 2^f using Horner's method for Taylor series of exp(f * ln(2)) | ||
| float exp_f = pow2_coeffs[0]; | ||
|
|
||
| #pragma unroll | ||
| for (int j = 1; j < NUM_POW2_COEFFS; ++j) { |
There was a problem hiding this comment.
In the pow2 function, NUM_POW2_COEFFS is declared as constexpr int (line 237) and the loop variable is int (line 243). The rest of this PR consistently uses int32_t for local integer variables (e.g., NUM_EXP_COEFFS on line 168 and loop variables in cross_entropy_loss.cc). These should be changed to int32_t for consistency with the surrounding new code.
| float m1 = pow2(-(max_bias / 2.0f) / n_head_log2); | ||
|
|
||
| float slope; | ||
| if (head_idx < n_head_log2) { |
There was a problem hiding this comment.
In compute_alibi_slope, line 304 compares head_idx (type int32_t) with n_head_log2 (type uint32_t), which is a signed/unsigned comparison. If head_idx could ever be negative (it's passed as a int32_t), the comparison would produce incorrect results due to implicit conversion. Consider casting head_idx to uint32_t before the comparison, or document that a non-negative value is required and add an assertion.
| #include "aie_kernel_math.h" | ||
| #include "ggml-aie.hpp" |
There was a problem hiding this comment.
In cross_entropy_loss.cc, #include "ggml-aie.hpp" (line 3) is now redundant because aie_kernel_math.h (line 4) already includes it via its own #include "ggml-aie.hpp". While harmless due to #pragma once, the redundant include could be removed to keep the include list clean.
| # Tile size equals row length; the kernel handles non-aligned sizes | ||
| # via scalar tail loops after vectorized processing. | ||
| tile_size = row_length |
There was a problem hiding this comment.
The PR title is "Layers for MNIST example", but the actual code changes are all in HSA kernel infrastructure files (src/ggml-hsa/kernels/iron/): refactoring shared math utilities into aie_kernel_math.h and enabling non-aligned row sizes in the cross-entropy loss kernel. There is no visible connection to the MNIST example in the diff. If the PR description is correct about the purpose, there may be missing changes (e.g., actual MNIST layer additions), or the PR title may be inaccurate.
| int32_t bits = reinterpret_cast<int32_t &>(x); | ||
| int32_t e_int = ((bits >> 23) & 0xFF) - 127; | ||
| float e = static_cast<float>(e_int); | ||
|
|
||
| int32_t m_bits = (bits & 0x007FFFFF) | 0x3F800000; | ||
| float m = reinterpret_cast<float &>(m_bits); |
There was a problem hiding this comment.
In scalar_log (lines 71, 76) and pow2 (line 252), type punning is performed using reinterpret_cast<T&>, which is undefined behavior in standard C++. By contrast, the newly added scalar_exp function in the same file correctly uses std::memcpy (line 50) for the equivalent operation. All three type-punning sites should use std::memcpy for consistency and correctness.
| // create the integer representation of 2^i | ||
| int32_t bits = (127 + i) << 23; | ||
| // cast the bits directly to float | ||
| float scale = reinterpret_cast<float &>(bits); |
There was a problem hiding this comment.
In pow2, the reinterpret_cast<float &>(bits) at line 252 is undefined behavior in standard C++. The newly added scalar_exp function in the same header uses std::memcpy for the equivalent bit-reinterpretation at line 50. This type-punning site should also be replaced with std::memcpy for safety and consistency.
d6ba8d1 to
8dc17cb
Compare
- Format C++ code with clang-format - Format Python code with black Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
982bd89 to
43e18d6
Compare
b68364f to
ec7af98
Compare
ec7af98 to
9e4abe6
Compare
0e56e13 to
2ef36f6
Compare
723bd26 to
95d9a71
Compare
No description provided.