diff --git a/docs/source/jit_lto_guide.md b/docs/source/jit_lto_guide.md
index 490bec5914..4e25322636 100644
--- a/docs/source/jit_lto_guide.md
+++ b/docs/source/jit_lto_guide.md
@@ -475,86 +475,37 @@ extern "C" __global__ void search_kernel(

 **Note**: The kernel uses generic function templates (`compute_distance` and `apply_filter`) that are resolved at link time. The specific implementations (euclidean vs inner_product, filter_none vs filter_bitset) are provided by the fragments that get linked together.

-### Step 5: Create `.cpp.in` Template Files for Embedding
+### Step 5: Create Fragment Tags for Embedding

-The `.cpp.in` files register the compiled fatbins so they can be loaded at runtime. Fragment tags are used to help the
-linker find and include the relevant fatbins at build time.
+Fragment tags register the compiled fatbins so they can be loaded at runtime. They are used to help the linker find and include the relevant fatbins at build time. When calling `generate_jit_lto_kernels()`, we pass a `FRAGMENT_TAG_FORMAT` argument, which constructs the tag type from the given placeholders, and a `FRAGMENT_TAG_HEADER_FILES` argument, which specifies one or more header files that the fragment tags come from. The JIT+LTO system will then automatically generate and compile a `.cpp` file that registers the fragment using the provided tag.

-**Important**: In the `.cpp.in` files (which become `.cpp` files), we use **tags** (like `tag_f`, `tag_h`) instead of real types (like `float`, `__half`) in the `StaticFatbinFragmentEntry` template parameters. This avoids including heavy headers that define the actual types, significantly improving compilation times. The tags are lightweight empty structs that serve only as compile-time identifiers.
+**Important**: When requesting fragments from the `AlgorithmPlanner`, we use **tags** (like `tag_f`, `tag_h`) instead of real types (like `float`, `__half`) in the `add_static_fragment` template parameters. This avoids including heavy headers that define the actual types, significantly improving compilation times. The tags are lightweight empty structs that serve only as compile-time identifiers.

-#### `compute_distance_embedded.cpp.in`
+**`registration_tags.hpp`**

-```text
-/*
- * SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION.
- * SPDX-License-Identifier: Apache-2.0
- */
-
-#include
-#include
-#include "@embedded_header_file@"
-
-namespace example::detail {
-
-using _FragmentEntry = StaticFatbinFragmentEntry>;
-
-template <>
-const uint8_t* const _FragmentEntry::data = embedded_fatbin;
-
-template <>
-const size_t _FragmentEntry::length = embedded_fatbin;
-
-}
-```
-
-#### `filter_embedded.cpp.in`
-
-```text
-/*
- * SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION.
- * SPDX-License-Identifier: Apache-2.0
- */
-
-#include
-#include
-#include "@embedded_header_file@"
-
-namespace example::detail {
-
-using _FragmentEntry = StaticFatbinFragmentEntry>;
-
-template <>
-const uint8_t* const _FragmentEntry::data = embedded_fatbin;
-
-template <>
-const size_t _FragmentEntry::length = embedded_fatbin;
-
-}
-```
-
-#### `search_kernel_embedded.cpp.in`
-
-```text
-/*
- * SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION.
- * SPDX-License-Identifier: Apache-2.0
- */
+```cpp
+#pragma once

-#include
-#include
-#include "@embedded_header_file@"
+struct tag_h{};
+struct tag_f{};
+struct tag_d{};
+struct tag_ui{};
+struct tag_l{};

-namespace example::detail {
+struct tag_metric_euclidean {};
+struct tag_metric_inner_product {};

-using _FragmentEntry = StaticFatbinFragmentEntry>;
+struct tag_filter_none {};
+struct tag_filter_bitset {};

-template <>
-const uint8_t* const _FragmentEntry::data = embedded_fatbin;
+template
+struct fragment_tag_search {};

-template <>
-const size_t _FragmentEntry::length = embedded_fatbin;
+template
+struct fragment_tag_compute_distance {};

-}
+template
+struct fragment_tag_filter {};
 ```

 ### Step 6: Create the Planner
@@ -565,7 +516,7 @@ The planner is responsible for:
 3. Requesting the fragments from the fragment database
 4. Linking them together to create a launchable kernel

-**CRITICAL**: The fragment keys constructed in the planner methods must match **EXACTLY** with the keys used in the corresponding `.cpp.in` registration files. Any mismatch will result in runtime linking failures.
+**CRITICAL**: The fragment keys constructed in the planner methods must match **EXACTLY** with the keys used in the corresponding `FRAGMENT_TAG_FORMAT` argument. Any mismatch will result in runtime linking failures.

 **`search_planner.hpp`**:

@@ -642,11 +593,13 @@ constexpr auto get_out_type_tag() {
 template
 constexpr auto get_metric_tag() {
   if constexpr (Metric == DistanceType::Euclidean) return tag_metric_euclidean{};
+  if constexpr (Metric == DistanceType::InnerProduct) return tag_metric_inner_product{};
 }

 template
 constexpr auto get_filter_tag() {
   if constexpr (Filter == FilterType::None) return tag_filter_none{};
+  if constexpr (Filter == FilterType::Bitset) return tag_filter_bitset{};
 }

 template
@@ -755,8 +708,6 @@ struct tag_l {}; // int64_t

 These tags are used in `registerAlgorithm<>()` to create a hierarchical organization of fragments.

-**Why Tags Instead of Real Types?**: Using tags instead of real types (like `float`, `__half`) in the `.cpp.in` files avoids including heavy headers that define those types. This significantly improves compilation times since the generated `.cpp` files don't need to pull in CUDA headers, type definitions, or other dependencies. Tags are lightweight compile-time identifiers that don't require any runtime overhead or additional includes.
-
 ### AlgorithmLauncher

 The `AlgorithmLauncher` is the runtime handle for a linked kernel. It:
@@ -770,7 +721,7 @@ The `AlgorithmLauncher` is the runtime handle for a linked kernel. It:

 2. **Fragment Granularity**: Balance between too many small fragments (overhead) and too few large fragments (less reuse). Device functions that are reused across multiple kernels are good candidates for separate fragments.

-3. **Naming Consistency**: Ensure fragment keys match exactly between registration and lookup. Use helper functions to construct keys consistently.
+3. **Naming Consistency**: Ensure fragment tags match exactly between registration and lookup. Use helper functions to construct tags consistently.

 4. **Type Safety**: Use registration tags to provide compile-time type safety and avoid runtime string mismatches.

@@ -794,7 +745,8 @@ The `generate_jit_lto_kernels()` function (defined in `cmake/modules/generate_ji
 - `NAME_FORMAT`: Format string for generated kernel names (using `@variable@` syntax)
 - `MATRIX_JSON_FILE`: Path to the JSON matrix file
 - `KERNEL_INPUT_FILE`: Path to the `.cu.in` template
-- `EMBEDDED_INPUT_FILE`: Path to the `.cpp.in` template
+- `FRAGMENT_TAG_FORMAT`: Format string for fragment tag type (using `@variable@` syntax)
+- `FRAGMENT_TAG_HEADER_FILES`: List of header files that provide the fragment tag types (can be enclosed in `<`/`>` or `"`/`"`, automatically enclosed in quotes if quotes and brackets are not provided)
 - `OUTPUT_DIRECTORY`: Where generated files are placed
 - `KERNEL_LINK_LIBRARIES`: Interface library with compilation settings

@@ -814,7 +766,7 @@ The process involves:
 1. Separating device functions into fragment headers
 2. Creating JSON matrices defining parameter combinations
 3. Creating `.cu.in` templates for explicit instantiations
-4. Creating `.cpp.in` templates for fatbin registration
+4. Creating fragment tag types for fatbin registration
 5. Creating a planner to manage fragment dependencies
 6. Integrating the planner into the code path to launch kernels
 7. **Adding CMake integration** to generate and compile all fragment variants
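
Review note on the tag helpers touched by this change: the sketch below illustrates why the added `InnerProduct` and `Bitset` branches matter and how a compile-time check can guard the "tags must match exactly" requirement. It is a standalone illustration, not code from the guide. The enum definitions, the template parameter lists, and the names `fragment_tag_example` and `lookup_tag_t` are assumptions made for the example, since the real parameter lists are not visible in this diff. In standard C++, a helper with a deduced `auto` return type and no `return` for a given enumerator deduces `void`, so a missing branch yields a tag type that silently differs from the registered one. Ending the chain with a `static_assert` turns that case into a compile error.

```cpp
#include <type_traits>

// Assumed stand-ins for the guide's enums; the real definitions live elsewhere.
enum class DistanceType { Euclidean, InnerProduct };
enum class FilterType { None, Bitset };

// Tag names as they appear in registration_tags.hpp in this diff.
struct tag_metric_euclidean {};
struct tag_metric_inner_product {};
struct tag_filter_none {};
struct tag_filter_bitset {};

// Hypothetical fragment tag; the real template parameters are not shown in the diff.
template <typename MetricTag, typename FilterTag>
struct fragment_tag_example {};

// Map each enumerator to its tag. The static_assert in the final branch
// rejects any enumerator that has no tag instead of silently returning void.
template <DistanceType Metric>
constexpr auto get_metric_tag() {
  if constexpr (Metric == DistanceType::Euclidean) {
    return tag_metric_euclidean{};
  } else {
    static_assert(Metric == DistanceType::InnerProduct, "unhandled DistanceType");
    return tag_metric_inner_product{};
  }
}

template <FilterType Filter>
constexpr auto get_filter_tag() {
  if constexpr (Filter == FilterType::None) {
    return tag_filter_none{};
  } else {
    static_assert(Filter == FilterType::Bitset, "unhandled FilterType");
    return tag_filter_bitset{};
  }
}

// The tag type a lookup would construct from runtime-independent parameters.
template <DistanceType Metric, FilterType Filter>
using lookup_tag_t = fragment_tag_example<decltype(get_metric_tag<Metric>()),
                                          decltype(get_filter_tag<Filter>())>;

// The lookup tag must be the same type as the tag spelled out at registration.
static_assert(std::is_same_v<
    lookup_tag_t<DistanceType::InnerProduct, FilterType::Bitset>,
    fragment_tag_example<tag_metric_inner_product, tag_filter_bitset>>);
```

Routing both registration and lookup through one set of helpers like these keeps the two sides from drifting apart, which is the intent of the "Naming Consistency" item above.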