feat(hip-kernel-provider): add rocKE conv engine with ML heuristic#8982
cderb wants to merge 10 commits into
… compilation

Adds a new hipDNN engine plugin for grouped convolution forward pass using the rocKE implicit GEMM framework. The engine selects tile configurations via a LightGBM ML heuristic and compiles kernels at plan-build time by lowering rocKE IR to HSACO.

## C++ engine (src/engines/rocke_conv_engine/)

- RockeConvEngine: registers the engine with the plugin, handles isApplicable checks (fp16, rank-4, gfx942/gfx950/gfx90a, model file present).
- ConvFwdPlanBuilder: extracts the conv problem from the hipDNN op-graph (NCHW logical dim order), enumerates tile candidates, scores them with the ML heuristic, lowers the winning spec to LLVM IR via rocKE, patches the IR for LLVM 23 / ROCm 7.14 compatibility, and compiles to HSACO via direct `clang -x ir` invocation.
- ConvFwdPlan: stores the compiled HIP module/function and kernel launch params; executes by binding tensor pointers from the workspace map and dispatching hipModuleLaunchKernel.

Key implementation details:

- gfx942 uses warp_tile_k=8 (32x32x8 MFMA atom for f16); gfx950/gfx90a use 16.
- LightGBM symbols declared weak so the plugin loads without liblgbm.so at link time; falls back to first valid tile config if the model is not loaded.
- IR patching (patchMakeBufferRsrc): normalises the llvm.amdgcn.make.buffer.rsrc intrinsic across rocKE LLVM20/22 output flavors to the form accepted by the ROCm 7.14 container clang 23 build (.p8.p1, i64 num_records, no parameter attributes). Injects zext instructions in the kernel entry block to widen i32 byte-count params to i64 at call sites without breaking the kernel ABI.
- hipRTC/comgr bypassed entirely: comgr's internal IR auto-upgrade pass mangles ptr addrspace(N) intrinsic arguments, causing verifier failures. Direct clang invocation avoids this.
- Tensor dims read in NCHW logical order ([N,C,Hi,Wi] / [K,C,Y,X]) as required by the hipDNN frontend, with NHWC-contiguous strides set separately.

## ML heuristic (rocKE/Cpp/include/rocke/conv_ml_heuristic.h)

New C++ header wrapping the LGBM C API for conv tile-config scoring. Declares LGBM symbols as weak externals so the engine plugin loads in environments where liblgbm.so is absent.

## Python heuristics (rocKE/Python/rocke/heuristics/)

- feature_engine_grouped_conv.py: extended feature set (101 → 107 features) including log-space geometric features, CU occupancy ratios, and L2/memory pressure proxies.
- gen_conv_sweep_data.py: sweep data generator for grouped conv; runs inside Enroot containers on Slurm GPU nodes.
- augment_coverage_conv.py: targeted OOF-driven shape generator to fill coverage gaps in the training distribution.
- generate_coverage_conv.py: coverage analysis and shape generation utilities.
- sample_shapes_conv.py: random shape sampler with architecture-aware filtering.
- validate_ml_vs_oracle_conv.py: ML vs oracle comparison for conv predictions.
- train.py: updated to write feature_spec.json alongside trained models.

## Trained models (rocKE/Python/rocke/heuristics/models/)

Initial model checkpoints for gfx942, gfx950, gfx90a (grouped conv fwd fp16):

- model_tflops.lgbm.gz: compressed LightGBM booster (gunzip before use)
- feature_spec.json: ordered feature name list consumed by the C++ heuristic
- train_manifest.json: training provenance metadata

## CMake wiring

- ENABLE_ROCKE_CONV_ENGINE option (default OFF) guards the new engine.
- rocke_core (the rocKE C library) built as a subproject; linked into hip_kernel_provider_impl via TARGET_OBJECTS propagation.
- Integration test target hip_kernel_provider_integration_tests extended with four conv forward smoke tests (3x3_small, 1x1_pointwise, 3x3_stride2, 3x3_rect_spatial) covering correctness on gfx942.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
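The selection rule described above (score every valid tile candidate, or fall back to the first valid one when the model is not loaded) can be sketched roughly as follows. `TileConfig`, `Scorer`, and `select_tile` are hypothetical names used only for illustration, and an empty `std::function` stands in for "LightGBM not available"; the real engine's types and scoring call are not shown in this PR text.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical tile description; the real rocKE spec carries more fields.
struct TileConfig {
    int tile_m;
    int tile_n;
    int tile_k;
    bool valid;  // assumed result of the engine's applicability checks
};

// Returns a predicted throughput for a candidate. An empty Scorer models the
// case where the LightGBM model could not be loaded.
using Scorer = std::function<double(const TileConfig&)>;

// Returns the index of the chosen candidate, or -1 if no candidate is valid.
int select_tile(const std::vector<TileConfig>& candidates, const Scorer& scorer) {
    int best = -1;
    double best_score = 0.0;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (!candidates[i].valid) continue;
        // Fallback path: without a model, take the first valid config.
        if (!scorer) return static_cast<int>(i);
        const double score = scorer(candidates[i]);
        if (best < 0 || score > best_score) {
            best = static_cast<int>(i);
            best_score = score;
        }
    }
    return best;
}
```

Returning an index rather than a copy keeps the sketch independent of how the caller stores its candidates.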
Add per-candidate process isolation to the conv candidate sweep via fork()+exec() of a helper binary (rocke_kern_time), eliminating GPU context poisoning from hung kernels. Pre-launch validation via hipFuncGetAttribute catches resource-limit failures before launching.

Replace gen_conv_sweep_data.py (slow Python-only sweep) with a C++ sweep wrapper that maintains the same CLI interface (--shapes, --shape-set, --arch, --max-shapes) and generate() API for gen_sweep_data.py dispatch.

Conv engine integration: hipRTC linker API replaces popen(clang), ConvHwProfile uses live hipDeviceProp_t, int64 byte-size computation.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
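As a rough illustration of the per-candidate isolation idea, the sketch below runs a command in a child process and kills it if it exceeds a deadline. `run_isolated` and its return-code convention are assumptions made for this example, not the sweep tool's actual interface, and it polls `waitpid` for simplicity rather than reproducing whatever waiting strategy the tool uses.

```cpp
#include <cerrno>
#include <chrono>
#include <string>
#include <thread>
#include <vector>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs args[0] with the given arguments in a child process.
// Returns the child's exit code, -1 if it timed out and was killed,
// or -2 on any other failure (hypothetical convention for this sketch).
int run_isolated(const std::vector<std::string>& args, int timeout_ms) {
    if (args.empty()) return -2;
    std::vector<char*> argv;
    for (const auto& a : args) argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    const pid_t pid = fork();
    if (pid < 0) return -2;
    if (pid == 0) {
        execvp(argv[0], argv.data());
        _exit(127);  // exec failed; report like a shell would
    }

    const auto deadline =
        std::chrono::steady_clock::now() + std::chrono::milliseconds(timeout_ms);
    int status = 0;
    for (;;) {
        const pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) break;
        if (r < 0 && errno != EINTR) return -2;
        if (std::chrono::steady_clock::now() >= deadline) {
            kill(pid, SIGKILL);        // a hung child cannot poison the parent
            waitpid(pid, &status, 0);  // reap it to avoid a zombie
            return -1;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -2;
}
```

Because the kernel under test lives in its own process, any GPU context it corrupts is torn down with that process instead of affecting later candidates.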
…conv-models # Conflicts: # projects/hipdnn/data_sdk/include/hipdnn_data_sdk/utilities/EngineNames.hpp
…spatcher tests Run black 25.12.0 and clang-format 18 on all changed files to satisfy pre-commit. Add gfx90a unit tests to test_conv.py for dispatcher coverage. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Add trailing newlines to model JSON files and remove extra blank line in rocke_conv_engine/CMakeLists.txt. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…istics

- Add missing stride_w/pad_w to validation results (validate_ml_vs_oracle_conv)
- Convert sweep latency_us to latency_ms to match training pipeline (gen_conv_sweep_data)
- Cache ConvMLHeuristic across buildPlan calls to avoid reloading model from disk
- Remove dead depthwise coverage section (C/G=1 fails MFMA alignment constraint)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
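One way to cache a heuristic object across plan builds is a process-wide map keyed by model path, sketched below. `Heuristic`, `g_model_loads`, and `get_cached_heuristic` are hypothetical stand-ins; the PR text does not show how `ConvMLHeuristic` is actually cached, so treat this only as an illustration of the idea.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Counts how many times a model was "loaded" (stand-in for disk I/O).
int g_model_loads = 0;

// Hypothetical stand-in for the heuristic; construction models a disk load.
struct Heuristic {
    explicit Heuristic(std::string path) : model_path(std::move(path)) {
        ++g_model_loads;
    }
    std::string model_path;
};

// Returns a shared instance per model path, constructing it on first use only.
std::shared_ptr<Heuristic> get_cached_heuristic(const std::string& model_path) {
    static std::mutex mu;
    static std::map<std::string, std::shared_ptr<Heuristic>> cache;
    std::lock_guard<std::mutex> lock(mu);
    auto it = cache.find(model_path);
    if (it == cache.end()) {
        it = cache.emplace(model_path, std::make_shared<Heuristic>(model_path)).first;
    }
    return it->second;
}
```

The mutex matters only if plans can be built from several threads; a single-threaded builder could drop it.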
This work is very cool and usefully shows a bunch of the pieces that we can build on and learn from. The base library is still in a lot of flux, so things will likely move around and break this PR. We are also trying to focus on SDPA, so we may delay landing convs until we are sure about how we organize the SDPA work. But all the pieces of the conv work are coming together here nicely, and we will land some version of this in the next weeks!
…ic tooling Simplify ConvFwdParams to only fields needed at execute time (UIDs, byte sizes, grid/block, kernel name), pre-compute byte sizes at build time. Eliminate two heap allocations per predict_tflops call via pre-allocated member buffers. Replace mkstemp with memfd_create and busy-wait with blocking waitpid+alarm in sweep tool. Consolidate duplicated Python helpers (HEADER, SHAPE_COLS, bucket functions, write_csv) into canonical locations. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
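The reason for computing byte sizes in 64-bit arithmetic can be shown with a small sketch: for large tensors the element count times the element size exceeds the 32-bit range. `tensor_bytes` is a hypothetical helper written for this example, not a function from the PR.

```cpp
#include <array>
#include <cstdint>

// Computes the byte size of a rank-4 tensor, widening every factor to int64
// before multiplying so the product cannot wrap in 32-bit arithmetic.
std::int64_t tensor_bytes(const std::array<int, 4>& dims, std::int64_t elem_size) {
    std::int64_t bytes = elem_size;
    for (int d : dims) {
        bytes *= static_cast<std::int64_t>(d);
    }
    return bytes;
}
```

For example, an fp16 tensor with dims 64 x 1024 x 256 x 256 needs 8,589,934,592 bytes, which does not fit in a 32-bit integer. Doing this once at build time means the execute path only reads a stored value.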
…n changes Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Codecov Report

✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (76.92%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

| Coverage Diff | develop | #8982 | +/- |
|---|---|---|---|
| Coverage | 71.33% | 71.33% | |
| Files | 2628 | 2628 | |
| Lines | 413043 | 413043 | |
| Branches | 61875 | 61875 | |
| Hits | 294613 | 294617 | +4 |
| Misses | 96656 | 96653 | -3 |
| Partials | 21774 | 21773 | -1 |

This pull request uses carry forward flags.
…conv-models

# Conflicts:
#	dnn-providers/hip-kernel-provider/rocke/platform/Cpp/include/rocke/conv_ml_heuristic.h
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/augment_coverage_conv.py
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/gen_conv_sweep_data.py
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/gen_sweep_data.py
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/generate_coverage_conv.py
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx90a/feature_spec.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx90a/model_tflops.lgbm.gz
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx90a/train_manifest.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx942/feature_spec.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx942/model_tflops.lgbm.gz
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx942/train_manifest.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx950/feature_spec.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx950/model_tflops.lgbm.gz
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx950/train_manifest.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/sample_shapes_conv.py
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/validate_ml_vs_oracle_conv.py
…alid refs
all_buf_{kConvFeatureCount} used brace-init which selects the
initializer_list<double> ctor, creating a 1-element vector (value 109.0)
instead of a 109-element buffer — heap overflow on every predict_tflops
call. Use explicit construction instead.
Also fix two _valid() call sites in generate_coverage_conv.py missed
during the _valid → conv_shape_valid rename.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
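The brace-init pitfall described in this commit can be reproduced in isolation. In the sketch below, `BraceInit` and `ParenInit` are hypothetical structs that mirror the two member-initializer forms; only the count 109 is taken from the commit message.

```cpp
#include <cstddef>
#include <vector>

// Feature count quoted in the commit message above.
constexpr std::size_t kFeatureCount = 109;

// Brace-init prefers the initializer_list<double> constructor, so this
// creates a vector with ONE element whose value is 109.0.
struct BraceInit {
    BraceInit() : buf{kFeatureCount} {}
    std::vector<double> buf;
};

// Parentheses select the (count, value) constructor, so this creates
// 109 zero-initialized elements, which is what a feature buffer needs.
struct ParenInit {
    ParenInit() : buf(kFeatureCount, 0.0) {}
    std::vector<double> buf;
};
```

The brace form compiles without a narrowing error because the count is a constant expression that a `double` represents exactly, which is why the bug only shows up at run time as an out-of-bounds write.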
Summary
Adds the rocKE conv forward engine to hip-kernel-provider, providing JIT-compiled implicit-GEMM convolution kernels selected by a LightGBM ML heuristic. The engine compiles IR to HSACO at runtime via the hipRTC linker API and selects the best kernel candidate per-shape using a trained tflops prediction model. Ships trained models for gfx90a, gfx942, and gfx950. Also includes C++ sweep tooling with fork+exec per-candidate GPU isolation for robust training data collection.
JIRA ID : AICK-1533
Conv model stats (5-fold grouped CV, 2000 LightGBM estimators)
Risk Assessment
Medium risk. This adds a new opt-in engine (`ENABLE_ROCKE_CONV_ENGINE=ON`, off by default) with JIT compilation, ML model loading, and a new hipRTC linker code path. The engine is behind a build flag and does not affect existing engines or default behavior. Integration tests cover correctness on gfx90a and gfx942; gfx950 coverage is pending.
ASIC Coverage
Specific-ASIC runs required on gfx90a, gfx942, and gfx950. The engine ships arch-specific ML models and generates arch-specific IR; each target architecture must pass integration tests independently. The engine is gated by `ENABLE_ROCKE_CONV_ENGINE` and does not affect other engines, so no full multi-arch sweep of unrelated components is needed.
Testing Summary
- Integration tests (`IntegrationGpuRockeConvFwdFp16`) validate end-to-end: model load → feature extraction → LightGBM inference → kernel selection → IR compile → dispatch → numerical correctness vs CPU reference.
- Unit tests (`test_conv.py`) validate candidate selection, arch gating, and support surface checks (CPU-only).
Testing Checklist
- `--gtest_filter="*RockeConv*"` - ASICs: gfx90a - Status: Passed
- `--gtest_filter="*RockeConv*"` - ASICs: gfx942 - Status: Passed
- `--gtest_filter="*RockeConv*"` - ASICs: gfx950 - Status: Pending
- `test_conv.py` - Status: Passed
Technical Changes
- `RockeConvEngine` hipdnn engine plugin: receives conv op-graph, extracts NHWC problem params, delegates to `ConvFwdPlanBuilder` for JIT kernel compilation and dispatch.
- `ConvFwdPlanBuilder`: queries `ConvMLHeuristic` for top-K candidate kernels, compiles IR → HSACO via `hiprtcLinkCreate`/`hiprtcLinkAddData`/`hiprtcLinkComplete`, returns executable `ConvFwdPlan`.
- `ConvMLHeuristic` (C++): loads `.lgbm` model and `feature_spec.json`, extracts features from conv problem + hardware profile (`hipDeviceProp_t`), predicts tflops per candidate.
- `ROCKE_CONV_ENGINE` in `EngineNames.hpp` alongside existing `ROCKE_ENGINE`.
- C++ sweep tooling (`conv_candidate_sweep.cpp`, `rocke_kern_time.cpp`) with fork+exec isolation per candidate (5s timeout, SIGKILL on hang) for training data collection.
- `gen_conv_sweep_data.py` wrapping the C++ sweep binary; supports `--shapes` CSV input for targeted coverage augmentation.
- Trained models (`.lgbm.gz`) and `feature_spec.json` for gfx90a (101 features), gfx942 (101 features), and gfx950 (72 features).
- `IntegrationGpuRockeConvFwdFp16` integration test with 4 smoke shapes (3×3, 1×1, strided, rectangular).
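The combination of NCHW logical dimensions with an NHWC-contiguous memory layout, which the engine relies on when extracting problem params, can be illustrated with a small stride computation. `nhwc_strides` is a hypothetical helper for this sketch, not a function from the engine.

```cpp
#include <array>
#include <cstdint>

// dims are given in logical NCHW order: {N, C, H, W}.
// Returns element strides in the same logical order for a tensor whose
// memory is laid out NHWC-contiguously (C varies fastest, then W, H, N).
std::array<std::int64_t, 4> nhwc_strides(const std::array<std::int64_t, 4>& dims) {
    const std::int64_t C = dims[1];
    const std::int64_t H = dims[2];
    const std::int64_t W = dims[3];
    return {H * W * C,  // stride for N
            1,          // stride for C (innermost in memory)
            W * C,      // stride for H
            C};         // stride for W
}
```

For dims {2, 8, 4, 5} this gives strides {160, 1, 40, 8}: the logical order stays NCHW while the channel stride of 1 marks the NHWC layout.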