
feat(hip-kernel-provider): add rocKE conv engine with ML heuristic #8982

Open
cderb wants to merge 10 commits into develop from users/cderb/rocke-conv-models

cderb (Contributor) commented Jun 30, 2026

Summary

Adds the rocKE conv forward engine to hip-kernel-provider, providing JIT-compiled implicit-GEMM convolution kernels selected by a LightGBM ML heuristic. The engine compiles IR to HSACO at runtime via the hipRTC linker API and selects the best kernel candidate per-shape using a trained tflops prediction model. Ships trained models for gfx90a, gfx942, and gfx950. Also includes C++ sweep tooling with fork+exec per-candidate GPU isolation for robust training data collection.

JIRA ID : AICK-1533

Conv model stats (5-fold grouped CV, 2000 LightGBM estimators)

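| Arch | Shapes | Rows | Features | CV Mean Eff | CV P10 Eff | CV R² |
|--------|-------|---------|-----|-------|-------|-------|
| gfx90a | 1,654 | 104,504 | 101 | 0.990 | 0.970 | 0.998 |
| gfx942 | 2,621 | 26,531 | 101 | 0.991 | 0.972 | 0.998 |
| gfx950 | 8,957 | 621,002 | 72 | 0.950 | 0.903 | 0.966 |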
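
The PR does not define the "Eff" columns. One plausible reading, assumed here, is a per-shape efficiency: the measured TFLOPS of the candidate the model ranks first, divided by the best measured TFLOPS among that shape's candidates, with "P10" taken as a 10th percentile over shapes. The sketch below computes both under that assumption; `Candidate`, `shapeEfficiency`, and `percentileNearestRank` are illustrative names, and the nearest-rank percentile rule is a choice made for this example, not something the PR states.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One measured kernel candidate for a single conv shape (hypothetical layout).
struct Candidate {
    double predicted_tflops;  // model output
    double measured_tflops;   // ground truth from the sweep
};

// Efficiency of the model's top-ranked candidate relative to the oracle best.
// Returns 0.0 for an empty list or when no candidate has positive throughput.
double shapeEfficiency(const std::vector<Candidate>& cands) {
    if (cands.empty()) return 0.0;
    const auto pick = std::max_element(
        cands.begin(), cands.end(), [](const Candidate& a, const Candidate& b) {
            return a.predicted_tflops < b.predicted_tflops;
        });
    const auto best = std::max_element(
        cands.begin(), cands.end(), [](const Candidate& a, const Candidate& b) {
            return a.measured_tflops < b.measured_tflops;
        });
    if (best->measured_tflops <= 0.0) return 0.0;
    return pick->measured_tflops / best->measured_tflops;
}

// Nearest-rank percentile, p in (0, 100]. Returns 0.0 for an empty sample.
double percentileNearestRank(std::vector<double> values, double p) {
    if (values.empty()) return 0.0;
    std::sort(values.begin(), values.end());
    const double n = static_cast<double>(values.size());
    std::size_t rank = static_cast<std::size_t>(std::ceil(p / 100.0 * n));
    if (rank == 0) rank = 1;
    return values[rank - 1];
}
```

Under this reading, a mean efficiency of 0.990 would say the model's first pick reaches on average 99% of the best candidate's throughput, and P10 would describe the weakest tenth of shapes.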

Risk Assessment

Medium risk. This adds a new opt-in engine (ENABLE_ROCKE_CONV_ENGINE=ON, off by default) with JIT compilation, ML model loading, and a new hipRTC linker code path. The engine is behind a build flag and does not affect existing engines or default behavior. Integration tests cover correctness on gfx90a and gfx942; gfx950 coverage is pending.

ASIC Coverage

Specific-ASIC runs required on gfx90a, gfx942, and gfx950. The engine ships arch-specific ML models and generates arch-specific IR; each target architecture must pass integration tests independently. The engine is gated by ENABLE_ROCKE_CONV_ENGINE and does not affect other engines, so no full multi-arch sweep of unrelated components is needed.

Testing Summary

  • C++ integration tests (IntegrationGpuRockeConvFwdFp16) validate end-to-end: model load → feature extraction → LightGBM inference → kernel selection → IR compile → dispatch → numerical correctness vs CPU reference.
  • Python dispatcher unit tests (test_conv.py) validate candidate selection, arch gating, and support surface checks (CPU-only).

Testing Checklist

  • C++ integration tests - --gtest_filter="*RockeConv*" - ASICs: gfx90a - Status: Passed
  • C++ integration tests - --gtest_filter="*RockeConv*" - ASICs: gfx942 - Status: Passed
  • C++ integration tests - --gtest_filter="*RockeConv*" - ASICs: gfx950 - Status: Pending
  • Python dispatcher unit tests - test_conv.py - Status: Passed
  • PR CI - GitHub PR checks - Status: Pending

Technical Changes

  • Adds RockeConvEngine hipdnn engine plugin: receives conv op-graph, extracts NHWC problem params, delegates to ConvFwdPlanBuilder for JIT kernel compilation and dispatch.
  • Adds ConvFwdPlanBuilder: queries ConvMLHeuristic for top-K candidate kernels, compiles IR → HSACO via hiprtcLinkCreate/hiprtcLinkAddData/hiprtcLinkComplete, returns executable ConvFwdPlan.
  • Adds ConvMLHeuristic (C++): loads .lgbm model and feature_spec.json, extracts features from conv problem + hardware profile (hipDeviceProp_t), predicts tflops per candidate.
  • Registers ROCKE_CONV_ENGINE in EngineNames.hpp alongside existing ROCKE_ENGINE.
  • Adds C++ sweep tooling (conv_candidate_sweep.cpp, rocke_kern_time.cpp) with fork+exec isolation per candidate (5s timeout, SIGKILL on hang) for training data collection.
  • Replaces old Python-only conv sweep with gen_conv_sweep_data.py wrapping the C++ sweep binary; supports --shapes CSV input for targeted coverage augmentation.
  • Ships compressed LightGBM models (.lgbm.gz) and feature_spec.json for gfx90a (101 features), gfx942 (101 features), and gfx950 (72 features).
  • Adds IntegrationGpuRockeConvFwdFp16 integration test with 4 smoke shapes (3×3, 1×1, strided, rectangular).
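
The top-K query that ConvFwdPlanBuilder makes against ConvMLHeuristic can be pictured as ranking candidates by predicted TFLOPS and keeping the best few. The following is a minimal sketch of that ranking step only; `topKByScore` is a hypothetical helper, not the PR's API, and it assumes the predictions are already available as a plain vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the indices of the k highest-scoring candidates, best first.
// If k exceeds the number of candidates, all indices are returned.
std::vector<std::size_t> topKByScore(const std::vector<double>& predicted_tflops,
                                     std::size_t k) {
    std::vector<std::size_t> idx(predicted_tflops.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    k = std::min(k, idx.size());
    // Only the first k positions need to be ordered.
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k),
                      idx.end(), [&](std::size_t a, std::size_t b) {
                          return predicted_tflops[a] > predicted_tflops[b];
                      });
    idx.resize(k);
    return idx;
}
```

Keeping more than one candidate gives the builder something to fall back to if the first choice fails to compile or launch, which is presumably why the PR speaks of top-K rather than a single winner.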

cderb and others added 3 commits June 27, 2026 01:58
… compilation

Adds a new hipDNN engine plugin for grouped convolution forward pass using the
rocKE implicit GEMM framework. The engine selects tile configurations via a
LightGBM ML heuristic and compiles kernels at plan-build time by lowering rocKE
IR to HSACO.

## C++ engine (src/engines/rocke_conv_engine/)

- RockeConvEngine: registers the engine with the plugin, handles isApplicable
  checks (fp16, rank-4, gfx942/gfx950/gfx90a, model file present).
- ConvFwdPlanBuilder: extracts the conv problem from the hipDNN op-graph
  (NCHW logical dim order), enumerates tile candidates, scores them with the
  ML heuristic, lowers the winning spec to LLVM IR via rocKE, patches the IR
  for LLVM 23 / ROCm 7.14 compatibility, and compiles to HSACO via direct
  `clang -x ir` invocation.
- ConvFwdPlan: stores the compiled HIP module/function and kernel launch params;
  executes by binding tensor pointers from the workspace map and dispatching
  hipModuleLaunchKernel.

Key implementation details:
- gfx942 uses warp_tile_k=8 (32x32x8 MFMA atom for f16); gfx950/gfx90a use 16.
- LightGBM symbols declared weak so the plugin loads without liblgbm.so at link
  time; falls back to first valid tile config if the model is not loaded.
- IR patching (patchMakeBufferRsrc): normalises the llvm.amdgcn.make.buffer.rsrc
  intrinsic across rocKE LLVM20/22 output flavors to the form accepted by the
  ROCm 7.14 container clang 23 build (.p8.p1, i64 num_records, no parameter
  attributes). Injects zext instructions in the kernel entry block to widen
  i32 byte-count params to i64 at call sites without breaking the kernel ABI.
- hipRTC/comgr bypassed entirely: comgr's internal IR auto-upgrade pass mangles
  ptr addrspace(N) intrinsic arguments, causing verifier failures. Direct clang
  invocation avoids this.
- Tensor dims read in NCHW logical order ([N,C,Hi,Wi] / [K,C,Y,X]) as required
  by the hipDNN frontend, with NHWC-contiguous strides set separately.
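
The last point (NCHW logical dims with NHWC-contiguous strides) is easy to get wrong, so here is a small sketch of the arithmetic. The function names are illustrative and not taken from the PR. The byte-size helper uses 64-bit math, in line with the "int64 byte-size computation" mentioned in a later commit of this PR.

```cpp
#include <array>
#include <cstdint>

// dims are in logical [N, C, H, W] order; the returned strides are in the same
// order but describe an NHWC-contiguous buffer (C is the fastest-varying axis).
std::array<std::int64_t, 4> nhwcStridesForNchwDims(
    const std::array<std::int64_t, 4>& dims) {
    const std::int64_t C = dims[1];
    const std::int64_t H = dims[2];
    const std::int64_t W = dims[3];
    return {H * W * C, 1, W * C, C};
}

// Total buffer size in bytes, computed in 64 bits so large tensors do not
// overflow a 32-bit intermediate.
std::int64_t tensorBytes(const std::array<std::int64_t, 4>& dims,
                         std::int64_t elem_size) {
    return dims[0] * dims[1] * dims[2] * dims[3] * elem_size;
}
```

For an fp16 input with dims [2, 64, 56, 56] this gives strides [200704, 1, 3584, 64] and 802816 bytes.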

## ML heuristic (rocKE/Cpp/include/rocke/conv_ml_heuristic.h)

New C++ header wrapping the LGBM C API for conv tile-config scoring. Declares
LGBM symbols as weak externals so the engine plugin loads in environments where
liblgbm.so is absent.
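
A minimal sketch of the weak-symbol pattern described above, assuming a GCC/Clang toolchain on an ELF platform. `LGBM_BoosterCreateFromModelfile` is a real LightGBM C API entry point (its `BoosterHandle` parameter is a `void*`); `lightgbmAvailable` and `selectConfig` are hypothetical helpers written for this example and are not the PR's code.

```cpp
#include <cstddef>
#include <vector>

// Weak declaration: if no loaded library defines the symbol, its address is
// null at runtime instead of causing an unresolved-symbol load failure.
extern "C" int LGBM_BoosterCreateFromModelfile(const char* filename,
                                               int* out_num_iterations,
                                               void** out)
    __attribute__((weak));

// True only when a LightGBM library providing the symbol is actually loaded.
bool lightgbmAvailable() {
    return &LGBM_BoosterCreateFromModelfile != nullptr;
}

// Picks a tile config index. With a loaded model, takes the valid config with
// the highest score; otherwise falls back to the first valid config.
// Returns -1 if no config is valid.
int selectConfig(const std::vector<bool>& valid,
                 const std::vector<double>& scores, bool model_loaded) {
    int best = -1;
    for (std::size_t i = 0; i < valid.size(); ++i) {
        if (!valid[i]) continue;
        if (!model_loaded) return static_cast<int>(i);
        if (best < 0 || scores[i] > scores[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The null check has to happen before any call through the weak symbol; calling it unconditionally would crash in exactly the environments the weak declaration is meant to support.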

## Python heuristics (rocKE/Python/rocke/heuristics/)

- feature_engine_grouped_conv.py: extended feature set (101 → 107 features)
  including log-space geometric features, CU occupancy ratios, and L2/memory
  pressure proxies.
- gen_conv_sweep_data.py: sweep data generator for grouped conv; runs inside
  Enroot containers on Slurm GPU nodes.
- augment_coverage_conv.py: targeted OOF-driven shape generator to fill coverage
  gaps in the training distribution.
- generate_coverage_conv.py: coverage analysis and shape generation utilities.
- sample_shapes_conv.py: random shape sampler with architecture-aware filtering.
- validate_ml_vs_oracle_conv.py: ML vs oracle comparison for conv predictions.
- train.py: updated to write feature_spec.json alongside trained models.

## Trained models (rocKE/Python/rocke/heuristics/models/)

Initial model checkpoints for gfx942, gfx950, gfx90a (grouped conv fwd fp16):
- model_tflops.lgbm.gz: compressed LightGBM booster (gunzip before use)
- feature_spec.json: ordered feature name list consumed by the C++ heuristic
- train_manifest.json: training provenance metadata

## CMake wiring

- ENABLE_ROCKE_CONV_ENGINE option (default OFF) guards the new engine.
- rocke_core (the rocKE C library) built as a subproject; linked into
  hip_kernel_provider_impl via TARGET_OBJECTS propagation.
- Integration test target hip_kernel_provider_integration_tests extended with
  four conv forward smoke tests (3x3_small, 1x1_pointwise, 3x3_stride2,
  3x3_rect_spatial) covering correctness on gfx942.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Add per-candidate process isolation to the conv candidate sweep via
fork()+exec() of a helper binary (rocke_kern_time), eliminating GPU
context poisoning from hung kernels. Pre-launch validation via
hipFuncGetAttribute catches resource-limit failures before launching.
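
The isolation idea can be sketched with plain POSIX calls: run each candidate in its own child process and kill it if it exceeds a deadline, so a hung kernel cannot take the parent's GPU context down with it. This is an illustration only. `runIsolated` and `RunStatus` are made-up names, and the loop polls with `waitpid(WNOHANG)` for simplicity, whereas a later commit in this PR describes a blocking `waitpid` combined with `alarm`.

```cpp
#include <chrono>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class RunStatus { Ok, Failed, TimedOut };

// Runs argv[0] (looked up on PATH) in a child process. If the child has not
// exited by the deadline it is killed with SIGKILL and reaped.
RunStatus runIsolated(char* const argv[], std::chrono::milliseconds timeout) {
    const pid_t pid = fork();
    if (pid < 0) return RunStatus::Failed;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  // only reached if exec itself failed
    }
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int status = 0;
    for (;;) {
        const pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) break;                    // child finished on its own
        if (r < 0) return RunStatus::Failed;    // waitpid error
        if (std::chrono::steady_clock::now() >= deadline) {
            kill(pid, SIGKILL);
            waitpid(pid, &status, 0);           // reap to avoid a zombie
            return RunStatus::TimedOut;
        }
        usleep(5000);                           // 5 ms between polls
    }
    const bool clean = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    return clean ? RunStatus::Ok : RunStatus::Failed;
}
```

In the sweep described by the PR, the child would be the `rocke_kern_time` helper with a 5 s timeout; the sketch leaves the command and the timeout to the caller.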

Replace gen_conv_sweep_data.py (slow Python-only sweep) with a C++
sweep wrapper that maintains the same CLI interface (--shapes, --shape-set,
--arch, --max-shapes) and generate() API for gen_sweep_data.py dispatch.

Conv engine integration: hipRTC linker API replaces popen(clang),
ConvHwProfile uses live hipDeviceProp_t, int64 byte-size computation.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…conv-models

# Conflicts:
#	projects/hipdnn/data_sdk/include/hipdnn_data_sdk/utilities/EngineNames.hpp
therock-pr-bot (Bot) commented Jun 30, 2026

❌ PR Check — Action Required

| Check | Status | Details |
|---|---|---|
| 🌿 Branch Name | ✅ Pass | |
| 📝 PR Title/Description | ✅ Pass | |
| Forbidden Files | ✅ Pass | |
| 🧪 Unit Test | ✅ Pass | |
| 🔎 pre-commit | ❌ Fail | Error: Check concluded with failure. |
| 🚫 Draft PR | 🔜 To Be Enabled | |
| 🚩 Feature Flag | 🔜 To Be Enabled | |
| 📊 Code Coverage | 🔜 To Be Enabled | |

⚠️ 1 policy check(s) failed. Please address the issues above before this PR can be reviewed.

🚫 Please fix the failed policies

  • ❌ pre-commit

The Not ready to Review label was added to this PR. Once all policies pass, the label is removed automatically.

📖 Need help? See the Policy FAQ for details on every check and how to fix failures.

therock-pr-bot (Bot) commented Jun 30, 2026

🎉 All checks passed! This PR is ready for review.

…spatcher tests

Run black 25.12.0 and clang-format 18 on all changed files to satisfy
pre-commit. Add gfx90a unit tests to test_conv.py for dispatcher coverage.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Add trailing newlines to model JSON files and remove extra blank line
in rocke_conv_engine/CMakeLists.txt.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@cderb cderb requested review from bartekxk and yraparti June 30, 2026 21:14
@cderb cderb marked this pull request as ready for review June 30, 2026 21:24
@cderb cderb requested review from a team as code owners June 30, 2026 21:24
…istics

- Add missing stride_w/pad_w to validation results (validate_ml_vs_oracle_conv)
- Convert sweep latency_us to latency_ms to match training pipeline (gen_conv_sweep_data)
- Cache ConvMLHeuristic across buildPlan calls to avoid reloading model from disk
- Remove dead depthwise coverage section (C/G=1 fails MFMA alignment constraint)
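
The caching change in the list above (keeping ConvMLHeuristic alive across buildPlan calls) might look roughly like the sketch below. `FakeHeuristic`, `g_load_count`, and `getHeuristic` are placeholders invented for this example; the real class loads a LightGBM model and feature spec from disk, which is the cost the cache avoids.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Counts how many times a "model load" happens, for illustration only.
int g_load_count = 0;

// Stand-in for the real heuristic; construction represents the expensive load.
struct FakeHeuristic {
    explicit FakeHeuristic(std::string path) : model_path(std::move(path)) {
        ++g_load_count;
    }
    std::string model_path;
};

// Returns a shared heuristic per model path, constructing it on first use.
// The mutex makes concurrent plan builds safe.
std::shared_ptr<const FakeHeuristic> getHeuristic(const std::string& model_path) {
    static std::mutex mu;
    static std::map<std::string, std::shared_ptr<const FakeHeuristic>> cache;
    std::lock_guard<std::mutex> lock(mu);
    auto it = cache.find(model_path);
    if (it == cache.end()) {
        it = cache.emplace(model_path,
                           std::make_shared<const FakeHeuristic>(model_path))
                 .first;
    }
    return it->second;
}
```

Keying the cache by model path keeps one instance per architecture-specific model, so repeated plan builds for the same device reuse a single loaded booster.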

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@therock-pr-bot


Pre-commit check failed

pre-commit failed

Please run locally:

  • python -m pip install pre-commit
  • pre-commit install
  • pre-commit run --all-files --show-diff-on-failure

This repo uses .pre-commit-config.yaml.

@BradPepersAMD (Contributor)

This work is very cool and usefully shows a bunch of the pieces that we can build on and learn from. The base library is still in a lot of flux, and things are likely to move around in ways that break this PR. We are also trying to focus on SDPA, so we may delay landing convs until we are sure how we want to organize the SDPA work. But all the pieces of the conv work are coming together nicely here, and we will land some version of this in the coming weeks!

cderb and others added 2 commits June 30, 2026 18:55
…ic tooling

Simplify ConvFwdParams to only fields needed at execute time (UIDs, byte
sizes, grid/block, kernel name), pre-compute byte sizes at build time.
Eliminate two heap allocations per predict_tflops call via pre-allocated
member buffers. Replace mkstemp with memfd_create and busy-wait with
blocking waitpid+alarm in sweep tool. Consolidate duplicated Python
helpers (HEADER, SHAPE_COLS, bucket functions, write_csv) into canonical
locations.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…n changes

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@codecov-commenter


Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (76.92%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #8982   +/-   ##
========================================
  Coverage    71.33%   71.33%           
========================================
  Files         2628     2628           
  Lines       413043   413043           
  Branches     61875    61875           
========================================
+ Hits        294613   294617    +4     
+ Misses       96656    96653    -3     
+ Partials     21774    21773    -1     
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| TensileLite | 76.65% <ø> (ø) | Carriedforward from 80f384f |
| hipBLAS | 90.81% <ø> (ø) | Carriedforward from 80f384f |
| hipBLASLt | 41.35% <ø> (ø) | Carriedforward from 80f384f |
| hipCUB | 82.68% <ø> (ø) | Carriedforward from 80f384f |
| hipDNN | 85.92% <ø> (+0.01%) ⬆️ | |
| hipFFT | 50.17% <ø> (ø) | Carriedforward from 80f384f |
| hipRAND | 76.12% <ø> (ø) | Carriedforward from 80f384f |
| hipSOLVER | 69.18% <ø> (ø) | Carriedforward from 80f384f |
| hipSPARSE | 86.55% <ø> (ø) | Carriedforward from 80f384f |
| rocBLAS | 48.06% <ø> (ø) | Carriedforward from 80f384f |
| rocFFT | 46.30% <ø> (ø) | Carriedforward from 80f384f |
| rocRAND | 57.07% <ø> (ø) | Carriedforward from 80f384f |
| rocSOLVER | 76.92% <ø> (ø) | Carriedforward from 80f384f |
| rocSPARSE | 72.37% <ø> (ø) | Carriedforward from 80f384f |
| rocThrust | 91.36% <ø> (ø) | Carriedforward from 80f384f |

*This pull request uses carry forward flags. Click here to find out more.

Files with missing lines Coverage Δ
.../include/hipdnn_data_sdk/utilities/EngineNames.hpp 96.23% <ø> (ø)

... and 3 files with indirect coverage changes


cderb and others added 2 commits July 1, 2026 13:07
…conv-models

# Conflicts:
#	dnn-providers/hip-kernel-provider/rocke/platform/Cpp/include/rocke/conv_ml_heuristic.h
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/augment_coverage_conv.py
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/gen_conv_sweep_data.py
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/gen_sweep_data.py
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/generate_coverage_conv.py
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx90a/feature_spec.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx90a/model_tflops.lgbm.gz
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx90a/train_manifest.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx942/feature_spec.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx942/model_tflops.lgbm.gz
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx942/train_manifest.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx950/feature_spec.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx950/model_tflops.lgbm.gz
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/models/grouped_conv_forward_fp16_gfx950/train_manifest.json
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/sample_shapes_conv.py
#	dnn-providers/hip-kernel-provider/rocke/platform/Python/rocke/heuristics/validate_ml_vs_oracle_conv.py
…alid refs

all_buf_{kConvFeatureCount} used brace-init which selects the
initializer_list<double> ctor, creating a 1-element vector (value 109.0)
instead of a 109-element buffer — heap overflow on every predict_tflops
call. Use explicit construction instead.
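
The pitfall is standard C++ behaviour and can be reproduced in isolation. The sketch below uses stand-in names (`kFeatureCount`, `BuggyBuffers`, `FixedBuffers`) rather than the PR's actual class; only the value 109 is taken from the commit message.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kFeatureCount = 109;  // value from the commit message

struct BuggyBuffers {
    // Braces select the initializer_list<double> constructor:
    // this is ONE element holding 109.0, not 109 elements.
    std::vector<double> all_buf{kFeatureCount};
};

struct FixedBuffers {
    // Parentheses select the size constructor: 109 zero-initialised elements.
    std::vector<double> all_buf = std::vector<double>(kFeatureCount);
};
```

Writing `kFeatureCount` doubles into the buggy buffer runs past its single element, which matches the heap overflow the commit describes. The conversion compiles without a narrowing error because the source is a constant expression that is exactly representable as a `double`.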

Also fix two _valid() call sites in generate_coverage_conv.py missed
during the _valid → conv_shape_valid rename.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>