[HIPDNN] Tolerance lookup followup by bghimireamd · Pull Request #8752 · ROCm/rocm-libraries

bghimireamd · 2026-06-24T05:00:11Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Harness unification (ALMIOPEN-1969 follow-up): - Share input synthesis between both harnesses: move SynthesisTracker and SynthesizeInputs to harness/input_init/ (namespace hipdnn_integration_tests). The non-golden harness now drives its initializeBundle() through the same synthesis switch, with a random [-1,1] fallback. Remove 4 of 7 non-golden initializeBundle overrides whose ranges/seeds now match the shared fill functions; 3 remain (fused-graph range conflicts and a large-values stress test). - Rename harness/golden/ -> harness/bundle/ and namespace hipdnn_integration_tests::golden -> ::bundle. "golden" was inaccurate: the harness verifies bundles via golden data OR a GPU/CPU reference, and most bundles ship no golden output. Rename the class IntegrationGraphGoldenReferenceVerificationHarness -> IntegrationBundleVerificationHarness (parallel to its sibling IntegrationGraphVerificationHarness). Fix stale "golden" log/report strings that mislabeled all bundles or shared comparison paths. The golden verification *mode* and golden output *data* vocabulary, and the --golden-data-dir CLI flag, are retained (accurate / external). - Drop redundant hipInit/hipGetDevice/_deviceId from the graph harness: HIP initializes lazily and getSharedHandle()->hipdnnCreate() already does it before any graph runs; _deviceId was write-only. - Split bundle metadata guards: the VRAM check applies to every bundle the engine runs, but the GPU-arch lock only matters when comparing against golden output values (arch-specific). Gate the arch check on hasGoldenOutputs so an inputs-only bundle verified against a local reference is not wrongly skipped on a different arch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Both harnesses resolved comparison tolerance with duplicated, divergent code. Replace both with one shared resolver (harness/tolerance/ToleranceResolver.hpp) where the aggregation policy is a plain function (GraphWrapper, dtype) -> float, selectable per caller. What changed: - New resolveTolerance(wrapper, dtype, testName, atol, rtol, policy=max): derives a default via the chosen aggregation policy, then applies the TOML per-test override (highest priority) in ONE place. Single tolerance entry point for both harnesses; the override is no longer applied separately in either, so the layering order lives here alone. - Two policies, each a free function (no enum/switch; the policy IS the function, C++17 function pointer): * maxAcrossNodes — loosest per-node tolerance; conservative envelope, never tighter than any node so it cannot cause a false failure. * outputOpTolerance — tolerance of the last non-Pointwise op (the output-producing op); reproduces the graph harness's historical getTolerance() behavior. - Wiring preserves existing behavior on both sides: * Graph harness getTolerance() -> outputOpTolerance (its prior policy), now expressed through the shared resolver instead of a private dynamic_cast switch. It also serializes via to_binary() and reads the output dtype from the flatbuffer, so it shares the resolver keyed on the same representation the bundle harness uses. * Bundle harness -> maxAcrossNodes (its prior policy), unchanged. The two policies agree for the common one-real-op + activation case (activation is Pointwise -> skipped; the single real op is both loosest and last), so they differ only on multi-real-op fusions, which carry explicit overrides today. - Deletes the duplicate per-op tolerance switch that existed in each harness (one on dynamic_cast of live nodes, one on the flatbuffer NodeAttributes enum); both now share one flatbuffer-keyed lookup. - warnIfMultipleOutputs() guards the single-output assumption: every current policy reduces over the whole graph, not the per-output subgraph, so multi-output graphs are flagged loudly (deferred fix, ALMIOPEN-2216). Why output-op for the graph harness (not max): a mechanical migration of the C++ graph tests should not change their tolerances. outputOpTolerance keeps the exact prior numbers. Note this is a heuristic, not a principled tight bound (it attributes the whole output tolerance to one op); it is kept for migration parity. The principled tighten path is the future DynamicTolerances upgrade (see TODO in ToleranceResolver.hpp and ALMIOPEN-2216); FP4 also needs sub-bf16 entries in the dtype switch (currently falls through to 1e-3). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bghimireamd and others added 2 commits June 23, 2026 22:04

bghimireamd changed the base branch from develop to users/bghimire/ALMIOPEN-1969/tolerance_lookup June 24, 2026 05:00

github-actions Bot added the project: hipdnn-integration-tests label Jun 24, 2026

assistant-librarian Bot added the organization: ROCm label Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[HIPDNN] Tolerance lookup followup#8752

[HIPDNN] Tolerance lookup followup#8752
bghimireamd wants to merge 2 commits into
users/bghimire/ALMIOPEN-1969/tolerance_lookupfrom
users/bghimire/ALMIOPEN-1969/tolerance_lookup_followup

bghimireamd commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

bghimireamd commented Jun 24, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant