
[BUG FIX] Fix non-deterministic simulation on GPU (cont'd). #2907

Open
duburcqa wants to merge 4 commits into Genesis-Embodied-AI:main from duburcqa:fix_gpu_contact_determinism

Conversation

Collaborator

@duburcqa duburcqa commented Jun 6, 2026

Description

Follow-up to #2898. GPU simulation was still not reproducible from run to run because two contact-pipeline stages consumed contacts in physical-slot order, which is assigned by racing atomic_add calls and therefore varies between runs:

  • the contact sort keyed on a group key read from a physical-order scan (group_key = x of whichever contact was physically first in a geom-pair run), so the logical order was racy;
  • contact pruning ran its per-bucket hull on the racy within-bucket order through a non-transitive (u, v) tolerance sort, so the kept-contact set (and count) varied run-to-run.
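To illustrate why a tolerance-based sort is sensitive to its input order, here is a minimal Python sketch. The comparator `tol_cmp`, the tolerance value, and the sample points are assumptions made up for this example; they are not the actual (u, v) comparison in contact.py.

```python
def tol_cmp(a, b, tol=0.1):
    """Three-way compare of two (u, v) pairs (hypothetical comparator).

    Coordinates closer than `tol` count as tied, and the next coordinate
    then decides.
    """
    for x, y in zip(a, b):
        if abs(x - y) > tol:
            return -1 if x < y else 1
    return 0


# Three illustrative points: u values are pairwise "close" for neighbours only.
a, b, c = (0.00, 2.0), (0.08, 1.0), (0.16, 0.0)

a_vs_b = tol_cmp(a, b)  # u ties, v decides: a sorts after b
b_vs_c = tol_cmp(b, c)  # u ties, v decides: b sorts after c
a_vs_c = tol_cmp(a, c)  # u differs by more than tol: a sorts before c

# a > b and b > c, yet a < c: the relation is not transitive.
is_cyclic = (a_vs_b, b_vs_c, a_vs_c) == (1, 1, -1)
```

Because the relation is not a consistent ordering, a comparison sort driven by it can return different results for different input permutations, which is how a racy input order can turn into a racy output.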

Fix in contact.py (both the coop and serial func_clamp_prune_and_sort_contacts* kernels):

  • the sort key is now a total order on each contact's own position, (pos_x, geom_a, geom_b, pos_y, pos_z);
  • each prune bucket is deterministically pre-sorted by position before the coplanarity and hull stages, so the existing (tuned) (u, v) sort and monotone chain receive deterministic input.
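The idea behind both bullets can be sketched in plain Python: a key built only from each contact's own data gives the same order whatever slot order the contacts arrived in. The `Contact` record and the helper names below are hypothetical; only the key tuple `(pos_x, geom_a, geom_b, pos_y, pos_z)` comes from the description above, and the real kernels are not Python list sorts.

```python
import itertools
from typing import NamedTuple


class Contact(NamedTuple):
    # Hypothetical record for illustration only.
    pos_x: float
    pos_y: float
    pos_z: float
    geom_a: int
    geom_b: int


def contact_sort_key(c: Contact):
    # Depends only on the contact itself, never on its physical slot index.
    return (c.pos_x, c.geom_a, c.geom_b, c.pos_y, c.pos_z)


def canonical_order(contacts):
    # Any arrival permutation maps to the same output sequence.
    return sorted(contacts, key=contact_sort_key)


contacts = [
    Contact(0.10, 0.0, 0.5, 1, 2),
    Contact(0.10, 0.2, 0.5, 1, 2),
    Contact(-0.30, 0.1, 0.5, 0, 3),
    Contact(0.10, 0.0, 0.5, 0, 3),
]

# Simulate every possible racy slot order and collect the distinct results.
orders = {tuple(canonical_order(p)) for p in itertools.permutations(contacts)}
assert len(orders) == 1
```

Feeding this canonical order into an order-sensitive later stage (such as a tolerance sort) makes that stage's output reproducible without touching its tolerances.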

Tuned tolerances and hull math are unchanged. CPU is serial/deterministic and unaffected. The perf-dispatch autotuner (monolith vs decomposed under prefer_decomposed_solver == -1) is a separate, timing-based source of non-determinism; it is already pinned per backend in the test harness and is out of scope here.

How Has This Been Tested?

New GPU-only test, tests/test_rigid_determinism.py. It spawns independent processes (in-process resets cannot observe cross-process races) on the authored-decomposition tower, parametrized over solve variant (monolith/decomposed) x pruning (on/off). Per-step fingerprints are compared in pipeline order, so the first mismatch names the diverging stage (contact set -> narrowphase/pruning, order -> sort, velocity -> solve). The test fails on main and passes with this fix. Tower step time is within measurement noise (<0.2%); no regression.
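For reviewers, the pipeline-order comparison can be pictured with the following rough sketch. It is not the code in tests/test_rigid_determinism.py: the stage names, `fingerprint`, `step_fingerprints`, and `first_divergence` are hypothetical, and the real test gathers its data from separate processes rather than from in-memory lists.

```python
import hashlib
import struct

# Hypothetical stage names, checked in pipeline order.
STAGES = ("contact_set", "contact_order", "velocity")


def fingerprint(rows):
    # Hash the exact bit patterns of the floats in each row.
    h = hashlib.sha256()
    for row in rows:
        h.update(struct.pack(f"<{len(row)}d", *row))
    return h.hexdigest()


def step_fingerprints(contacts, velocities):
    return {
        "contact_set": fingerprint(sorted(contacts)),  # order-independent
        "contact_order": fingerprint(contacts),        # order-sensitive
        "velocity": fingerprint([velocities]),
    }


def first_divergence(run_a, run_b):
    """Return (step, stage) of the first mismatch, or None if runs agree."""
    for step, (fa, fb) in enumerate(zip(run_a, run_b)):
        for stage in STAGES:
            if fa[stage] != fb[stage]:
                return step, stage
    return None


c1, c2 = (0.0, 0.0, 0.1), (0.5, 0.0, 0.1)
run_a = [step_fingerprints([c1, c2], (0.0, -9.8)),
         step_fingerprints([c1, c2], (0.0, -9.7))]
# Second run: same contact set at step 1, but in a different order.
run_b = [step_fingerprints([c1, c2], (0.0, -9.8)),
         step_fingerprints([c2, c1], (0.0, -9.7))]

divergence = first_divergence(run_a, run_b)
```

Here `divergence` is `(1, "contact_order")`: the contact set still matches at step 1, so the mismatch points at the sort rather than at narrowphase/pruning or the solve.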

Checklist:

  • I tagged the title correctly (BUG FIX)
  • I tested my changes and added instructions on how to test it for reviewers.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@duburcqa duburcqa requested a review from YilingQiao as a code owner June 6, 2026 09:14
@duburcqa duburcqa force-pushed the fix_gpu_contact_determinism branch 3 times, most recently from 4ab7886 to 945276c Compare June 6, 2026 09:34
@duburcqa duburcqa force-pushed the fix_gpu_contact_determinism branch from 945276c to 8154a00 Compare June 6, 2026 09:39
@duburcqa duburcqa force-pushed the fix_gpu_contact_determinism branch from 51c4132 to 4944633 Compare June 6, 2026 22:24
@duburcqa duburcqa force-pushed the fix_gpu_contact_determinism branch from 4944633 to dada1c9 Compare June 7, 2026 05:33
