
fix the race condition in lu factorization #1850

Open

yhmtsai wants to merge 1 commit into develop from fix_factorization_race

Conversation

yhmtsai (Member) commented May 23, 2025

This PR fixes a race condition in the LU factorization.

We need a sync before reading vals[lower_nz], because the warp can modify that entry in the previous iteration.
Only a warp-level sync is needed. If I understand it correctly, wait(dep) already implies one, so I simply moved the load to after wait(dep).
After the move, we need another sync before assigning to scale, to ensure every thread has read the data before it is modified.
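
For reference, here is a minimal sketch of the pattern being described, written with standard CUDA cooperative groups rather than the actual Ginkgo kernel code; the names vals, lower_nz, diag_idx, num_deps, and the omitted wait() are placeholders modeled on the discussion in this PR, and only the placement of the two warp syncs matters:

```cuda
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Sketch of the per-row elimination loop (placeholder names, not the actual
// Ginkgo kernel). One warp handles one row; each outer-loop iteration reads
// entries of `vals` that the same warp may have written in the previous
// iteration, which is the race this PR fixes.
__device__ void lu_row_update_sketch(double* vals, const int* lower_nzs,
                                     const int* diag_idxs, int num_deps)
{
    const auto warp = cg::tiled_partition<32>(cg::this_thread_block());

    // one iteration per dependency row
    for (int i = 0; i < num_deps; i++) {
        // wait(dep) would go here; it implies (at least) a warp-level sync,
        // which is why loading vals[lower_nz] *after* it is sufficient.
        warp.sync();

        // Safe to read now: writes from the previous iteration of this warp
        // are visible to all of its threads.
        const auto lower_nz = lower_nzs[i];
        const auto val = vals[lower_nz];
        const auto diag = vals[diag_idxs[i]];

        // Second sync: every thread must have read val/diag before any
        // thread overwrites one of those locations when storing the scale.
        warp.sync();
        vals[lower_nz] = val / diag;  // the "assigning to scale" step

        // ... the warp then eliminates the rest of its row with this scale;
        // those writes are what the sync at the top of the next iteration
        // protects against.
    }
}
```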

@yhmtsai yhmtsai added this to the Ginkgo 1.10.0 milestone May 23, 2025
@yhmtsai yhmtsai requested review from upsj and a team May 23, 2025 16:13
@yhmtsai yhmtsai self-assigned this May 23, 2025
@yhmtsai yhmtsai added the 1:ST:ready-for-review (This PR is ready for review) and is:bugfix (This fixes a bug) labels May 23, 2025
@ginkgo-bot ginkgo-bot added the mod:cuda (This is related to the CUDA module.), mod:hip (This is related to the HIP module.) and type:factorization (This is related to the Factorizations) labels May 23, 2025
@yhmtsai yhmtsai force-pushed the fix_factorization_race branch from 4f684b5 to 84c1bbe May 23, 2025 16:42
Comment on lines +119 to +122
// We need to load vals after the synchronization.
// The entry at the next lower_nz might be modified if the dep row has the
// same col as the next lower_nz's col.
const auto val = vals[lower_nz];
Member
Not sure I follow - each warp only modifies memory locations belonging to its row, so there are only data races between threads of the same warp. So is this reordering actually necessary?

upsj (Member) commented May 25, 2025

What you essentially want to do is add another warp sync at the end of each iteration of the outer loop, to prevent the modifications from previous iterations racing with the reads in the following iterations? I would prefer having that sync happen explicitly at the end of the loop rather than hidden inside the scheduler wait function. I don't think warp syncs should be costly.
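
For illustration only, the suggested placement would look roughly like this in terms of the sketch in the PR description (same placeholder names; a sketch, not the proposed diff or the actual kernel):

```cuda
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Same placeholder loop as the sketch above, but with the explicit warp sync
// at the end of every outer-loop iteration instead of before the load.
__device__ void lu_row_update_sketch_explicit(double* vals,
                                              const int* lower_nzs,
                                              const int* diag_idxs,
                                              int num_deps)
{
    const auto warp = cg::tiled_partition<32>(cg::this_thread_block());

    for (int i = 0; i < num_deps; i++) {
        // wait(dep) would go here.
        const auto lower_nz = lower_nzs[i];
        const auto val = vals[lower_nz];
        const auto diag = vals[diag_idxs[i]];
        warp.sync();  // all reads done before the write below
        vals[lower_nz] = val / diag;
        // ... eliminate the rest of the row ...

        // Explicit sync at the end of the iteration: the row updates above
        // can no longer race with the loads at the top of the next iteration.
        warp.sync();
    }
}
```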

yhmtsai (Member, Author)

Exactly. It is to ensure that the modifications by the same warp are visible to the other threads when they read the entry.
Yes, that is what I mean in the PR description: we only need a warp sync.
I thought wait(dep) should also imply that, since it makes the memory visible at least within the block, so I just moved the load rather than introducing a separate warp sync.

const auto diag = vals[diag_idx];
// We need a sync to ensure every thread gets the data before assigning to
// scale.
warp.sync();
Member

makes sense, good catch!

@yhmtsai yhmtsai requested a review from a team May 28, 2025 16:07
Labels
1:ST:ready-for-review (This PR is ready for review)
is:bugfix (This fixes a bug)
mod:cuda (This is related to the CUDA module.)
mod:hip (This is related to the HIP module.)
type:factorization (This is related to the Factorizations)
3 participants