Skip to content

Fix hanging core-relations tests. #28

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 7, 2025
Merged

Fix hanging core-relations tests. #28

merged 5 commits into from
May 7, 2025

Conversation

ezrosent
Copy link
Contributor

@ezrosent ezrosent commented May 6, 2025

Described this more in slack. There were two issues I found when looking through backtraces of hanging tests:

  1. Concurrent calls into the allocator during SortedWritesTable::merge were causing things to slow way down. To fix this, we add MiMalloc. We may want to gate this behind specific platforms in a future change.
  2. We were accidentally holding onto a dashmap lock during the index cache lookup code. This may have caused a deadlock (rayon calls itself recursively, but the index refresh code that takes up most of the accidental critical section actually has its own thread pool), but at the very least it probably forced a lot of parallel code to convoy in a very slow way.

Thanks to @yihozhang for finding the connection with the index cache.

@ezrosent ezrosent requested a review from yihozhang May 6, 2025 05:40
@ezrosent ezrosent marked this pull request as draft May 6, 2025 05:54
@ezrosent
Copy link
Contributor Author

ezrosent commented May 6, 2025

Seems like you can get another issue:

  • Index not cached before
  • Start FJ, look up new index, start to build it in parallel, this causes us to spawn a ton of tasks in the thread pool then do other tasks while waiting for those to finish.
  • One of those tasks may be continuing the FJ. potentially by looking up the same index. Index lock is held? Deadlock

Or something along those lines, at least. That one happens more rarely but will be trickier to fix.

@ezrosent ezrosent marked this pull request as ready for review May 6, 2025 06:04
@ezrosent ezrosent marked this pull request as draft May 6, 2025 06:20
@ezrosent
Copy link
Contributor Author

ezrosent commented May 6, 2025

b427c2a doesn't do the right thing. It actually runs most of the stuff on the parent thread pool, so the problem is still there.

@ezrosent ezrosent marked this pull request as ready for review May 6, 2025 06:25
Copy link
Contributor

@yihozhang yihozhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! This looks good

Do we still need the mimalloc allocator for tests?

@ezrosent
Copy link
Contributor Author

ezrosent commented May 7, 2025

I think I'll keep it in for now. Unfortunately we still get some weird excursions without it. I was able to get this after ~8 runs on a non-release build:

PASS [ 14.141s] core-relations tests::ac_fj_puresize

Normally takes 166ms.

@ezrosent ezrosent merged commit 4f52312 into main May 7, 2025
@ezrosent ezrosent deleted the fix-deadlock branch May 7, 2025 05:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants