Service crash during 'field goal' tests with 30+ schedulers in the 1-million-node tests #84
Comments
Cherry-picked the following two commits from PR #85; when testing 36 schedulers, the error is reproduced:
commit ebf3b7190710301d71da024147b12924b585f87a (HEAD -> Carl_Test_By_PR81)
commit 38f121177aac9363cd7a35d6c722cfb5251df0b8
PR #85 is merged into the main branch; tested 36 schedulers again.
The same issue is still there.
The checkpoints map in the event type, in the event.go file, is missing an R/W lock. This map is used by the aggregator, distributor, and service routines, so concurrent access to it needs to be synchronized; a sketch of such a guard is shown below.
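The following is a minimal sketch of an RWMutex-guarded checkpoints map. The Event type, its field names, and its methods are illustrative placeholders under the assumption that checkpoints is keyed by name; they are not the actual definitions in arktos' event.go.

```go
package event

import (
	"sync"
	"time"
)

// Event is a simplified stand-in for the event type discussed above.
// Its checkpoints map is read and written from multiple goroutines
// (aggregator, distributor, service), so access is guarded by an RWMutex.
type Event struct {
	mu          sync.RWMutex
	checkpoints map[string]time.Time
}

// NewEvent initializes the map so callers never write to a nil map.
func NewEvent() *Event {
	return &Event{checkpoints: make(map[string]time.Time)}
}

// SetCheckpoint records a checkpoint under the write lock.
func (e *Event) SetCheckpoint(name string, t time.Time) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.checkpoints[name] = t
}

// GetCheckpoint reads a checkpoint under the read lock, which allows
// concurrent readers while excluding writers.
func (e *Event) GetCheckpoint(name string) (time.Time, bool) {
	e.mu.RLock()
	defer e.mu.RUnlock()
	t, ok := e.checkpoints[name]
	return t, ok
}
```

A sync.Map is an alternative, but an explicit sync.RWMutex keeps the map strongly typed and tends to be the simpler choice when reads and writes are mixed across a small, fixed set of goroutines.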
After cherry-picking PR #86, the tests for 36 schedulers under 1 million nodes completed successfully without a service crash. Here is a summary for the 12th, 24th, and 36th schedulers.
12th scheduler:
24th scheduler:
36th scheduler:
After cherry-picking PR #86, the tests for 36 schedulers under 1 million nodes completed successfully without a service crash. Here is a summary for the 14th, 28th, and 40th schedulers. In theory, the 40th scheduler should not have been allocated its 25000 requested machines due to "not enough hosts", which is weird. Further tests will be done with 3 x 14 schedulers to see what happens.
14th scheduler:
28th scheduler:
40th scheduler:
Resolving per the fix in PR #93.
In https://github.com/yb01/arktos/wiki/730-test, for the 'Field goal' case (1 million nodes / 5 regions / 40 schedulers / 25K nodes per scheduler), service crashes happen frequently when the number of schedulers reaches 30 or more.
Here are the statistics of the tests run between the afternoon of 2022-07-12 and the morning of 2022-07-13:
The following error was caught in the log:
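For context on this class of failure: when a Go map such as the checkpoints map is read and written from multiple goroutines without synchronization, the runtime typically aborts with a fatal error like "concurrent map read and map write". The hypothetical, self-contained sketch below (not arktos code) shows the unguarded access pattern; running it under the race detector, e.g. go run -race, reports the data race explicitly, and without the race detector the runtime's map concurrency check usually crashes the process.

```go
package main

import (
	"sync"
	"time"
)

func main() {
	// Unsynchronized map, analogous to the checkpoints map before the fix.
	checkpoints := make(map[string]time.Time)

	var wg sync.WaitGroup
	wg.Add(2)

	// Writer goroutine: keeps recording a checkpoint.
	go func() {
		defer wg.Done()
		for i := 0; i < 100000; i++ {
			checkpoints["distributor"] = time.Now()
		}
	}()

	// Reader goroutine: reads the same key concurrently.
	go func() {
		defer wg.Done()
		for i := 0; i < 100000; i++ {
			_ = checkpoints["distributor"]
		}
	}()

	wg.Wait()
}
```

Guarding every read and write with the RWMutex sketched in the earlier comment removes both the race report and the runtime fatal error.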