RFC: Improve run assignment. #2443
Conversation
Pull request overview
This PR improves the run assignment algorithm to better handle tests with architecture filters that have few qualifying workers. The previous algorithm prevented workers from being assigned to the same test twice in a row, which meant tests received workers at most half the time, causing their actual throughput to fall below the requested throughput.
Changes:
- Modified the priority calculation to use a more nuanced approach for avoiding repeated assignment
- Removed the absolute restriction (`str(run["_id"]) == last_run_id`) that always prevented reassignment
- Removed the `run["cores"] > 0` check in the priority function
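The modified key can be sketched as follows (a minimal sketch using the names from the diff; the example runs, the `itp` values, and `max_threads` are hypothetical, not fishtest data):

```python
def run_key(run, last_run_id, max_threads):
    """Sort key for candidate runs; lower sorts first.

    Repeating the worker's previous run is penalized (the 3/2 * max_threads
    term) but no longer forbidden outright.
    """
    return (
        -run["args"]["priority"],  # higher priority always wins
        run["cores"] > 0,          # prefer runs that have no cores at all
        (run["cores"]
         + (3 / 2 if str(run["_id"]) == last_run_id else 1 / 2) * max_threads)
        / run["args"]["itp"],      # assigned cores relative to requested throughput
    )

# A starved high-throughput run keeps winning even if it was this worker's last run.
starved = {"_id": "a", "cores": 0, "args": {"priority": 0, "itp": 300}}
busy = {"_id": "b", "cores": 100, "args": {"priority": 0, "itp": 100}}
print(min([starved, busy], key=lambda r: run_key(r, "a", 8))["_id"])  # a
```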
@vdbergh if I'm not wrong you are not active on Discord, so I'm reposting here a snippet of my post in the sf-dev channel. I created a proof-of-concept repository to rewrite and test the `get_native_properties.sh` script:
I'm interested in opinions before opening a PR :)
DEV updated.
I think we should test with tests having the same priority. The priority behavior has not changed. Tests with higher priority are unconditionally preferred.
I forked the repo. I can only say that the code looks very nice. A perfect match between the script and the makefile is definitely a nice thing to have!
I added many tests with the same prio 5, though.
Ok!
I have increased the throughput of the bmi2 test. That should now cause it to monopolize the bmi2 workers. EDIT: It seems to have worked. A bmi2 worker finished a task and it stayed with the bmi2 test. Previously it would have been forced to do a task of another test first.
Now I reduced the throughput of the bmi2 test. That means that the bmi2 workers become free to roam around. EDIT: It worked. A bmi2 worker moved to another test.
Things seem to be working as expected, but I am going to put the line back in. Without that line a low throughput test can be permanently stuck without cores. That's not what the user expects (if this were really their intention they would have set the priority to -1).
A test will get any worker at most half the time, since a worker cannot work twice in a row on the same test, no matter what throughput is set. This is a problem for a test which has an arch filter for which there are few qualifying workers: its actual throughput will be substantially below its requested throughput. We propose to correct this by sorting runs according to the following key (lexicographic ordering, lower is better):

```python
-run["args"]["priority"],
run["cores"] > 0,
(run["cores"] + (3/2 if str(run["_id"]) == last_run_id else 1/2) * max_threads) / run["args"]["itp"],
```

In the last line we have re-added the number of cores of this worker that were freed when the last task on this run finished. This ensures that the current worker does not automatically get the previous run again, but it also does not prevent it in case `run["cores"]` is low compared to the requested throughput. By comparison, the current key is:

```python
-run["args"]["priority"],
str(run["_id"]) == last_run_id,
run["cores"] > 0,
(run["cores"] + max_threads / 2) / run["args"]["itp"],
```

This means a different run is always preferred, unless there is no other possibility.
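The effect of the two keys can be illustrated with a small self-contained sketch (toy numbers; `max_threads`, the `itp` values, and the run dicts are hypothetical, and this is not the fishtest implementation):

```python
max_threads = 8  # threads offered by the connecting worker (assumed)

def old_key(run, last_run_id):
    # Current master: a hard veto on repeating the previous run.
    return (
        -run["args"]["priority"],
        str(run["_id"]) == last_run_id,
        run["cores"] > 0,
        (run["cores"] + max_threads / 2) / run["args"]["itp"],
    )

def new_key(run, last_run_id):
    # Proposed: repeating is only penalized via the 3/2 * max_threads term.
    return (
        -run["args"]["priority"],
        run["cores"] > 0,
        (run["cores"] + (3 / 2 if str(run["_id"]) == last_run_id else 1 / 2) * max_threads)
        / run["args"]["itp"],
    )

# bmi2 run: few qualifying workers, high requested throughput, starved for cores.
bmi2 = {"_id": "bmi2", "cores": 8, "args": {"priority": 0, "itp": 1000}}
# Ordinary run: well supplied relative to its requested throughput.
other = {"_id": "other", "cores": 64, "args": {"priority": 0, "itp": 100}}

last = "bmi2"  # the worker's previous task was on the bmi2 run
print([r["_id"] for r in sorted([bmi2, other], key=lambda r: old_key(r, last))])
# ['other', 'bmi2'] -- old key: bmi2 is vetoed for this worker
print([r["_id"] for r in sorted([bmi2, other], key=lambda r: new_key(r, last))])
# ['bmi2', 'other'] -- new key: the starved bmi2 run can keep the worker
```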
DEV updated!
It seems to be working. I reduced the throughput of the bmi2 test to 25% and it is still getting a worker. I will now raise the throughput again.
Everything seems to be as expected. |
ppigazzini left a comment
Looks good on DEV, PROD updated after merging, thank you @vdbergh
Looking forward to the PR BTW!
I've force-pushed the initial commit - I tend to work a bit recklessly on my own projects :) - and did a substantial refactor and fixed the ARM detection. It should now be ready for the SF PR.
We fix an issue created by #2443. Assume that X is a low throughput test, currently with 0 cores. Then in current master it moves to the front of the queue. Assume it manages to pick up a large core worker Y. Since X is low throughput it will not pick up more workers. When Y finishes its task on X, X falls back to 0 cores and moves to the front of the queue again. Since Y is now looking for a new task it will probably pick up X again. So X and Y are permanently paired, which is not desirable. In this PR we fix this by replacing `run["cores"]` by `adjusted_cores = run["cores"] + (max_threads if run_id == last_run_id else 0)`. In other words we are re-adding the number of cores that were freed by this worker when the previous task on this run finished.
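The pairing scenario and the fix can be sketched numerically (toy values; `max_threads`, the run dict, and the `itp` value are hypothetical, not the fishtest code):

```python
max_threads = 64  # worker Y's core count (hypothetical)

# Run X just lost its only worker (Y) and is back to 0 cores.
run = {"_id": "X", "cores": 0, "args": {"itp": 50}}
run_id, last_run_id = str(run["_id"]), "X"

# Before the fix: X's 0 cores make it look maximally starved to Y itself,
# so X jumps to the front of Y's queue and the pair never separates.
naive = run["cores"] / run["args"]["itp"]

# After the fix: from Y's point of view, X still "has" the cores Y just freed.
adjusted_cores = run["cores"] + (max_threads if run_id == last_run_id else 0)
adjusted = adjusted_cores / run["args"]["itp"]

print(naive, adjusted)  # 0.0 1.28 -- X no longer looks starved to Y
```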
This PR is inspired by the following test:
https://tests.stockfishchess.org/tests/view/696921e0fa8ace4d6d448009?show_task=26