Conversation

@vdbergh (Contributor) commented Jan 15, 2026

This PR is inspired by the following test:

https://tests.stockfishchess.org/tests/view/696921e0fa8ace4d6d448009?show_task=26

A test will get any worker at most half the time since a worker cannot work twice in a row on the same test, no matter what throughput is set.

This is a problem for a test with an arch filter that few workers satisfy: its actual throughput will then be substantially below its requested throughput.

We propose to correct this by sorting runs according to the following key (lexicographic ordering, lower is better):

```python
-run["args"]["priority"],
run["cores"] > 0,
(run["cores"] + (3/2 if str(run["_id"]) == last_run_id else 1/2) * max_threads) / run["args"]["itp"],
```

In the last line we have re-added the number of cores from this worker that were freed when the last task on this run finished. This ensures that the current worker does not automatically get the previous run again, but it also does not prevent it in case run["cores"] is low compared to the requested throughput.

By comparison, the current key is:

```python
-run["args"]["priority"],
str(run["_id"]) == last_run_id,
run["cores"] > 0,
(run["cores"] + max_threads / 2) / run["args"]["itp"],
```

This means a different run is always preferred, unless there is no other possibility.
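
To make the difference concrete, here is a minimal runnable sketch (not the actual fishtest code; the sample runs, the `max_threads` value, and the `itp` numbers are invented for illustration):

```python
# Hypothetical illustration of the two sort keys; lower tuples win.
max_threads = 8          # cores offered by the requesting worker (assumed)
last_run_id = "run_a"    # the run this worker just finished a task on

def old_key(run):
    return (
        -run["args"]["priority"],
        str(run["_id"]) == last_run_id,   # hard penalty: same run sorts last
        run["cores"] > 0,
        (run["cores"] + max_threads / 2) / run["args"]["itp"],
    )

def new_key(run):
    return (
        -run["args"]["priority"],
        run["cores"] > 0,
        # re-add the cores this worker just freed instead of banning the run
        (run["cores"] + (3 / 2 if str(run["_id"]) == last_run_id else 1 / 2) * max_threads)
        / run["args"]["itp"],
    )

runs = [
    # run_a: arch-filtered test, few qualifying workers, high requested throughput (itp)
    {"_id": "run_a", "cores": 4, "args": {"priority": 0, "itp": 100}},
    # run_b: unfiltered test that already has plenty of cores
    {"_id": "run_b", "cores": 64, "args": {"priority": 0, "itp": 36}},
]

print(min(runs, key=old_key)["_id"])  # run_b: the worker is forced elsewhere
print(min(runs, key=new_key)["_id"])  # run_a: the starved run keeps its worker
```

With the old key the second tuple element alone demotes run_a; with the new key the 3/2 * max_threads penalty is small relative to run_a's itp, so the worker may stay.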

vdbergh changed the title from "RFC: Modify run assignment." to "RFC: Improve run assignment." on Jan 16, 2026
ppigazzini requested a review from Copilot on January 16, 2026 at 08:57
Copilot AI left a comment

Pull request overview

This PR improves the run assignment algorithm to better handle tests with architecture filters that have few qualifying workers. The previous algorithm prevented workers from being assigned to the same test twice in a row, which meant tests received workers at most half the time, causing their actual throughput to fall below the requested throughput.

Changes:

  • Modified the priority calculation to use a more nuanced approach for avoiding repeated assignment
  • Removed the absolute restriction (`str(run["_id"]) == last_run_id`) that always prevented reassignment
  • Removed the `run["cores"] > 0` check in the priority function

@ppigazzini (Collaborator)

@vdbergh, if I'm not mistaken you are not active on Discord, so I'm reporting here a snippet of my post in the sf-dev channel.

I created a proof‑of‑concept repository to rewrite and test the get_native_properties.sh script:
https://github.com/ppigazzini/get-native-properties
The repo:

  • includes a new script that supports all ARCHs defined in the Makefile and outputs only the ARCH, keeping it separate from asset‑filename logic
  • includes tests for every architecture with a light instrumentation
  • runs GitHub CI smoke tests on runner OSes and Docker images (unfortunately, QEMU gives mixed results in `/proc/cpuinfo`)

I'm interested in opinions before opening a PR :)

@ppigazzini (Collaborator)

DEV updated.

@vdbergh (Contributor, Author) commented Jan 16, 2026

> DEV updated.

I think we should test with tests having the same priority. The priority behavior has not changed. Tests with higher priority are unconditionally preferred.

@vdbergh (Contributor, Author) commented Jan 16, 2026

> @vdbergh, if I'm not mistaken you are not active on Discord, so I'm reporting here a snippet of my post in the sf-dev channel.
>
> I created a proof‑of‑concept repository to rewrite and test the get_native_properties.sh script: https://github.com/ppigazzini/get-native-properties The repo:
>
> * includes a new script that supports all ARCHs defined in the Makefile and outputs only the ARCH, keeping it separate from asset‑filename logic
> * includes tests for every architecture with a light instrumentation
> * runs GitHub CI smoke tests on runner OSes and Docker images (unfortunately, QEMU gives mixed results in `/proc/cpuinfo`)
>
> I'm interested in opinions before opening a PR :)

I forked the repo. I can only say that the code looks very nice. A perfect match between the script and the Makefile is definitely a nice thing to have!

@ppigazzini (Collaborator)

> DEV updated.
>
> I think we should test with tests having the same priority. The priority behavior has not changed. Tests with higher priority are unconditionally preferred.

I added many tests with the same prio 5, though.

@vdbergh (Contributor, Author) commented Jan 16, 2026

> DEV updated.
>
> I think we should test with tests having the same priority. The priority behavior has not changed. Tests with higher priority are unconditionally preferred.
>
> I added many tests with the same prio 5, though.

Ok!

@vdbergh (Contributor, Author) commented Jan 16, 2026

I have increased the throughput of the bmi2 test. That should now cause it to monopolize the bmi2 workers (currently it is not doing so, since that is not necessary to achieve the requested throughput).

EDIT: It seems to have worked. A bmi2 worker finished a task and stayed with the bmi2 test. Previously it would have been forced to do a task of another test first.

@vdbergh (Contributor, Author) commented Jan 16, 2026

Now I reduced the throughput of the bmi2 test. That means that the bmi2 workers become free to roam around.

EDIT: It worked. A bmi2 worker moved to another test.

@vdbergh (Contributor, Author) commented Jan 16, 2026

Things seem to be working as expected, but I am going to put the line `run["cores"] > 0` back in. Without that line a low throughput test can be permanently stuck without cores. That's not what the user expects (if that were really their intention they would have set the priority to -1).
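
As a toy illustration of the failure mode that line guards against (invented values, not the fishtest code):

```python
# Why run["cores"] > 0 stays in the key: without it, a run with zero
# cores and a low itp can sort behind every other run indefinitely.
max_threads = 8
# Assume this worker did not just serve either run, so the last_run_id
# term in the new key reduces to max_threads / 2 for both runs.

def key_without_guard(run):
    return (
        -run["args"]["priority"],
        (run["cores"] + max_threads / 2) / run["args"]["itp"],
    )

def key_with_guard(run):
    return (
        -run["args"]["priority"],
        run["cores"] > 0,   # any run at zero cores sorts first
        (run["cores"] + max_threads / 2) / run["args"]["itp"],
    )

starved = {"_id": "x", "cores": 0, "args": {"priority": 0, "itp": 1}}
busy = {"_id": "y", "cores": 16, "args": {"priority": 0, "itp": 200}}

print(min([starved, busy], key=key_without_guard)["_id"])  # y: 4/1 > 20/200
print(min([starved, busy], key=key_with_guard)["_id"])     # x: gets a worker
```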

@ppigazzini (Collaborator)

DEV updated!

@vdbergh (Contributor, Author) commented Jan 16, 2026

It seems to be working. I reduced the throughput of the bmi2 test to 25% and it is still getting a worker. I will now raise the throughput again.

@vdbergh (Contributor, Author) commented Jan 16, 2026

Everything seems to be as expected.

@ppigazzini (Collaborator) left a comment

Looks good on DEV, PROD updated after merging, thank you @vdbergh

ppigazzini merged commit b9e37e8 into official-stockfish:master Jan 16, 2026
21 checks passed
@vdbergh (Contributor, Author) commented Jan 16, 2026

Looking forward to the PR BTW!

@ppigazzini (Collaborator)

> Looking forward to the PR BTW!

I've force‑pushed the initial commit - I tend to work a bit recklessly on my own projects :) - and did a substantial refactor and fixed the ARM detection. It should now be ready for the SF PR.
I'm reading about Incus, the fork of LXD that supports non-Ubuntu distros, and trying to run a QEMU VM to check the ARMv7 CPU.

vdbergh added a commit to vdbergh/fishtest that referenced this pull request Jan 17, 2026
We fix an issue created by official-stockfish#2443.

Assume that X is a low throughput test, currently with 0 cores.
Then in current master it moves to the front of the queue.
Assume it manages to pick up a large core worker Y. Since X is
low throughput it will not pick up more workers. When Y finishes its task
on X, X falls back to 0 cores and moves to the front
of the queue again. Since Y is now looking for a new task it will
probably pick up X again. So X and Y are permanently paired, which is
not desirable.

In this PR we fix this by replacing run["cores"] by

adjusted_cores = run["cores"] + (max_threads if run_id == last_run_id else 0)

In other words we are re-adding the number of cores that were freed by this
worker when the previous task on this run finished.
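
A small numeric sketch of the idea (hypothetical 32-core worker; names follow the commit message):

```python
# Worker Y (32 cores, assumed) just finished the only task of run X,
# so X's core count dropped back to 0.
max_threads = 32
run_id, last_run_id = "X", "X"
cores = 0

# Re-add the cores Y just freed when ranking X for Y's next request:
adjusted_cores = cores + (max_threads if run_id == last_run_id else 0)
print(adjusted_cores)  # 32: for Y, X no longer looks starved, so X and Y
                       # are not permanently paired; other workers still
                       # see X with 0 cores and can pick it up.
```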
vdbergh added a commit to vdbergh/fishtest that referenced this pull request Jan 17, 2026
ppigazzini pushed a commit to vdbergh/fishtest that referenced this pull request Jan 17, 2026
vdbergh added a commit to vdbergh/fishtest that referenced this pull request Jan 17, 2026
ppigazzini pushed a commit that referenced this pull request Jan 17, 2026
Labels: enhancement, server (server side changes)