
Improve resiliency of long-running async tasks #1081

Merged
merged 5 commits into frequenz-floss:v1.x.x from broken-task-fixes
Oct 10, 2024

Conversation

shsms (Contributor) commented Sep 27, 2024

Closes #1078

shsms requested a review from a team as a code owner September 27, 2024 15:00
shsms requested review from Marenz and removed request for a team September 27, 2024 15:00
github-actions bot added the part:data-pipeline, part:core, part:microgrid, and part:docs labels Sep 27, 2024
shsms marked this pull request as draft September 27, 2024 15:08
shsms (Contributor Author) commented Sep 27, 2024

Weirdly, pytest_max is getting stuck, but pytest_min is working. I will investigate.

shsms (Contributor Author) commented Sep 27, 2024

Appears to be an issue in the config manager. There haven't been any changes there recently, other than the updates to FileWatcher in channels.

Looks like this needs to be fixed in the channels repo.

shsms marked this pull request as ready for review September 27, 2024 15:19
llucax (Contributor) commented Sep 30, 2024

We also have some issues with tests getting stuck in frequenz-floss/frequenz-dispatch-python#54. It also seems to be related to an update of the client-dispatch dependency, which in turn pulls in an updated channels dependency. It might be related to the changes in Timer rather than in FileWatcher.

llucax (Contributor) commented Sep 30, 2024

@Marenz was doing some debugging for that. I tried to help but didn't have a lot of time to dedicate to it, but maybe we need to increase its priority if it is affecting more projects.

shsms force-pushed the broken-task-fixes branch from aa0cf6c to b09caee on October 7, 2024 08:58
llucax previously approved these changes Oct 7, 2024

llucax (Contributor) left a comment


Awesome. We should probably also move run_forever to core (frequenz-floss/frequenz-core-python#33).

```
@@ -28,6 +32,25 @@ async def cancel_and_await(task: asyncio.Task[Any]) -> None:
        pass


async def run_forever(
    async_callable: Callable[[], Coroutine[Any, Any, None]],
    interval: timedelta = timedelta(seconds=1),
```
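
For illustration, a minimal sketch of what such a restart-forever helper might look like, assuming only the signature visible above and the commit message's description ("keeps running it forever"); the body below is an illustration, not necessarily the exact code in this PR:

```python
import asyncio
import logging
from collections.abc import Callable, Coroutine
from datetime import timedelta
from typing import Any

_logger = logging.getLogger(__name__)


async def run_forever(
    async_callable: Callable[[], Coroutine[Any, Any, None]],
    interval: timedelta = timedelta(seconds=1),
) -> None:
    """Run the given coroutine-returning callable forever.

    If the coroutine raises, log the exception, wait for `interval`, and
    start it again, so a single crash doesn't permanently kill the task.
    """
    interval_s = interval.total_seconds()
    while True:
        try:
            await async_callable()
        except Exception:  # pylint: disable=broad-except
            _logger.exception("Restarting after exception")
        await asyncio.sleep(interval_s)
```

A component could then start its background work with something like `asyncio.create_task(run_forever(self._run))` instead of maintaining its own `_run_forever` loop (again, an illustration, not the PR's exact call sites).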
llucax (Contributor) commented Oct 7, 2024

In the future it might be nice to be able to pass a retry strategy. A good reason to move the retry module from client-base to core (frequenz-floss/frequenz-core-python#34).

shsms (Contributor Author) replied

ExponentialBackoff is hard to do with generic functions, because it needs to know whether a previous attempt succeeded, so that it can reset its backoff interval.

This is easy to do with streams: we just reset it after every incoming message. But for functions, we'd need something like a timer to reset the backoff interval: for example, if the function hasn't failed for a certain interval, then we say it succeeded.
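
One way to picture that idea, purely as a sketch (the class and function names here are made up, not SDK APIs): treat a run as successful if it lasted longer than some threshold, and only then reset the backoff.

```python
import asyncio
import logging
import time
from collections.abc import Callable, Coroutine
from typing import Any

_logger = logging.getLogger(__name__)


class _Backoff:
    """A stand-in exponential backoff: doubles the wait on failure, up to a cap."""

    def __init__(self, initial: float = 1.0, maximum: float = 60.0) -> None:
        self._initial = initial
        self._maximum = maximum
        self._current = initial

    def next_interval(self) -> float:
        interval = self._current
        self._current = min(self._current * 2, self._maximum)
        return interval

    def reset(self) -> None:
        self._current = self._initial


async def run_forever_with_backoff(
    async_callable: Callable[[], Coroutine[Any, Any, None]],
    success_threshold: float = 30.0,
) -> None:
    """Restart the callable forever, resetting the backoff only after a long enough run."""
    backoff = _Backoff()
    while True:
        start = time.monotonic()
        try:
            await async_callable()
        except Exception:  # pylint: disable=broad-except
            _logger.exception("Task failed, restarting with backoff")
        if time.monotonic() - start >= success_threshold:
            # The task ran for a while before returning or failing, so the
            # previous attempt counts as a success and the backoff is reset.
            backoff.reset()
        await asyncio.sleep(backoff.next_interval())
```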

llucax (Contributor) replied

Yeah, maybe we can have a strategy that actually takes into account how long the task ran, so that if it ran for a long time last time it waits very little before restarting, and if it failed immediately it waits quite a long time before restarting 🤔
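
As a hypothetical illustration of that suggestion (nothing here is existing SDK code): derive the restart delay from the previous run's duration, so a long run restarts almost immediately and an immediate failure waits longer.

```python
def restart_delay(run_duration_s: float, max_delay_s: float = 60.0) -> float:
    """Return a restart delay that shrinks as the previous run's duration grows.

    A task that failed immediately waits close to `max_delay_s` before
    restarting; a task that ran for `max_delay_s` or longer restarts with
    (almost) no delay.
    """
    return max(0.0, max_delay_s - run_duration_s)


# For example:
# restart_delay(0.1)   -> 59.9 (crashed right away, back off for a while)
# restart_delay(120.0) -> 0.0  (ran for two minutes, restart immediately)
```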

shsms added this pull request to the merge queue Oct 7, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 7, 2024
shsms added this pull request to the merge queue Oct 7, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 7, 2024
shsms added this pull request to the merge queue Oct 7, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 7, 2024
llucax (Contributor) commented Oct 8, 2024

What now with the damn tests 😭 😠

shsms (Contributor Author) commented Oct 8, 2024

Same timing issue in the battery pool tests as in the previous PR, which I had a magic fix for.

I'll look for a better fix that doesn't depend on the clock.

github-actions bot added the part:tests label Oct 8, 2024
shsms force-pushed the broken-task-fixes branch from a2c6b26 to 200e5d2 on October 8, 2024 12:38
shsms enabled auto-merge October 8, 2024 12:38
shsms added this pull request to the merge queue Oct 8, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 8, 2024
shsms added this pull request to the merge queue Oct 8, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 8, 2024
shsms added 5 commits October 10, 2024 15:37
Without this commit, the SoC calculation task would crash and not
recover if the upper and lower SoC bounds are the same.

Signed-off-by: Sahas Subramanian <[email protected]>
The function takes a callable that returns a coroutine, and keeps
running it forever.

This commit also replaces the custom `_run_forever` implementations
with the generic `run_forever`.

Signed-off-by: Sahas Subramanian <[email protected]>
This makes sure that when a streaming method for any of the battery
pool metrics raises an exception, the method will be restarted.

Signed-off-by: Sahas Subramanian <[email protected]>
These appear to be the only remaining long-running functions that
don't have their own exception handling.

Signed-off-by: Sahas Subramanian <[email protected]>
Signed-off-by: Sahas Subramanian <[email protected]>
shsms (Contributor Author) commented Oct 10, 2024

@ela-kotulska-frequenz has fixed the merge-time CI problem here: #1085

This needs another approval.

shsms enabled auto-merge October 10, 2024 13:39
shsms added this pull request to the merge queue Oct 10, 2024
Merged via the queue into frequenz-floss:v1.x.x with commit 443e83e Oct 10, 2024
18 checks passed
shsms deleted the broken-task-fixes branch October 10, 2024 14:23
llucax mentioned this pull request Oct 11, 2024
Labels
part:core Affects the SDK core components (data structures, etc.)
part:data-pipeline Affects the data pipeline
part:docs Affects the documentation
part:microgrid Affects the interactions with the microgrid
part:tests Affects the unit, integration and performance (benchmarks) tests

Successfully merging this pull request may close these issues:

SoC does not get updated (#1078)