
Improve resiliency of long-running async tasks #1081

Merged
merged 5 commits into frequenz-floss:v1.x.x from broken-task-fixes
Oct 10, 2024

Conversation

shsms (Contributor) commented Sep 27, 2024

Closes #1078

shsms requested a review from a team as a code owner September 27, 2024 15:00
shsms requested review from Marenz and removed request for a team September 27, 2024 15:00
github-actions bot added the part:data-pipeline, part:core, part:microgrid, and part:docs labels Sep 27, 2024
shsms marked this pull request as draft September 27, 2024 15:08
shsms (Contributor Author) commented Sep 27, 2024

Weirdly, pytest_max is getting stuck, but pytest_min is working. I will investigate.

shsms (Contributor Author) commented Sep 27, 2024

Appears to be an issue in the config manager. There haven't been any changes there recently, other than the updates to FileWatcher in channels.

Looks like this needs to be fixed in the channels repo.

shsms marked this pull request as ready for review September 27, 2024 15:19
llucax (Contributor) commented Sep 30, 2024

We also have some issues with tests getting stuck in frequenz-floss/frequenz-dispatch-python#54. It also seems to be related to an update of the client-dispatch dependency, which in turn pulls in an updated channels dependency. It might be related to the changes in Timer rather than in FileWatcher.

llucax (Contributor) commented Sep 30, 2024

@Marenz was doing some debugging for that. I tried to help but didn't have a lot of time to dedicate to it, but maybe we need to increase its priority if it is affecting more projects.

shsms force-pushed the broken-task-fixes branch from aa0cf6c to b09caee on October 7, 2024 08:58
llucax previously approved these changes Oct 7, 2024

llucax (Contributor) left a comment


Awesome. We should probably also move run_forever to core (frequenz-floss/frequenz-core-python#33).

```
@@ -28,6 +32,25 @@ async def cancel_and_await(task: asyncio.Task[Any]) -> None:
        pass


async def run_forever(
    async_callable: Callable[[], Coroutine[Any, Any, None]],
    interval: timedelta = timedelta(seconds=1),
```
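
For illustration, a minimal sketch of what such a restart-forever helper might look like, assuming only the signature visible above and the commit message's description ("keeps running it forever"); the body below is an illustration, not necessarily the exact code in this PR:

```python
import asyncio
import logging
from collections.abc import Callable, Coroutine
from datetime import timedelta
from typing import Any

_logger = logging.getLogger(__name__)


async def run_forever(
    async_callable: Callable[[], Coroutine[Any, Any, None]],
    interval: timedelta = timedelta(seconds=1),
) -> None:
    """Run the given coroutine-returning callable forever.

    If the coroutine raises, log the exception, wait for `interval`, and
    start it again, so a single crash doesn't permanently kill the task.
    """
    interval_s = interval.total_seconds()
    while True:
        try:
            await async_callable()
        except Exception:  # pylint: disable=broad-except
            _logger.exception("Restarting after exception")
        await asyncio.sleep(interval_s)
```

A component could then start its background work with something like `asyncio.create_task(run_forever(self._run))` instead of maintaining its own `_run_forever` loop (again, an illustration, not the PR's exact call sites).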
llucax (Contributor) commented Oct 7, 2024

In the future it might be nice to be able to pass a retry strategy. A good reason to move the retry module from client-base to core (frequenz-floss/frequenz-core-python#34).

shsms (Contributor Author) replied

ExponentialBackoff is hard to do with generic functions, because it needs to know whether a previous attempt succeeded, so that it can reset its backoff interval.

This is easy to do with streams: we just reset it after every incoming message. But for functions, we'd need something like a timer to reset the backoff interval: for example, if the function hasn't failed for a certain interval, then we say it succeeded.
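
One way to picture that idea, purely as a sketch (the class and function names here are made up, not SDK APIs): treat a run as successful if it lasted longer than some threshold, and only then reset the backoff.

```python
import asyncio
import logging
import time
from collections.abc import Callable, Coroutine
from typing import Any

_logger = logging.getLogger(__name__)


class _Backoff:
    """A stand-in exponential backoff: doubles the wait on failure, up to a cap."""

    def __init__(self, initial: float = 1.0, maximum: float = 60.0) -> None:
        self._initial = initial
        self._maximum = maximum
        self._current = initial

    def next_interval(self) -> float:
        interval = self._current
        self._current = min(self._current * 2, self._maximum)
        return interval

    def reset(self) -> None:
        self._current = self._initial


async def run_forever_with_backoff(
    async_callable: Callable[[], Coroutine[Any, Any, None]],
    success_threshold: float = 30.0,
) -> None:
    """Restart the callable forever, resetting the backoff only after a long enough run."""
    backoff = _Backoff()
    while True:
        start = time.monotonic()
        try:
            await async_callable()
        except Exception:  # pylint: disable=broad-except
            _logger.exception("Task failed, restarting with backoff")
        if time.monotonic() - start >= success_threshold:
            # The task ran for a while before returning or failing, so the
            # previous attempt counts as a success and the backoff is reset.
            backoff.reset()
        await asyncio.sleep(backoff.next_interval())
```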

llucax (Contributor) replied

Yeah, maybe we can have a strategy that actually takes into account how long the task ran, so that if it ran for a long time last time it waits very little before restarting, and if it failed immediately it waits quite a long time before restarting 🤔
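
As a hypothetical illustration of that suggestion (nothing here is existing SDK code): derive the restart delay from the previous run's duration, so a long run restarts almost immediately and an immediate failure waits longer.

```python
def restart_delay(run_duration_s: float, max_delay_s: float = 60.0) -> float:
    """Return a restart delay that shrinks as the previous run's duration grows.

    A task that failed immediately waits close to `max_delay_s` before
    restarting; a task that ran for `max_delay_s` or longer restarts with
    (almost) no delay.
    """
    return max(0.0, max_delay_s - run_duration_s)


# For example:
# restart_delay(0.1)   -> 59.9 (crashed right away, back off for a while)
# restart_delay(120.0) -> 0.0  (ran for two minutes, restart immediately)
```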

shsms added this pull request to the merge queue Oct 7, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 7, 2024
shsms added this pull request to the merge queue Oct 7, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 7, 2024
shsms added this pull request to the merge queue Oct 7, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 7, 2024
llucax (Contributor) commented Oct 8, 2024

What now with the damn tests 😭 😠

shsms (Contributor Author) commented Oct 8, 2024

Same timing issue in the battery pool tests as in the previous PR, which I had a magic fix for.

I'll look for a better fix that doesn't depend on the clock.

github-actions bot added the part:tests label Oct 8, 2024
shsms force-pushed the broken-task-fixes branch from a2c6b26 to 200e5d2 on October 8, 2024 12:38
shsms enabled auto-merge October 8, 2024 12:38
shsms added this pull request to the merge queue Oct 8, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 8, 2024
shsms added this pull request to the merge queue Oct 8, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 8, 2024
shsms added 5 commits October 10, 2024 15:37
Without this commit, the SoC calculation task would crash and not
recover if the upper and lower SoC bounds are the same.

Signed-off-by: Sahas Subramanian <[email protected]>
The function takes a callable that returns a coroutine, and keeps
running it forever.

This commit also replaces the custom `_run_forever` implementations
with the generic `run_forever`.

Signed-off-by: Sahas Subramanian <[email protected]>
This makes sure that when a streaming method for any of the battery
pool metrics raises an exception, the method will be restarted.

Signed-off-by: Sahas Subramanian <[email protected]>
These appear to be the only remaining long-running functions that
don't have their own exception handling.

Signed-off-by: Sahas Subramanian <[email protected]>
Signed-off-by: Sahas Subramanian <[email protected]>
shsms (Contributor Author) commented Oct 10, 2024

@ela-kotulska-frequenz has fixed the merge-time CI problem here: #1085

This needs another approval.

shsms enabled auto-merge October 10, 2024 13:39
shsms added this pull request to the merge queue Oct 10, 2024
Merged via the queue into frequenz-floss:v1.x.x with commit 443e83e Oct 10, 2024
18 checks passed
shsms deleted the broken-task-fixes branch October 10, 2024 14:23
llucax mentioned this pull request Oct 11, 2024
Labels
part:core Affects the SDK core components (data structures, etc.)
part:data-pipeline Affects the data pipeline
part:docs Affects the documentation
part:microgrid Affects the interactions with the microgrid
part:tests Affects the unit, integration and performance (benchmarks) tests

Successfully merging this pull request may close these issues:

SoC does not get updated (#1078)