Improve resiliency of long-running async tasks #1081
Conversation
Weirdly the
Appears to be in the config manager issue. There have not been any changes recently, other than the updates to FileWatcher in channels. Looks like this needs to be fixed in the channels repo.
We also have some issues with tests getting stuck in frequenz-floss/frequenz-dispatch-python#54; it also seems to be related to an update of the
@Marenz was doing some debugging for that. I tried to help but didn't have a lot of time to dedicate to this, but maybe we need to increase its priority if it is affecting more projects.
Force-pushed from aa0cf6c to b09caee.
Awesome. We should probably also move `run_forever` to `core` (frequenz-floss/frequenz-core-python#33).
@@ -28,6 +32,25 @@ async def cancel_and_await(task: asyncio.Task[Any]) -> None:
        pass


async def run_forever(
    async_callable: Callable[[], Coroutine[Any, Any, None]],
    interval: timedelta = timedelta(seconds=1),
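For context, a minimal sketch of what a helper with this signature could look like; the loop body, the logger, and the choice to also sleep after a normal return are assumptions here, not necessarily the implementation added in this PR:

```python
import asyncio
import logging
from collections.abc import Callable, Coroutine
from datetime import timedelta
from typing import Any

_logger = logging.getLogger(__name__)


async def run_forever(
    async_callable: Callable[[], Coroutine[Any, Any, None]],
    interval: timedelta = timedelta(seconds=1),
) -> None:
    """Run the coroutine returned by `async_callable`, restarting it if it raises."""
    interval_s = interval.total_seconds()
    while True:
        try:
            await async_callable()
        except Exception:  # pylint: disable=broad-except
            _logger.exception("Task raised an exception, restarting in %s seconds", interval_s)
        # Sketch assumption: wait `interval` before re-running, whether the
        # coroutine failed or returned normally.
        await asyncio.sleep(interval_s)
```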
In the future it might be nice to be able to pass a retry strategy. A good reason to move the retry module from `client-base` to `core` (frequenz-floss/frequenz-core-python#34).
ExponentialBackoff is hard to do with generic functions, because it needs to know whether a previous attempt succeeded, so that it can reset its backoff interval.
This is easy to do with streams: we just reset it after every incoming message. But for functions we'd need something like a timer to reset the backoff interval; if the function hasn't failed for a certain interval, then we say it succeeded, for example.
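A hedged sketch of that timer-reset idea: if the callable ran for longer than some threshold before failing, count the attempt as a success and reset the backoff. The `ExponentialBackoff` class and the `reset_after` parameter below are stand-ins for illustration, not the retry strategy from `client-base`:

```python
import asyncio
import time
from collections.abc import Callable, Coroutine
from typing import Any


class ExponentialBackoff:
    """Stand-in backoff: doubles the wait after each failure, up to a maximum."""

    def __init__(self, initial: float = 1.0, maximum: float = 60.0) -> None:
        self._initial = initial
        self._maximum = maximum
        self._current = initial

    def next_interval(self) -> float:
        interval = self._current
        self._current = min(self._current * 2.0, self._maximum)
        return interval

    def reset(self) -> None:
        self._current = self._initial


async def run_with_backoff(
    async_callable: Callable[[], Coroutine[Any, Any, None]],
    reset_after: float = 30.0,
) -> None:
    """Re-run `async_callable` forever, resetting the backoff if it ran long enough."""
    backoff = ExponentialBackoff()
    while True:
        start = time.monotonic()
        try:
            await async_callable()
        except Exception:  # pylint: disable=broad-except
            if time.monotonic() - start >= reset_after:
                # The task ran for a while before failing, so treat the attempt
                # as a success and restart the backoff sequence from scratch.
                backoff.reset()
            await asyncio.sleep(backoff.next_interval())
        else:
            # Clean return: reset and re-run immediately (details elided).
            backoff.reset()
```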
Yeah, maybe we can have a strategy that actually takes into account how long the task ran, so if it ran for a long time last time it waits very little to restart, and if it failed immediately it waits quite a long time to restart 🤔
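A small sketch of such a strategy, with made-up names and an illustrative linear mapping from run duration to restart delay:

```python
class RuntimeAwareDelay:
    """The longer the previous run lasted, the shorter the restart delay."""

    def __init__(self, max_delay: float = 60.0, min_delay: float = 0.1) -> None:
        self._max = max_delay
        self._min = min_delay

    def next_delay(self, last_run_duration: float) -> float:
        # A task that died immediately waits close to `max_delay`; one that ran
        # for `max_delay` seconds or longer restarts almost right away.
        return max(self._min, self._max - last_run_duration)
```

Whether a linear mapping is the right shape is debatable; the point is just that the strategy would need the previous run's duration as input.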
What now with the damn tests 😭 😠
Same timing issue in the battery pool tests as in the previous PR, which I had a magic fix for. I'll look for a better fix that doesn't depend on the clock.
Force-pushed from a2c6b26 to 200e5d2.
Without this commit, the SoC calculation task would crash and not recover if the upper and lower SoC bounds are the same. Signed-off-by: Sahas Subramanian <[email protected]>
The function takes a callable that returns a coroutine, and keeps running it forever. This commit also replaces the custom `_run_forever` implementations with the generic `run_forever`. Signed-off-by: Sahas Subramanian <[email protected]>
This makes sure that when a streaming method for any of the battery pool metrics raises an exception, the method will be restarted. Signed-off-by: Sahas Subramanian <[email protected]>
These appear to be the only remaining long-running functions that don't have their own exception handling. Signed-off-by: Sahas Subramanian <[email protected]>
Signed-off-by: Sahas Subramanian <[email protected]>
Force-pushed from 200e5d2 to cd0c532.
@ela-kotulska-frequenz has fixed the merge-time CI problem here: #1085. This needs another approval.
Closes #1078