Activities fail with timeouts when the .NET worker process crashes #2939
Comments
Hi @dxynnez. I wonder if you might be running into a variant of #2788, where a worker process crash causes the orchestration to fail rather than being automatically retried. If so, we recently released a fix in the v1.1.7 release. Can you try this updated NuGet package and let us know if it resolves your issue?
Hi @cgillum, thanks for helping on this. But I am not entirely sure the root cause is the same: our activity function emits logs at different stages of execution, yet we did not see any log from the particular activity function after the restart of the worker process (that's why we suspect the unfinished activity function didn't get picked up again after the restart) - we only see logs from that activity function before the restart. The activity function itself did not time out; per our logs it only ran for about 20 minutes (unfinished because of the worker process restart). The restart was caused by another function, which ran for more than an hour and timed out.
@dxynnez thanks for the additional context. I recognize that the symptoms you are seeing are different from what others have reported, but I still think there's a chance that this might be resolved in the v1.1.7 release, which addresses several possible problems related to worker process restarts. If you can, please upgrade to this version of the extension to see if it improves the situation. Otherwise, please open an Azure support request so that we can look more closely at our system telemetry to see if we can further root cause what happened.
Thanks @cgillum, we will try that. One question though: would the orchestration function itself time out? Also, when an orchestration function replays and invokes an already-started but not-yet-finished activity function, and it is already more than the 'functionTimeout' duration past the time the activity originally started, would the orchestration function deem the activity function to have already timed out and fail directly, without waiting for an explicit timeout signal from the activity function? The orchestration function was throwing:
No. An orchestrator function's execution is considered "completed" by the Functions runtime whenever it awaits an activity or some other task (even though the orchestration itself remains in a "Running" state). This is why orchestrator functions can run for unlimited periods of time, and also why a long activity execution will not cause the orchestrator function to time out.
If an orchestration successfully schedules an activity function to run, it will never schedule it to run again during a replay. It doesn't matter whether the activity actually started, completed, or is still running.
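(For illustration, here is a minimal sketch of this pattern, assuming the .NET isolated worker programming model; the function and activity names are hypothetical. The orchestrator's invocation ends at the `await` and is replayed later when the activity result arrives, while the activity itself remains subject to `functionTimeout`.)

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class SlowActivityExample
{
    [Function(nameof(RunOrchestrator))]
    public static async Task<string> RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // From the Functions runtime's perspective, this orchestrator invocation
        // "completes" as soon as it awaits the activity; it is re-invoked
        // (replayed) later when the activity result arrives, so functionTimeout
        // never applies to the time spent waiting on the activity.
        return await context.CallActivityAsync<string>("SlowActivity", "some input");
    }

    [Function("SlowActivity")]
    public static async Task<string> SlowActivity([ActivityTrigger] string input)
    {
        // Stand-in for long-running work; the activity itself is still subject
        // to the host's functionTimeout setting while it runs.
        await Task.Delay(TimeSpan.FromMinutes(20));
        return $"done: {input}";
    }
}
```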
Aha, this tells me that the activity failed with an exception, so it was considered "complete" by the framework and therefore doesn't need to be retried. I think this is fundamentally what the issue is - the timeout caused the activity to fail, which is why it doesn't get retried. The fact that the worker process was also restarted wouldn't matter, since it restarted after the failure event was sent to and received by the orchestration. To work around this, you can simply introduce a retry policy for your slow activity function call. That way, the orchestration knows that it should retry the activity function if it sees an exception like this. Documentation on retry policies can be found here: https://learn.microsoft.com/azure/azure-functions/durable/durable-functions-error-handling?tabs=csharp-inproc#automatic-retry-on-failure
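(A rough sketch of such a retry policy, assuming the .NET isolated worker model and a hypothetical activity named `SlowActivity`; the in-process model instead uses `CallActivityWithRetryAsync` with `RetryOptions`, as described in the linked documentation.)

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class RetryExample
{
    [Function(nameof(RunOrchestratorWithRetry))]
    public static async Task<string> RunOrchestratorWithRetry(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // Retry the activity up to 3 times, waiting 30 seconds before the first
        // retry and doubling the wait between subsequent attempts.
        TaskOptions options = TaskOptions.FromRetryPolicy(new RetryPolicy(
            maxNumberOfAttempts: 3,
            firstRetryInterval: TimeSpan.FromSeconds(30),
            backoffCoefficient: 2.0));

        return await context.CallActivityAsync<string>("SlowActivity", "some input", options);
    }
}
```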
Thanks @cgillum!
The case is more complicated than that: the particular invocation of the activity function didn't time out; it executed for around 20 minutes (we set 'functionTimeout' to 1 hour) before being terminated as a result of the worker process restart, which was triggered by a timeout in another function. After the worker restarted, we did not see any sign of the particular activity function being picked up again (no logs). We expected that after the restart the worker would somehow rerun the activity function since it was incomplete, but instead we only observed the above exception being thrown from the owning orchestration function a few minutes after the worker restart, since at that point it was indeed 1 hour past the time when the activity function was scheduled.
I think the confusing part is how the framework could consider the activity function to be complete - the only execution of the activity function was terminated halfway through and never got picked up again. A retry policy probably wouldn't help in our case - since these are cascading timeout events, the retry would most likely get terminated again when the worker process gets restarted. I hope that makes things clearer.
@dxynnez I understand the nuance you're describing, but I'm trying to rationalize it with the error message you shared earlier:
Is the orchestration that encountered this error the parent of the activity call which was killed after 20 minutes (in which case the error message is wrong) and isn't retrying? Or is this error message associated with some other orchestration?
Yes, the orchestrator is the parent of the mentioned activity function. The error was thrown when the orchestrator function got replayed after the worker process restart. I am not sure how many times the orchestrator got replayed, but we did not see any sign of the activity function being re-executed - again, the activity function did not finish its first execution (it got terminated). We expected that after the worker process restart the framework would notice its unfinished state and restart its execution, which never seemed to happen.
Thanks @dxynnez. I'm consulting with a few other engineers on the team, and one theory we have is that the Functions host is not being properly notified when there is a worker process recycle. As a result, your activity function may have been killed after just 20 minutes (due to the recycle), but the host doesn't know this and thinks the activity is still running. It then eventually hits the 1-hour timeout limit and is reported as failed (which is why it doesn't get retried). We're still working through the details, but the bug you're encountering might be in the Azure Functions host. We'll share more updates when we have them.
Thanks @cgillum! That sounds like a reasonable guess, please keep us updated!
Before this can be fixed in the extension, the Functions Host should fix Azure/azure-functions-host#10580 |
Description
Hello team, we have an activity function which runs fairly slowly (we understand that's not best practice and we are working on a rework of the related code). During the execution of this activity function, a timeout in another function triggered a restart of the function process and terminated the execution of said activity function. After the restart, the activity function never seemed to be picked up again, which eventually led to a timeout of the parent orchestration function - this behavior results in cascading timeouts and restarts.
We want to understand what might have caused the activity function to not be picked up. We have set maxConcurrentActivityFunctions to 4000, and according to our telemetry that threshold wasn't hit. After the restart, the function process on the particular worker seemed to be working fine, as we observed telemetry from other functions - just not from the one that got terminated. Thread count, memory, and CPU all look normal for the function app on that worker instance.
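(For reference, a sketch of where these settings live, assuming a standard host.json layout; the values shown are the ones mentioned in this thread.)

```json
{
  "version": "2.0",
  "functionTimeout": "01:00:00",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 4000
    }
  }
}
```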
Expected behavior
The unfinished activity function should be picked up again after the function process restarts.
Actual behavior
The unfinished activity never gets picked up.
App Details