Description
In DTFx.Core, the method RenewTaskOrchestrationWorkItemLockAsync is used to ensure a given worker maintains exclusivity over a given partition message. For example, in the Azure Storage backend, this method renews the "message visibility timeout" so that the message does not get dequeued again, at least not until the renewed visibility timeout expires.
This renewal flow is invoked only while the message is being processed, which has a very specific meaning: we have not exceeded the "maxConcurrentOrchestrations" / "maxConcurrentActivities" limit, and therefore have enough capacity to process more messages.
This means that a message may be received by a given worker, but not become processable for a long time if the number of active orchestrators/activities has reached the "max concurrent" setting and they are long-running. In that time, since we're not actively extending the message's visibilityTimeout, it is possible for the message to become visible again (possibly being dequeued by the same worker that already has that message!), which changes its popReceipt. That in turn prevents us from successfully processing the copy of the message that carries the old popReceipt. This can lead to a cascade of errors.
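To make the failure mode concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Azure Storage SDK and not DTFx code: `ToyQueue`, its methods, and the 30-second timeout are all hypothetical names and values chosen for illustration. It only assumes what is described above, namely that each dequeue issues a new pop receipt and that an operation with an outdated receipt is rejected.

```python
import itertools


class ToyQueue:
    """Hypothetical in-memory stand-in for a queue with visibility timeouts."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._receipts = itertools.count(1)
        self._messages = {}  # message id -> [visible_at, current pop receipt]

    def put(self, msg_id):
        self._messages[msg_id] = [0.0, None]

    def dequeue(self, now):
        """Return (msg_id, pop_receipt) for the first visible message, or None."""
        for msg_id, state in self._messages.items():
            if state[0] <= now:
                state[0] = now + self.visibility_timeout
                state[1] = next(self._receipts)  # every dequeue issues a new receipt
                return msg_id, state[1]
        return None

    def delete(self, msg_id, pop_receipt):
        """Delete succeeds only when the caller holds the current pop receipt."""
        state = self._messages.get(msg_id)
        if state is None or state[1] != pop_receipt:
            return False
        del self._messages[msg_id]
        return True


queue = ToyQueue(visibility_timeout=30.0)
queue.put("m1")

# t=0: the worker fetches the message but has no free slot, so it only buffers it.
first = queue.dequeue(now=0.0)

# t=45: nothing renewed the timeout, so the message is visible and is dequeued again.
second = queue.dequeue(now=45.0)

# t=50: a slot frees up and the worker tries to complete the buffered (old) copy.
stale_delete_ok = queue.delete(first[0], first[1])
fresh_delete_ok = queue.delete(second[0], second[1])

print(first, second, stale_delete_ok, fresh_delete_ok)
```

In this model the buffered copy holds a receipt that is no longer current, so completing it fails, while only the most recently dequeued copy can be completed.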
I believe the framework-level fix for this is to start renewing messages as soon as they're fetched/received, not just when they're being processed. This may require some refactoring in DTFx.Core's WorkItemDispatcher class, so it needs to be done with care.
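A rough sketch of the proposed behavior, again as a toy Python model rather than actual DTFx or Azure Storage code. `ToyQueue`, `BufferingWorker`, and `renew_all` are hypothetical names; the real change would live in WorkItemDispatcher and go through RenewTaskOrchestrationWorkItemLockAsync. As a simplification, the toy keeps the receipt unchanged on renewal, which a real backend may not do.

```python
class ToyQueue:
    """Hypothetical in-memory queue with visibility timeouts and renewal."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._next_receipt = 1
        self._messages = {}  # message id -> {"visible_at", "receipt"}

    def put(self, msg_id):
        self._messages[msg_id] = {"visible_at": 0.0, "receipt": None}

    def dequeue(self, now):
        """Return (msg_id, receipt) for the first visible message, or None."""
        for msg_id, state in self._messages.items():
            if state["visible_at"] <= now:
                state["visible_at"] = now + self.visibility_timeout
                state["receipt"] = self._next_receipt
                self._next_receipt += 1
                return msg_id, state["receipt"]
        return None

    def renew(self, msg_id, receipt, now):
        """Extend the visibility timeout if the receipt is still current."""
        state = self._messages.get(msg_id)
        if state is None or state["receipt"] != receipt:
            return False
        state["visible_at"] = now + self.visibility_timeout
        return True


class BufferingWorker:
    """Tracks every fetched message, whether or not processing has started."""

    def __init__(self, queue):
        self.queue = queue
        self.buffered = {}  # message id -> receipt

    def fetch(self, now):
        item = self.queue.dequeue(now)
        if item is not None:
            self.buffered[item[0]] = item[1]
        return item

    def renew_all(self, now):
        """Renew everything held by this worker, including unprocessed messages."""
        return all(self.queue.renew(m, r, now) for m, r in self.buffered.items())


queue = ToyQueue(visibility_timeout=30.0)
queue.put("m1")
worker = BufferingWorker(queue)

worker.fetch(now=0.0)  # fetched, but no capacity to process yet

# Renewal runs on a timer shorter than the visibility timeout.
renewed = [worker.renew_all(now=t) for t in (20.0, 40.0)]

# t=45: without renewal the message would be visible again; here it is not.
refetched = queue.dequeue(now=45.0)

print(renewed, refetched)
```

The design point is that renewal is keyed off "this worker holds the message" rather than "this worker is processing the message", so a message waiting behind the concurrency limit never becomes visible again.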