Consider starting the renewal of messages in DTFx.Core as soon as they are fetched #1150

@davidmrdavid

Description

In DTFx.Core, the method RenewTaskOrchestrationWorkItemLockAsync is used to ensure a given worker maintains exclusivity over a given partition message. For example, in the Azure Storage backend, this method renews the message's visibility timeout so that the message is not dequeued again, at least not until the extended timeout expires.

This renewal flow is invoked only once the message is being processed, which has a very specific meaning: we have not exceeded the "maxConcurrentOrchestrations" / "maxConcurrentActivities" limit, and therefore have enough capacity to process more messages.

This means that a message may be received by a given worker but not become processable for a long time, if the active orchestrators/activities are at their "max concurrent" limits and are long-running. During that time we are not extending the message's visibilityTimeout, so the message can become visible again and be dequeued a second time (possibly by the same worker that already holds it!). That second dequeue changes its popReceipt, which in turn prevents us from successfully processing the copy of the message that carries the old popReceipt. This can lead to a cascade of errors.
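
To make the failure mode concrete, here is a minimal sketch in Python (chosen only for brevity). `ToyQueue` is a hypothetical in-memory stand-in, not the Azure Storage SDK or DTFx code, and it assumes a simplified model in which every dequeue issues a fresh receipt and delete only succeeds with the most recent one.

```python
import itertools


class ToyQueue:
    """Hypothetical in-memory queue with visibility timeouts and pop receipts."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.now = 0.0                       # fake clock, advanced by hand
        self._receipts = itertools.count(1)
        self._messages = {}                  # msg_id -> [visible_at, latest_receipt]

    def put(self, msg_id):
        self._messages[msg_id] = [self.now, None]

    def dequeue(self):
        """Return (msg_id, receipt) for the first visible message, or None."""
        for msg_id, state in self._messages.items():
            if state[0] <= self.now:
                state[0] = self.now + self.visibility_timeout  # hide it again
                state[1] = next(self._receipts)                # new receipt
                return msg_id, state[1]
        return None

    def delete(self, msg_id, receipt):
        """Delete succeeds only with the most recently issued receipt."""
        state = self._messages.get(msg_id)
        if state is None or state[1] != receipt:
            return False
        del self._messages[msg_id]
        return True


queue = ToyQueue(visibility_timeout=30)
queue.put("m1")
_, old_receipt = queue.dequeue()   # worker fetches the message...
queue.now += 31                    # ...but waits past the timeout without renewing
_, new_receipt = queue.dequeue()   # the message is handed out a second time

print(queue.delete("m1", old_receipt))  # False: the first copy is now stale
print(queue.delete("m1", new_receipt))  # True
```

The worker holding `old_receipt` can no longer complete the message, which is the situation described above.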

I believe the framework-level fix for this is to start renewing messages as soon as they're fetched/received, not just when they're being processed. This may require some refactoring in DTFx.Core's WorkItemDispatcher class, so it needs to be done with care.
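
A rough illustration of the proposed ordering, again in Python rather than C#, and not based on the actual WorkItemDispatcher code: the renewal loop starts as soon as an item is fetched, and the concurrency limit (modelled here by a semaphore) only gates processing. All names (`handle_fetched_item`, `renew`, `process`, `slots`) are hypothetical.

```python
import asyncio


async def renew_until_cancelled(renew_once, interval):
    """Call renew_once() every `interval` seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval)
        await renew_once()


async def handle_fetched_item(item, renew, process, slots, interval):
    # Start renewing immediately, before waiting for a processing slot.
    renewal = asyncio.create_task(
        renew_until_cancelled(lambda: renew(item), interval)
    )
    try:
        async with slots:              # models the "max concurrent" limit
            await process(item)
    finally:
        renewal.cancel()               # stop renewing once the item is done
        try:
            await renewal
        except asyncio.CancelledError:
            pass


async def run_demo():
    slots = asyncio.Semaphore(1)       # only one item may be processed at a time
    renewals = {"a": 0, "b": 0}
    renewals_before_start = {}

    async def renew(item):
        renewals[item] += 1

    async def process(item):
        # Record how many renewals happened while the item was still waiting.
        renewals_before_start[item] = renewals[item]
        await asyncio.sleep(0.2)       # simulate long-running work

    await asyncio.gather(
        handle_fetched_item("a", renew, process, slots, 0.02),
        handle_fetched_item("b", renew, process, slots, 0.02),
    )
    return renewals_before_start
```

Running `asyncio.run(run_demo())` should show that item "b" was renewed several times while it waited behind "a", whereas under the current behaviour it would not be renewed at all until it acquired a slot. A real change would also need to decide what happens when a renewal fails while the item is still queued, which this sketch does not address.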
