This repository was archived by the owner on Sep 18, 2023. It is now read-only.

TimeoutError: Deadline exceeded on activities taking greater than 120 seconds #15

@chad-codes

Description

Here's the stack trace.

2021-05-13 13:22:37,579 | ERROR | retry.py:retry_loop:29 | run failed: Deadline exceeded, retrying in 3 seconds
Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 360, in recv_initial_metadata
    headers = await self._stream.recv_headers()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/protocol.py", line 349, in recv_headers
    await self.headers_received.wait()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/asyncio/locks.py", line 309, in wait
    await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/retry.py", line 17, in retry_loop
    await fp(*args, **kwargs)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/decision_loop.py", line 1083, in run
    decision_task: PollWorkflowTaskQueueResponse = await self.poll()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/decision_loop.py", line 1165, in poll
    task = await self.service.poll_workflow_task_queue(
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/api/workflowservice/v1.py", line 828, in poll_workflow_task_queue
    return await self._unary_unary(
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/betterproto/__init__.py", line 1133, in _unary_unary
    response = await stream.recv_message()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 408, in recv_message
    await self.recv_initial_metadata()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 380, in recv_initial_metadata
    self.initial_metadata = im
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/utils.py", line 70, in __exit__
    raise self._error
asyncio.exceptions.TimeoutError: Deadline exceeded

Here's what I've uncovered:

The python-sdk starts up two parallel threads:

  • One to handle workflows (runs workflow logic and determines which activity needs to run)
  • One to handle activities (runs activity code)

Both threads share a single gRPC channel to the Temporal server, with a 120-second timeout on each call. Concurrent RPC calls over one channel are supported according to the grpclib docs: https://grpclib.readthedocs.io/en/latest/client.html
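For reference, the sharing pattern looks roughly like this. This is a sketch, not the SDK's code: grpclib's Channel is real, but the stub and request names are placeholders standing in for the betterproto-generated ones in temporal/api/workflowservice/v1.py.

    import asyncio
    from grpclib.client import Channel

    # Placeholder stub module; the real stubs are generated by betterproto.
    from my_stubs import (
        WorkflowServiceStub,
        PollWorkflowTaskQueueRequest,
        PollActivityTaskQueueRequest,
    )

    async def demo():
        channel = Channel("localhost", 7233)  # one channel shared by both pollers
        stub = WorkflowServiceStub(channel)
        try:
            # grpclib multiplexes concurrent RPCs over one HTTP/2 connection;
            # each call carries its own 120-second deadline.
            await asyncio.gather(
                stub.PollWorkflowTaskQueue(PollWorkflowTaskQueueRequest(), timeout=120),
                stub.PollActivityTaskQueue(PollActivityTaskQueueRequest(), timeout=120),
            )
        finally:
            channel.close()

    asyncio.run(demo())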

The workflow thread polls the workflow task queue and the activity thread polls the activity task queue. Each poll is a long poll: if nothing is returned within 60 seconds, the thread continues its while loop and polls again. When the activity thread receives a task from the activity task queue, it starts running the activity code. Meanwhile, the workflow thread is in the middle of polling the workflow task queue.
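In other words, each poller boils down to this loop (an illustrative sketch; the handler and the empty-response check are assumptions, not the SDK's actual source):

    async def poll_loop(poll_once, handle_task):
        # poll_once issues one long-poll RPC: the server holds the request
        # open for up to ~60 seconds and returns an empty response if no task
        # arrived, while the client-side deadline on the call is 120 seconds.
        while True:
            task = await poll_once()   # normally returns within ~60 seconds
            if task is None:           # empty long-poll response
                continue               # go straight back to polling
            await handle_task(task)    # run workflow logic or activity code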

What I'm noticing is that the workflow poll request is "blocked" and doesn't return after 60 seconds like it usually would. It doesn't complete until the activity in the other thread finishes. If an activity takes long enough, the workflow poll request can exceed 120 seconds (the timeout mentioned earlier), causing a deadline-exceeded error.

Solution:

  • Figure out why the workflow poll request is "stuck" and can't complete while an activity is running in the other thread.

A couple of temporary workarounds:

  • Set the timeout to the maximum time we expect activities to take. I don't know whether there are consequences to this.

OR

  • Don't worry about the deadline exceeded error: when it happens, the loop immediately continues and polls the workflow task queue again (sketched after this list).
    • The downside is that the error clutters the worker logs.
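For the second workaround, the change would live in the retry loop from the stack trace (temporal/retry.py). Here's a sketch, assuming a retry_loop shaped like the one in the traceback; the retry_delay parameter and logging setup are mine, and the real signature may differ. The first workaround needs no loop change at all, just a larger value wherever the 120-second deadline is configured.

    import asyncio
    import logging

    logger = logging.getLogger(__name__)

    async def retry_loop(fp, *args, retry_delay=3, **kwargs):
        while True:
            try:
                await fp(*args, **kwargs)
            except asyncio.TimeoutError:
                # Expected while a long activity holds the poll past the
                # 120-second deadline; re-poll immediately, keep logs quiet.
                logger.debug("poll deadline exceeded, re-polling")
            except Exception as ex:
                logger.error("run failed: %s, retrying in %s seconds", ex, retry_delay)
                await asyncio.sleep(retry_delay)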
