
API call from deployment to deployment hangs forever #424

Open
@Clement-Lelievre


Hi,

I'm having an issue that I can't reproduce locally. It happens in the following scenario:

  • I have two cog models deployed on Replicate (as Deployments)
  • one of them at some point calls the other (see snippet below)
  • they were built and deployed using cog==0.13.7, replicate==1.0.4, the cog CLI 0.14.3, Python 3.11, and Ubuntu 22.04

Here's how I call one deployment from the other:

import replicate
from replicate.helpers import base64_encode_file

vectorizer_deployment = replicate.deployments.get(VECTORIZER_DEPLOYMENT)

# Base64-encode the image so it can be passed as model input.
with open(img_path, "rb") as f:
    b64 = base64_encode_file(f)

prediction = vectorizer_deployment.predictions.create(
    input={"images": [b64]},
)
logger.debug(f"{prediction.id=}")
prediction.wait()  # this line hangs forever after about 30 GET requests

The called deployment does complete the inference, and I can see its status as succeeded on Replicate.
In the logs of the calling deployment, I can see about 30 GET requests, all looking like INFO:httpx:HTTP Request: GET https://api.replicate.com/v1/predictions/7atmc23wmsrga0cp7ag9y5s6pm "HTTP/1.1 200 OK"

I have looked into the replicate Python client source code: the prediction.wait() method calls the .reload() method, which itself performs the GET requests.
I've tried increasing the REPLICATE_POLL_INTERVAL environment variable, but to no effect.
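To narrow this down, a bounded version of the polling loop could record the status returned by each reload and give up after a deadline instead of blocking forever. The following is only a minimal sketch, assuming the prediction object exposes a status attribute and a reload() method as described above. The names wait_with_timeout, on_poll, sleep, and clock are hypothetical helpers introduced for illustration and are not part of the replicate client.

```python
import time

# Statuses treated as final here; assumed to match what the API reports.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_with_timeout(
    prediction,
    timeout_s=300.0,
    poll_interval_s=0.5,
    on_poll=None,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll `prediction` until it reaches a terminal status or the deadline passes.

    `prediction` is any object with a `status` attribute and a `reload()`
    method (hypothetical interface, modelled on the description above).
    `on_poll`, if given, is called with the status seen after each reload,
    which makes it easy to log what every GET request actually returned.
    """
    deadline = clock() + timeout_s
    while prediction.status not in TERMINAL_STATUSES:
        if clock() >= deadline:
            raise TimeoutError(
                f"prediction still {prediction.status!r} after {timeout_s}s"
            )
        sleep(poll_interval_s)
        prediction.reload()
        if on_poll is not None:
            on_poll(prediction.status)
    return prediction.status
```

Called in place of prediction.wait(), for example as wait_with_timeout(prediction, timeout_s=120, on_poll=logger.debug), this would show whether the polled status ever changes to succeeded inside the calling deployment, or whether the loop keeps seeing a non-terminal status.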

The strange thing is that, as said above, it works locally, i.e.:

  • when I run the main endpoint locally in Python (calling predictor.predict(...) directly), everything works well
  • when I run locally with cog predict -i ..., inference goes through, but after it completes I get this error log:
    {"logger": "cog.server.worker", "timestamp": "2025-04-15T19:11:52.878929Z", "exception": "Traceback (most recent call last):\n File \"/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/cog/server/worker.py\", line 299, in _consume_events\n self._consume_events_inner()\n File \"/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/cog/server/worker.py\", line 337, in _consume_events_inner\n ev = self._events.recv()\n ^^^^^^^^^^^^^^^^^^^\n File \"/root/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/connection.py\", line 251, in recv\n return _ForkingPickler.loads(buf.getbuffer())\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: URLPath.__init__() missing 3 required keyword-only arguments: 'source', 'filename', and 'fileobj'", "severity": "ERROR", "message": "unhandled error in _consume_events"}

So far I'm clueless as to why everything suddenly hangs, which makes my whole project unusable. I guess it's due to the deployed environment.

@zeke @erbridge @meatballhat @aron @mattt

Thanks for your help.
