Description
Hi,
I'm having an issue that I don't get locally, it happens in the following scenario:
- I have two cog models deployed on Replicate (as
Deployments
) - one of them at some point calls the other (see snippet below)
- they were built and deployed using
cog==0.13.7
,replicate==1.0.4
, and thecog CLI 0.14.3
,python 3.11
,ubuntu==22.04
Here's how I call one deployment from the other:
from replicate.helpers import base64_encode_file
vectorizer_deployment = replicate.deployments.get(VECTORIZER_DEPLOYMENT)
with open(img_path, "rb") as f:
b64 = base64_encode_file(f)
prediction = vectorizer_deployment.predictions.create(
input={"images": [b64_images]} ,
)
logger.debug(f"{prediction.id=}")
prediction.wait() # this line hangs forever after 30-ish GET requests
The called deployment does complete the inference, and I can see the status as succeeded
on Replicate.
In the logs of the calling deployment, I can see about 30-ish GET requests, all looking like INFO:httpx:HTTP Request: GET https://api.replicate.com/v1/predictions/7atmc23wmsrga0cp7ag9y5s6pm "HTTP/1.1 200 OK"
I have investigated the replicate python client source code, I can see that the prediction.wait()
method calls the '.reload()' method which itselfs performs the GET requests.
I've tried increasing the env var REPLICATE_POLL_INTERVAL
but to no effect.
Strange thing is, as said above, locally it works! ie:
- when I run locally in python the main endpoint everything works well (I run like predictor.predict(...) )
- when I run locally with
cog predict -i ...
, inference goes through, but at the end after my inference completes I get this error log:
{"logger": "cog.server.worker", "timestamp": "2025-04-15T19:11:52.878929Z", "exception": "Traceback (most recent call last):\n File \"/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/cog/server/worker.py\", line 299, in _consume_events\n self._consume_events_inner()\n File \"/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/cog/server/worker.py\", line 337, in _consume_events_inner\n ev = self._events.recv()\n ^^^^^^^^^^^^^^^^^^^\n File \"/root/.pyenv/versions/3.11.10/lib/python3.11/multiprocessing/connection.py\", line 251, in recv\n return _ForkingPickler.loads(buf.getbuffer())\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: URLPath.__init__() missing 3 required keyword-only arguments: 'source', 'filename', and 'fileobj'", "severity": "ERROR", "message": "unhandled error in _consume_events"}
So far I'm clueless as to why everything suddenly hangs, making all my project useless. I guess it's due to the deployed environment.
@zeke @erbridge @meatballhat @aron @mattt
thanks for your help