Context
We noticed this issue in our production and dev environments in Firebase, and launched an investigation together with GCP support (Support Case 49715474). GCP acknowledged the strange behaviour, but was ultimately unhelpful, offering generic advice on partial mitigations that do not work in our use case and do not solve the issue. I continued the investigation on my own, and I think I have pinned down the cause.
The issue seems to be rather generic and, in my opinion, very serious (silent failure -> no recovery until the container goes out of scope), and it happens for at least one use case which I presume is common in Firebase Functions: reading a document in Firestore using the firebase_admin library (and hence the google-cloud-firestore client). In my view it renders Firebase Functions for Python near-unusable if one needs to do operations on Firestore.
I decided to raise an issue here in an attempt to get some clarity; perhaps I am missing something. Please bear with me, as reproducing it takes a bit of setup. Due to long feedback cycles (the function needs to be redeployed for every code change), I am only reporting my particular use case, but I would be surprised if the issue were isolated to the google-cloud-firestore client.
I'd be very grateful for any suggestions to solve or mitigate this issue in the meanwhile, as we are completely blocked on it right now.
I have raised a twin issue on firebase-admin-python, as it also concerns that library: firebase/firebase-admin-python#774
Description
In the event of a cold start, the Python function silently freezes on an attempt to read a document in Firestore (firestore.client().document.get()) if multiple invocations are requested in a short period of time (in my tests, as few as 5 invocations submitted as quickly as possible were enough).
Further attempts to invoke the function do not produce any result.
This behaviour is observed in both http-triggered and firestore-triggered functions.
Reproduction
Environment
Standard Firebase Functions for Python setup, as suggested by the official documentation.
Here I provide minimal code to reproduce the behaviour. I provide the code for an http-triggered function, as it is easier to test (a firestore-triggered function additionally needs a service key to write to the db, etc., but the behaviour is exactly the same). The relevant files are firebase.json, functions-python/requirements.txt and functions-python/main.py; I added print statements before every line of code so that it is apparent where the function freezes.
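The firebase.json simply declares the Python codebase. A rough sketch (the codebase name is taken from the deploy command below; the runtime value is an assumption):

```json
{
  "functions": [
    {
      "source": "functions-python",
      "codebase": "functions-python",
      "runtime": "python311"
    }
  ]
}
```

And functions-python/main.py boils down to something like the sketch below (the exact function name is inferred from the endpoint URL, and the print statements are illustrative, not verbatim):

```python
# functions-python/main.py -- minimal sketch of the reproduction function
from firebase_admin import firestore, initialize_app
from firebase_functions import https_fn

initialize_app()


@https_fn.on_request()
def http_trig_read_fs_l(req: https_fn.Request) -> https_fn.Response:
    print("creating firestore client")
    client = firestore.client()
    print("reading test/testDoc")  # the function silently freezes here on a cold start under load
    snapshot = client.collection("test").document("testDoc").get()
    print("read finished")
    return https_fn.Response(str(snapshot.to_dict()))
```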
Then I deploy the function to a Firebase project (on the Blaze plan):
npx firebase-tools --version
12.5.3
npx firebase-tools deploy --only functions:functions-python
Additional setup:
In Firestore, create a collection "test" with a document "testDoc" (any content will do, e.g. {"test": "test"}). I did this manually through the Firebase Console. In reality this is not strictly necessary, as the issue is reproducible even without it.
In Cloud Run -> open the function -> Security -> allow unauthenticated invocations.
Reproduction and results
I use hey to generate requests:
hey -n 10 -c 10 https://http-trig-read-fs-l-<...>-uc.a.run.app
Summary:
Total: 20.0057 secs
Slowest: 0.0000 secs
Fastest: 0.0000 secs
Average: NaN secs
Requests/sec: 0.4999
Response time histogram:
Latency distribution:
Details (average, fastest, slowest):
DNS+dialup: NaN secs, 0.0000 secs, 0.0000 secs
DNS-lookup: NaN secs, 0.0000 secs, 0.0000 secs
req write: NaN secs, 0.0000 secs, 0.0000 secs
resp wait: NaN secs, 0.0000 secs, 0.0000 secs
resp read: NaN secs, 0.0000 secs, 0.0000 secs
Status code distribution:
Error distribution:
[10] Get "https://http-trig-read-fs-l-<...>-uc.a.run.app": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
As you can see, the endpoint is unresponsive. In the Cloud Run logs we can see the issue (I am posting the end of the log; there are no further log statements in response to this run):
The container becomes completely unresponsive, and the functions run until they time out (this can be seen in the Cloud Run metrics: run times just above 1 minute correspond to one of the load tests).
However, if after the container goes out of scope, or after the function is redeployed to create a new container revision, I first make a single request to "wake up" the container, it can then handle very high loads:
hey -n 1 -c 1 https://http-trig-read-fs-l-<...>-uc.a.run.app
hey -n 100 -c 100 https://http-trig-read-fs-l-<...>-uc.a.run.app
Conclusion
This issue seems extremely impactful for a very common use case, as it leads to silent failure on the backend and the backend becoming unresponsive for a long period of time. It is also quite difficult to catch and trace, as it depends on cold/warm starts. We have seen our backend fail from time to time for weeks without any apparent cause, until we were able to pin down this issue. GCP support has not been helpful in debugging or fixing it.
At this point, the only viable paths forward we see are to rewrite our backend as Node functions (which might not have this issue, but I have yet to test that), or to move to another tech stack, migrating away from Firebase completely.
Same issue. Until this gets fixed, to avoid the timeout errors I had to set the minimum number of instances to 1 (which incurs an additional cost with GCP); a sketch of this is shown below.
Also, the timeouts were commonly associated with various internet bots invoking the functions, with several requests being generated during a cold start before the first request could be fulfilled. So @antopolskiy's analysis, and the use of hey to generate a small load, are accurate and essential to recreate the problem.
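For reference, a minimal sketch of that min-instances workaround, assuming the firebase_functions decorator accepts min_instances as an option (the handler body is just a placeholder):

```python
from firebase_functions import https_fn

# Keeping one instance warm means the first burst of traffic never lands on a
# cold container, at the cost of an always-on instance.
@https_fn.on_request(min_instances=1)
def my_endpoint(req: https_fn.Request) -> https_fn.Response:
    # ... normal handler logic, including the Firestore read ...
    return https_fn.Response("ok")
```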
I suspect this is due to the issue described in grpc/grpc#36265 (grpcio is a dependency of google-cloud-firestore). grpc/grpc#36266 fixes this, but I suspect the fix won't be available until the next release of the grpcio library (1.63.0).
Pinning grpcio==1.58.0 in my requirements.txt fixed the issue in my reproduction.
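In case it helps anyone applying the same workaround, the pin is just an extra line in functions-python/requirements.txt, roughly as sketched here (the other entries stand in for whatever the project already depends on):

```text
# functions-python/requirements.txt (sketch)
firebase-functions
firebase-admin
grpcio==1.58.0  # workaround: pin grpcio until the fix from grpc/grpc#36266 is released (expected in 1.63.0)
```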