Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in onSnapshot code crashes the process #2280

Closed
jerzozwierz opened this issue Jan 8, 2025 · 6 comments
Closed

Race condition in onSnapshot code crashes the process #2280

jerzozwierz opened this issue Jan 8, 2025 · 6 comments
Assignees
Labels
api: firestore Issues related to the googleapis/nodejs-firestore API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@jerzozwierz
Copy link

Bottom line up front

When the conditions are just right, it's possible for a grpc stream (underlying an onSnapshot query) to emit an error event while not having any error handlers, killing the node process. It's rare, but our server's heavy traffic causes it to happen multiple times per day.

  1. Is this a client library issue or a product issue?
    Client library. A transient error isn't properly handled by the library under certain conditions.

  2. Did someone already solve this?
    None of my searches have yielded results.

  3. Do you have a support contract?
    I don't

Environment details

  • OS: Debian GNU/Linux 12 (inside Docker)
  • Node.js version: 20.18.0
  • npm version: 10.8.2
  • @google-cloud/firestore version: 7.4.0

Steps to reproduce

While I can't write code that reliably reproduces the problem (it's a low probability race condition), I can point to a code path that is hit and support my hypothesis with logs emitted by our server.

https://github.com/googleapis/nodejs-firestore/blob/main/dev/src/watch.ts#L483-L513

// inside Watch.initStream() method
...
        return this.firestore
          .requestStream( // (1)
            'listen',
            /* bidirectional= */ true,
            request,
            this.requestTag
          )
          .then(backendStream => {
            if (!this.isActive) { // (2)
              logger(
                'Watch.initStream',
                this.requestTag,
                'Closing inactive stream'
              );
              backendStream.emit('end');
              return; // (3)
            }
            logger('Watch.initStream', this.requestTag, 'Opened new stream');
            this.currentStream = backendStream;
            this.resetIdleTimeout();

            this.currentStream!.on('data', (proto: api.IListenResponse) => {
              this.resetIdleTimeout();
              this.onData(proto);
            })
              .on('error', err => { // (4)
                if (this.currentStream === backendStream) {
                  this.currentStream = null;
                  this.maybeReopenStream(err);
                }
              })
  1. First thing that must happen is a query being closed very soon after opening. This way, before the async call (1), the .isActive field is flipped back to false. Subsequently, the code enters the if clause (2). Note that, in this path, we exit the function at (3) and backendStream doesn't get an error handler like it does in (4).
  2. Secondly, the newly opened backendStream must emit an error (it can be anything, like a transient networking issue). The fact that it receives an end event earlier doesn't preclude it from doing so.
  3. Consequently, that emitted error crashes the node process.

Logs (I have selectively enabled certain firestore's internal logs to be formatted and logged to console):

T21:39:11.502801006Z Sending request: { [redacted] } { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T21:39:11.503399310Z Marking stream as healthy { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T21:39:11.503738877Z Stream ready { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T21:39:11.504049930Z Closing inactive stream { requestTag: 'rvKwm', fun: '[Watch.initStream]' }
T21:39:11.504533921Z Received stream end { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T22:39:12.017741217Z Received stream error: Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)
T22:39:12.017790903Z     at callErrorFromStatus (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
T22:39:12.017796528Z     at Object.onReceiveStatus (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:420:73)
T22:39:12.017799765Z     at Object.onReceiveStatus (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
T22:39:12.017802736Z     at /usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78
T22:39:12.017805701Z     at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
T22:39:12.017808736Z for call at
T22:39:12.017812711Z     at ServiceClientImpl.makeBidiStreamRequest (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:404:32)
T22:39:12.017815664Z     at ServiceClientImpl.<anonymous> (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
T22:39:12.017819232Z     at /usr/src/app/node_modules/@google-cloud/firestore/build/src/v1/firestore_client.js:239:29
T22:39:12.017839883Z     at /usr/src/app/node_modules/google-gax/build/src/streamingCalls/streamingApiCaller.js:46:28
T22:39:12.017842602Z     at /usr/src/app/node_modules/google-gax/build/src/normalCalls/timeout.js:44:16
T22:39:12.017845234Z     at StreamProxy.setStream (/usr/src/app/node_modules/google-gax/build/src/streamingCalls/streaming.js:419:24)
T22:39:12.017847960Z     at StreamingApiCaller.call (/usr/src/app/node_modules/google-gax/build/src/streamingCalls/streamingApiCaller.js:54:16)
T22:39:12.017863772Z     at /usr/src/app/node_modules/google-gax/build/src/createApiCall.js:112:30
T22:39:12.017866637Z     at runNextTicks (node:internal/process/task_queues:60:5)
T22:39:12.017869664Z     at listOnTimeout (node:internal/timers:545:9)
T22:39:12.017872719Z     at process.processTimers (node:internal/timers:519:7) {
T22:39:12.017875762Z        code: 13,
T22:39:12.017878562Z        details: 'Received RST_STREAM with code 2 (Internal server error)',
T22:39:12.017881255Z        metadata: Metadata { internalRepr: Map(0) {}, options: {} }
T22:39:12.017883953Z } { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T22:39:12.017886915Z Received stream end { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T22:39:12.019839360Z node:events:496
T22:39:12.019872407Z     throw er; // Unhandled 'error' event
T22:39:12.019876840Z ^
T22:39:12.019879811Z
T22:39:12.019882613Z Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)
@jerzozwierz jerzozwierz added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Jan 8, 2025
@product-auto-label product-auto-label bot added the api: firestore Issues related to the googleapis/nodejs-firestore API. label Jan 8, 2025
@ehsannas ehsannas self-assigned this Jan 10, 2025
@ehsannas
Copy link
Contributor

Thanks for reporting @jerzozwierz . I'll take a look

@jerzozwierz
Copy link
Author

jerzozwierz commented Jan 13, 2025

Thanks @ehsannas, let me know if I can be of any further help! In the meantime, I wanted to add two dots:

  • After upgrading firestore lib to the latest (7.11.0), we're still seeing the same problem
  • In all instances, the unhandled error appears exactly 1 hour after ending the stream (one can see it in the log timestamps above)

@ehsannas
Copy link
Contributor

Thanks for the information, @jerzozwierz . I think it is true that a stream may receive an error even after emitting end. If your hypothesis about the series of events is correct, #2283 should fix this issue. Could you apply this fix and let me know if it helps?

@jerzozwierz
Copy link
Author

Hi, I deployed our server with the patch you suggested (and added a log in the error handler). I'm waiting for that log to appear, I'll let you know

@ehsannas
Copy link
Contributor

Thanks @jerzozwierz

@jerzozwierz
Copy link
Author

Hi @ehsannas, I can confirm: our server no longer crashes and I see logs indicating that our new error handler is invoked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: firestore Issues related to the googleapis/nodejs-firestore API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

3 participants