WebRTCTransport.dial AbortError #2702

Closed
christroutner opened this issue Sep 13, 2024 · 17 comments · Fixed by #2986, #2987 or Permissionless-Software-Foundation/helia-coord#57
Labels
need/author-input Needs input from the original author

Comments

@christroutner

christroutner commented Sep 13, 2024

  • Version:

  • libp2p v1.9.1

  • Platform:

  • Linux hp-elitedesk01 5.15.0-91-generic #101~20.04.1-Ubuntu SMP Thu Nov 16 14:22:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • Subsystem:

  • WebRTC

Severity:

  • Critical - System crash, application panic.

Description:

I previously filed an issue about problems I was having with the @libp2p/webrtc package. That was resolved; the current package versions can be seen here and the code for initializing libp2p can be found here.

I'm now encountering what appears to be a race condition inside the WebRTC libraries. The node runs for a while and then crashes at a random point with the following error message:

file:///home/safeuser/ipfs-service-provider/node_modules/race-signal/dist/src/index.js:22
        return Promise.reject(new AbortError(opts?.errorMessage, opts?.errorCode, opts?.errorName));
                              ^


AbortError: The operation was aborted
    at raceSignal (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/race-signal/dist/src/index.js:22:31)
    at YamuxStream.closeWrite (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/@libp2p/utils/dist/src/abstract-stream.js:230:19)
    at YamuxStream.close (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/@libp2p/utils/dist/src/abstract-stream.js:189:18)
    at file:///home/safeuser/ipfs-bch-wallet-service/node_modules/libp2p/dist/src/connection/index.js:118:63
    at Array.map (<anonymous>)
    at ConnectionImpl.close (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/libp2p/dist/src/connection/index.js:118:44)
    at initiateConnection (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/@libp2p/webrtc/dist/src/private-to-private/initiate-connection.js:146:34)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async WebRTCTransport.dial (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/@libp2p/webrtc/dist/src/private-to-private/transport.js:93:65)
    at async DefaultTransportManager.dial (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/libp2p/dist/src/transport-manager.js:87:20)
    at async queue.add.peerId.peerId [as fn] (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/libp2p/dist/src/connection-manager/dial-queue.js:168:38)
    at async raceSignal (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/race-signal/dist/src/index.js:28:16)
    at async Job.run (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/@libp2p/utils/dist/src/queue/job.js:55:28) {
  type: 'aborted',
  code: 'ABORT_ERR'
}

Node.js v20.17.0

Steps to reproduce the error:

The error does not occur right away. It will appear at some point within 30 minutes while the node is running. It forces the app to crash and the process manager will restart it. But then the crash will happen again within 30 minutes.

@christroutner
Author

This might be the same issue I reported in #2462. I'll take a closer look by replacing my node_modules and package-lock.json files and report back here.

However, I don't think this is the same issue, because I'm building the application into a Docker container with the --no-cache flag, so it should be installing the node_modules folder from scratch. The package-lock.json file would still be copied from the repository, though, so maybe that is the issue.

I'll report back on my findings.

@christroutner
Author

christroutner commented Sep 17, 2024

I carefully deleted my node_modules folder and package-lock.json file before installing dependencies and I'm still getting the above error. As far as I can see it does not have anything to do with an unclean install as was claimed in #2462.

The main target that I'm testing is a libp2p node set up as a Circuit Relay server.

@Chomtana

TURN works for regular internet connections across countries without this error, but it doesn't function properly with restrictive VPNs. This error indicates that WebRTC has failed to establish a connection with the peer.

@christroutner
Author

I wouldn't mind if WebRTC fails to connect, but this error causes the application to crash and exit, and there doesn't seem to be any way to wrap it with try/catch to handle the exception.

@cristianmadularu

cristianmadularu commented Oct 24, 2024

This is happening for us as well causing our Node processes to crash.

image

@cristianmadularu

cristianmadularu commented Oct 24, 2024

I wouldn't mind if WebRTC fails to connect, but this error causes the application to crash and exit, and there doesn't seem to be any way to wrap it with try/catch to handle the exception.

image

@christroutner while this is not a 'solution' (more of a temporary workaround), you might consider an application-level handler so that the application does not crash when that type of exception goes unhandled.
It's a risky approach, since there is no guarantee that the app is still in a good state, but it is a usable, if ugly, workaround until this gets fixed.
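
A minimal sketch of such an application-level handler, assuming Node.js. It ignores only rejections that look like the AbortError from the stack trace above (`code: 'ABORT_ERR'`, `type: 'aborted'`) and still exits on anything else. The helper names `isKnownAbortError` and `handleUnhandledRejection` are hypothetical, not part of libp2p:

```javascript
// Hypothetical helper: does this rejection look like the AbortError in this thread?
// The `code` and `type` values are copied from the stack trace posted above.
function isKnownAbortError (reason) {
  return reason != null && reason.code === 'ABORT_ERR' && reason.type === 'aborted'
}

// Hypothetical handler. `log` and `exit` are injectable so the logic can be tested
// without terminating the process.
function handleUnhandledRejection (
  reason,
  log = console.error,
  exit = (code) => process.exit(code)
) {
  if (isKnownAbortError(reason)) {
    // Known failure mode: log it and keep the process alive.
    log('Ignoring unhandled AbortError:', reason.stack ?? reason)
    return 'ignored'
  }

  // Anything else is unexpected, so keep the default crash-and-restart behaviour.
  log('Fatal unhandled rejection:', reason)
  exit(1)
  return 'fatal'
}

// Register as early as possible, for example in the first file that gets executed.
process.on('unhandledRejection', (reason) => handleUnhandledRejection(reason))
```

Filtering on the error shape keeps the workaround narrow: unrelated unhandled rejections still bring the process down so the process manager can restart it.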

@christroutner
Author

I appreciate the tip @cristianmadularu.

I ended up just disabling WebRTC in my application until this issue can be resolved. It would be great to have, but it's not a core requirement.

@silkroadnomad

@christroutner if I remove WebRTC from the transports on Node.js, can browsers still use circuit-relay, autonat, and dcutr to connect peer-to-peer to each other (via WebRTC) when both arrive via wss or WebTransport?

@christroutner
Author

christroutner commented Jan 26, 2025

My understanding is that if you remove WebRTC, then circuit-relay is not possible. I don't know much about the other protocols mentioned in your question.

@achingbrain
Member

achingbrain commented Feb 3, 2025

Browsers can listen on circuit relay addresses where the relayed connection is established over WebSockets/WebTransport, but any incoming connections will be time/data limited so it's only useful under certain conditions.

For two browsers to upgrade to an unlimited direct connection you need WebRTC.

@achingbrain achingbrain added need/analysis Needs further analysis before proceeding and removed need/triage Needs initial labeling and prioritization labels Feb 4, 2025
achingbrain added a commit to achingbrain/ipfs-service-provider that referenced this issue Feb 20, 2025
While investigating libp2p/js-libp2p#2702
I've had this running for almost 12 hours without a crash.

The only changes I've made are to upgrade the libp2p/Helia deps and
to enable the WebRTC/WebRTC Direct transports and add a WebRTC Direct
listener.

This PR just upgrades the Helia/libp2p deps.
@achingbrain
Member

Steps to reproduce the error:

The error does not occur right away. It will appear at some point within 30 minutes while the node is running. It forces the app to crash and the process manager will restart it. But then the crash will happen again within 30 minutes.

@christroutner I've been running the ipfs-service-provider all day and haven't seen a single crash.

The deps were quite out of date so that might have something to do with it. I've opened a PR (Permissionless-Software-Foundation/ipfs-service-provider#168) that updates them.

I will open a followup with my changes that re-add WebRTC support.

@achingbrain achingbrain added need/author-input Needs input from the original author and removed need/analysis Needs further analysis before proceeding labels Feb 20, 2025
@achingbrain
Member

Here is the followup that re-enables WebRTC - Permissionless-Software-Foundation/ipfs-service-provider#169

@christroutner
Author

christroutner commented Feb 20, 2025

This is just the prod I needed. Thanks @achingbrain. I've been intending to update this thread the last few days.

I updated ipfs-service-provider to use helia v5.2.0, libp2p v2.6.2, and @libp2p/webrtc v5.1.0. The WebRTC and Circuit Relay stuff is working much better.

However, I'm still seeing the random AbortError. Sometimes it happens right after startup, sometimes it doesn't happen for hours. It seems completely random (which makes me think the root cause is a race condition).

I have however managed to catch it by adding this code snippet to the first JS file to get executed:

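// Catch-all for promise rejections that nothing awaited, so Node.js does not exit
process.on('unhandledRejection', (reason, promise) => {
  // reason is usually an Error but can be any value, so read .code defensively
  console.log(`Handling ${reason?.code} error`)
})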

That at least prevents it from crashing the entire app. There do not appear to be any negative side effects from handling the error as above. I still can't find the root cause, but it seems to be the same issue.

I'll update the code to print out the error and I'll try to add it to this thread, to see if the error stack has changed at all.

In the meantime, I'll review your PR and compare it to the changes I've already made.

@christroutner
Author

After updating all npm dependencies, I'm still seeing the AbortError randomly. Here is the stack from the latest error:

Handling ABORT_ERR error. stack:  AbortError: The operation was aborted
    at raceSignal (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/race-signal/dist/src/index.js:22:31)
    at YamuxStream.closeWrite (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/@libp2p/utils/dist/src/abstract-stream.js:231:19)
    at YamuxStream.close (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/@libp2p/utils/dist/src/abstract-stream.js:190:18)
    at stream.close (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/@libp2p/utils/dist/src/stream-to-ma-conn.js:15:15)
    at ConnectionImpl.close [as _close] (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/libp2p/dist/src/upgrader.js:426:30)
    at async ConnectionImpl.close (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/libp2p/dist/src/connection/index.js:118:13)
    at async initiateConnection (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/@libp2p/webrtc/dist/src/private-to-private/initiate-connection.js:148:17)
    at async WebRTCTransport.dial (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/@libp2p/webrtc/dist/src/private-to-private/transport.js:92:65)
    at async queue.add.peerId.peerId [as fn] (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/libp2p/dist/src/connection-manager/dial-queue.js:173:38) {
  type: 'aborted',
  code: 'ABORT_ERR'
}

@achingbrain
Member

achingbrain commented Feb 21, 2025

Do you know what the multiaddr is that your node is trying to dial? That might help me narrow it down a bit.

@christroutner
Author

christroutner commented Feb 21, 2025

Do you know what the multiaddr is that your node is trying to dial? That might help me narrow it down a bit.

No, not at the time the error occurs.

At a high level, when a new node is trying to connect to the network, it first connects to a handful of bootstrap nodes. It listens on an 'announcement' pubsub channel. When a node it hasn't seen before announces itself, the announcement object contains that node's multiaddrs. The node will go down the list of multiaddrs and try to connect to each one until it succeeds or reaches the end of the list.

Also, a timer will kick off every few minutes to try and connect to nodes it knows about and hasn't been able to connect to.

So the stage is set for a race condition. Everything is jumbled in production.

If the error were thrown within my code path, it would be caught. I would know exactly where in the code path the error happened and exactly which node and which transport it was using. But because this manifests as an AbortError that I have to catch in a general way, I can't isolate exactly what is causing the error. And there is no info in the stack to help me isolate the code path within my own app.
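
The connection strategy described above can be sketched roughly as follows. This is an illustration, not the application's actual code: `connectToFirstAvailable` is a hypothetical name, and `dial` is a hypothetical injected function standing in for whatever the app uses to open a connection. The sketch shows why a normal dial failure is easy to attribute to a specific multiaddr, while the AbortError in this thread is not: it never rejects the awaited promise, so it never reaches the `catch` block.

```javascript
// Try each multiaddr in order until one connects, recording every failure.
// `dial` is a hypothetical async function: (multiaddr) => Promise<connection>.
async function connectToFirstAvailable (multiaddrs, dial) {
  const failures = []

  for (const addr of multiaddrs) {
    try {
      const connection = await dial(addr)
      // Stop at the first address that works.
      return { connection, addr, failures }
    } catch (err) {
      // Errors that reject the awaited promise land here, tagged with the address.
      failures.push({ addr, code: err.code, message: err.message })
    }
  }

  // Reached the end of the list without a successful connection.
  return { connection: null, addr: null, failures }
}
```

An error raised on a promise that nothing awaits bypasses this loop entirely and surfaces only through the process-wide `unhandledRejection` event, which carries no address information.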

@achingbrain
Member

Ok, I think I've figured out what's happening.

  1. A new connection to a WebRTC address is initiated
  2. The dialing peer dials the relay and opens a new connection (i.e. one did not exist before)
  3. The SDP handshake times out, so the abort signal fires its "abort" event and is now aborted
  4. The dialing peer gives up and closes the connection
  5. The stream muxer closes all streams on the connection
  6. Each stream races closing the read and write ends of the stream against the (aborted) signal
  7. race-signal notices the signal is aborted and immediately returns a rejection
  8. The .closeWrite method also rejects, because it uses the same (aborted) signal
  9. The promise that the muxer is racing against the (aborted) signal has nothing awaiting it and so 💥
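
Steps 6 to 9 can be illustrated with a small self-contained sketch. None of this is the real race-signal or libp2p code: `raceSignalSketch`, `raceSignalSafeSketch`, `closeWriteSketch`, and `reproduce` are simplified, hypothetical stand-ins. The point is only that if the racing helper returns early on an already-aborted signal without attaching a handler to the promise it was given, that promise's later rejection is reported as unhandled.

```javascript
function abortError () {
  return Object.assign(new Error('The operation was aborted'), { code: 'ABORT_ERR' })
}

// Simplified stand-in for a signal race: returns early if already aborted.
function raceSignalSketch (promise, signal) {
  if (signal.aborted) {
    // Step 7: immediate rejection. Note that `promise` gets no handler here.
    return Promise.reject(abortError())
  }
  return new Promise((resolve, reject) => {
    const onAbort = () => reject(abortError())
    signal.addEventListener('abort', onAbort, { once: true })
    promise.then(resolve, reject)
      .finally(() => signal.removeEventListener('abort', onAbort))
  })
}

// One possible mitigation (an assumption, not necessarily the actual fix):
// mark the input promise as handled before racing.
function raceSignalSafeSketch (promise, signal) {
  promise.catch(() => {})
  return raceSignalSketch(promise, signal)
}

// Step 8: closing the write end rejects because the signal is already aborted.
async function closeWriteSketch (signal) {
  if (signal.aborted) throw abortError()
}

// Runs the scenario and resolves to the codes of any unhandled rejections.
async function reproduce (raceImpl) {
  const unhandled = []
  const onUnhandled = (reason) => { unhandled.push(reason.code) }
  process.on('unhandledRejection', onUnhandled)

  const controller = new AbortController()
  controller.abort() // step 3: the signal is now aborted

  // Steps 5 and 6: the muxer closes all streams and races that against the signal.
  const closeAll = Promise.all([closeWriteSketch(controller.signal)])
  await raceImpl(closeAll, controller.signal).catch(() => {}) // caller handles this one

  // Give Node.js a turn of the event loop to report unhandled rejections.
  await new Promise((resolve) => setTimeout(resolve, 10))
  process.off('unhandledRejection', onUnhandled)
  return unhandled
}
```

Calling `reproduce(raceSignalSketch)` should report one unhandled `ABORT_ERR`, even though the caller caught the rejection returned by the race; with `raceSignalSafeSketch` the list should be empty. Without an `unhandledRejection` listener, the first case is what terminates the process.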

This should be fixed by achingbrain/race-signal#64 released in [email protected]

5 participants