WebRTCTransport.dial AbortError #2702

christroutner · 2024-09-13T16:49:36Z

Version:
libp2p v1.9.1
Platform:
Linux hp-elitedesk01 5.15.0-91-generic #101~20.04.1-Ubuntu SMP Thu Nov 16 14:22:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Subsystem:
WebRTC

Severity:

Critical - System crash, application panic.

Description:

I previously filed an issue about problems I was having with the @libp2p/webrtc package. That was resolved; the current package versions can be seen here, and the code for initializing libp2p can be found here.

I'm now encountering what appears to be a race condition inside the WebRTC libraries. The node runs for a while and then crashes at a random point with the following error message:

file:///home/safeuser/ipfs-service-provider/node_modules/race-signal/dist/src/index.js:22
        return Promise.reject(new AbortError(opts?.errorMessage, opts?.errorCode, opts?.errorName));
                              ^


AbortError: The operation was aborted
    at raceSignal (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/race-signal/dist/src/index.js:22:31)
    at YamuxStream.closeWrite (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/@libp2p/utils/dist/src/abstract-stream.js:230:19)
    at YamuxStream.close (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/@libp2p/utils/dist/src/abstract-stream.js:189:18)
    at file:///home/safeuser/ipfs-bch-wallet-service/node_modules/libp2p/dist/src/connection/index.js:118:63
    at Array.map (<anonymous>)
    at ConnectionImpl.close (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/libp2p/dist/src/connection/index.js:118:44)
    at initiateConnection (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/@libp2p/webrtc/dist/src/private-to-private/initiate-connection.js:146:34)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async WebRTCTransport.dial (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/@libp2p/webrtc/dist/src/private-to-private/transport.js:93:65)
    at async DefaultTransportManager.dial (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/libp2p/dist/src/transport-manager.js:87:20)
    at async queue.add.peerId.peerId [as fn] (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/libp2p/dist/src/connection-manager/dial-queue.js:168:38)
    at async raceSignal (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/race-signal/dist/src/index.js:28:16)
    at async Job.run (file:///home/safeuser/ipfs-bch-wallet-service/node_modules/@libp2p/utils/dist/src/queue/job.js:55:28) {
  type: 'aborted',
  code: 'ABORT_ERR'
}

Node.js v20.17.0

Steps to reproduce the error:

The error does not occur right away. It will appear at some point within 30 minutes while the node is running. It forces the app to crash and the process manager will restart it. But then the crash will happen again within 30 minutes.


christroutner · 2024-09-14T21:39:45Z

This might be the same issue I reported in #2462. I'll take a closer look by replacing my node_modules and package-lock.json files and report back here.

However, I don't think this is the same issue, as I'm building the application into a Docker container with the --no-cache flag, so it should be installing the node_modules folder from scratch. But the package-lock.json file would be copied from the repository, so maybe that is the issue.

I'll report back on my findings.

christroutner · 2024-09-17T19:12:39Z

I carefully deleted my node_modules folder and package-lock.json file before installing dependencies and I'm still getting the above error. As far as I can see it does not have anything to do with an unclean install as was claimed in #2462.

The main target that I'm testing is a libp2p node set up as a Circuit Relay server.

Chomtana · 2024-09-20T08:09:54Z

TURN works for regular internet connections across countries without this error, but it doesn't function properly with restrictive VPNs. This error indicates that WebRTC has failed to establish a connection with the peer.

christroutner · 2024-09-20T15:59:34Z

I wouldn't mind if webRTC fails to connect, but this error causes the application to crash and exit, and there doesn't seem to be any way to wrap it with try/catch to handle the exception.

cristianmadularu · 2024-10-24T14:52:03Z

This is happening for us as well causing our Node processes to crash.

cristianmadularu · 2024-10-24T23:44:20Z

> I wouldn't mind if webRTC fails to connect, but this error causes the application to crash and exit, and there doesn't seem to be any way to wrap it with try/catch to handle the exception.

@christroutner while this is not a 'solution' (more of a temporary workaround), you might consider an application-level handler that stops the application from crashing when this type of exception goes unhandled.
It's a risky approach, since there is no guarantee that the app is still in a good state, but it is an ugly workaround nevertheless until this gets fixed.
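
For illustration, an application-level handler along these lines could tolerate only abort-shaped rejections and still exit on anything else, which limits the risk of running in a bad state. This is a minimal sketch, not a libp2p API: `isAbortError` and `classifyRejection` are hypothetical helper names, and the check assumes the error carries the `ABORT_ERR` code or `AbortError` name seen in the stack trace above.

```javascript
// Hypothetical helper: true for errors shaped like the AbortError in the trace above.
function isAbortError (reason) {
  return reason != null && (reason.code === 'ABORT_ERR' || reason.name === 'AbortError')
}

// Hypothetical helper: tolerate aborts, treat every other unhandled rejection as fatal.
function classifyRejection (reason) {
  return isAbortError(reason) ? 'ignore' : 'exit'
}

process.on('unhandledRejection', (reason) => {
  if (classifyRejection(reason) === 'ignore') {
    // Log the stack so the offending code path can still be investigated.
    console.warn('Ignoring unhandled abort:', reason.stack ?? reason)
    return
  }

  // Anything else keeps the default "crash and let the process manager restart" behaviour.
  console.error('Unhandled rejection, exiting:', reason)
  process.exit(1)
})
```

Keeping the decision in a separate function makes it easy to unit test and to widen or narrow later without touching the process-level listener.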

christroutner · 2024-10-25T15:23:37Z

I appreciate the tip @cristianmadularu.

I ended up just disabling WebRTC in my application until this issue can be resolved. It would be great to have, but it's not a core requirement.

silkroadnomad · 2025-01-25T17:46:28Z

@christroutner if I remove WebRTC from the transports on Node.js, can browsers still use circuit-relay, autonat and dcutr to connect peer-to-peer (via WebRTC) to each other when both connect via wss or WebTransport?

christroutner · 2025-01-26T02:25:32Z

My understanding is that if you remove WebRTC, then circuit-relay is not possible. I don't know much about the other protocols mentioned in your question.

achingbrain · 2025-02-03T13:05:00Z

Browsers can listen on circuit relay addresses where the relayed connection is established over WebSockets/WebTransport, but any incoming connections will be time/data limited so it's only useful under certain conditions.

For two browsers to upgrade to an unlimited direct connection you need WebRTC.

While investigating libp2p/js-libp2p#2702 I've had this running for almost 12 hours without a crash. The only changes I've made are to upgrade the libp2p/Helia deps and to enable the WebRTC/WebRTC Direct transports and add a WebRTC Direct listener. This PR just upgrades the Helia/libp2p deps.

achingbrain · 2025-02-20T17:45:55Z

> Steps to reproduce the error:
>
> The error does not occur right away. It will appear at some point within 30 minutes while the node is running. It forces the app to crash and the process manager will restart it. But then the crash will happen again within 30 minutes.

@christroutner I've been running the ipfs-service-provider all day and haven't seen a single crash.

The deps were quite out of date so that might have something to do with it. I've opened a PR (Permissionless-Software-Foundation/ipfs-service-provider#168) that updates them.

I will open a followup with my changes that re-add WebRTC support.

achingbrain · 2025-02-20T17:53:50Z

Here is the followup that re-enables WebRTC - Permissionless-Software-Foundation/ipfs-service-provider#169

christroutner · 2025-02-20T18:34:57Z

This is just the prod I needed. Thanks @achingbrain. I've been intending to update this thread the last few days.

I updated ipfs-service-provider to use helia v5.2.0, libp2p v2.6.2, and @libp2p/webrtc v5.1.0. The WebRTC and Circuit Relay stuff is working much better.

However, I'm still seeing the random AbortError. Sometimes it happens right after startup, sometimes it doesn't happen for hours. It seems completely random (which makes me think the root cause is a race condition).

I have however managed to catch it by adding this code snippet to the first JS file to get executed:

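// Log unhandled promise rejections instead of letting them crash the process
process.on('unhandledRejection', (reason, promise) => {
  // reason may not be an Error, so guard the property access
  console.log(`Handling ${reason?.code} error`)
})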

That at least prevents it from crashing the entire app. There do not appear to be any negative side effects to handling the error as above. I still can't find the root cause, but it seems to be the same issue.

I'll update the code to print out the error and I'll try to add it to this thread, to see if the error stack has changed at all.

In the meantime, I'll review your PR and compare it to the changes I've already made.

christroutner · 2025-02-21T04:55:32Z

After updating all npm dependencies, I'm still seeing the AbortError randomly. Here is the stack from the latest error:

Handling ABORT_ERR error. stack:  AbortError: The operation was aborted
    at raceSignal (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/race-signal/dist/src/index.js:22:31)
    at YamuxStream.closeWrite (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/@libp2p/utils/dist/src/abstract-stream.js:231:19)
    at YamuxStream.close (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/@libp2p/utils/dist/src/abstract-stream.js:190:18)
    at stream.close (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/@libp2p/utils/dist/src/stream-to-ma-conn.js:15:15)
    at ConnectionImpl.close [as _close] (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/libp2p/dist/src/upgrader.js:426:30)
    at async ConnectionImpl.close (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/libp2p/dist/src/connection/index.js:118:13)
    at async initiateConnection (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/@libp2p/webrtc/dist/src/private-to-private/initiate-connection.js:148:17)
    at async WebRTCTransport.dial (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/@libp2p/webrtc/dist/src/private-to-private/transport.js:92:65)
    at async queue.add.peerId.peerId [as fn] (file:///home/trout/work/psf/code/ipfs-service-provider/node_modules/libp2p/dist/src/connection-manager/dial-queue.js:173:38) {
  type: 'aborted',
  code: 'ABORT_ERR'
}

achingbrain · 2025-02-21T13:13:47Z

Do you know what the multiaddr is that your node is trying to dial? That might help me narrow it down a bit.

christroutner · 2025-02-21T15:06:54Z

> Do you know what the multiaddr is that your node is trying to dial? That might help me narrow it down a bit.

No, not at the time the error occurs.

At a high level, when a new node is trying to connect to the network, it first connects to a handful of bootstrap nodes. It listens on an 'announcement' pubsub channel. When a node it hasn't seen before announces itself, the announcement object contains multiaddrs. The node will go down the list of multiaddrs and try to connect to each one until it succeeds or reaches the end of the list.
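
As a rough sketch of that loop (not the actual ipfs-service-provider code), the function below tries each multiaddr in turn and records which address each failure belonged to. `connectToFirstReachable` and `dialFn` are hypothetical names; `dialFn` stands in for whatever dials a single multiaddr.

```javascript
// Hypothetical sketch: try each multiaddr in order until one dial succeeds.
// `dialFn` is an assumed async function that dials one multiaddr and
// resolves with a connection or rejects with an error.
async function connectToFirstReachable (multiaddrs, dialFn) {
  const failures = []

  for (const addr of multiaddrs) {
    try {
      const connection = await dialFn(addr)
      // Stop at the first successful dial.
      return { addr, connection, failures }
    } catch (err) {
      // Record which address failed so errors can be tied back to a dial attempt.
      failures.push({ addr, code: err.code, message: err.message })
    }
  }

  // Reached the end of the list without connecting.
  return { addr: null, connection: null, failures }
}
```

Errors that reject the awaited dial are caught here and tied to an address. A rejection on a promise that nothing awaits never reaches this try/catch, which is why it only shows up in a process-level handler.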

Also, a timer will kick off every few minutes to try and connect to nodes it knows about and hasn't been able to connect to.

So the stage is set for a race condition. Everything is jumbled in production.

If the error was thrown within the code path, it would be caught. I would know exactly where in the code path the error happened and exactly which node and which transport it was using. But because this is manifesting as an AbortError that I have to catch in a general way, I can't isolate exactly what is causing the error. And there is no info in the stack to help me isolate the code path within my own app.

achingbrain · 2025-02-21T16:48:26Z

Ok, I think I've figured out what's happening.

1. A new connection to a WebRTC address is initiated
2. The dialing peer dials the relay and opens a new connection (i.e. one did not exist before)
3. The SDP handshake times out, so the abort signal fires its "abort" event and is now aborted
4. The dialing peer gives up and closes the connection
5. The stream muxer closes all streams on the connection
6. Each stream races closing the read and write ends of the stream against the (aborted) signal
7. race-signal notices the signal is aborted and immediately returns a rejection
8. The .closeWrite method also rejects because it uses the same (aborted) signal
9. The promise that the muxer is racing against the (aborted) signal has nothing awaiting it and so 💥
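
The last few steps can be reproduced in isolation. The sketch below is not the real race-signal code: `race`, `countUnhandled` and the `smother` option are hypothetical names. It assumes that attaching a no-op rejection handler to the losing promise is one way to avoid the crash, which is not necessarily what the linked fix does.

```javascript
// Hypothetical helper that mimics racing a promise against an AbortSignal.
// `smother` toggles whether the losing promise gets a no-op rejection handler.
function race (promise, signal, { smother }) {
  if (signal.aborted) {
    if (smother) promise.catch(() => {})
    // Reject immediately; without smothering, `promise` is left with no handler.
    return Promise.reject(new Error('The operation was aborted'))
  }
  return new Promise((resolve, reject) => {
    signal.addEventListener('abort', () => reject(new Error('The operation was aborted')), { once: true })
    promise.then(resolve, reject)
  })
}

// Collects the unhandled rejections seen while "closing" with an aborted signal.
async function countUnhandled (smother) {
  const seen = []
  const listener = (reason) => seen.push(reason.message)
  process.on('unhandledRejection', listener)

  const controller = new AbortController()
  controller.abort()

  // Stand-in for closeWrite(): it rejects because it uses the same aborted signal.
  const closeWrite = Promise.reject(new Error('closeWrite aborted'))

  // The caller handles the rejection returned by race(), as the dial code would.
  await race(closeWrite, controller.signal, { smother }).catch(() => {})

  // Give Node a turn of the event loop to report any unhandled rejections.
  await new Promise((resolve) => setTimeout(resolve, 20))
  process.off('unhandledRejection', listener)
  return seen
}

const demoResult = (async () => ({
  naive: await countUnhandled(false),
  smothered: await countUnhandled(true)
}))()

demoResult.then((result) => console.log(result))
```

In the naive run, the caller catches the rejection returned by `race`, but the stand-in `closeWrite` rejection still surfaces as an unhandled rejection. With `smother` enabled it does not.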

This should be fixed by achingbrain/race-signal#64, released in a new version of race-signal.

christroutner added the need/triage Needs initial labeling and prioritization label Sep 13, 2024

This was referenced Sep 13, 2024

Tracking Issue - WebRTC direct in Node.js #2581

Closed

fix(webrtc): Disabling webRTC due to bug Permissionless-Software-Foundation/ipfs-service-provider#162

Merged

silkroadnomad mentioned this issue Nov 20, 2024

Issues with Yamux / race-signal ipfs/helia#554

Closed

achingbrain added need/analysis Needs further analysis before proceeding and removed need/triage Needs initial labeling and prioritization labels Feb 4, 2025

anhnd350309 mentioned this issue Feb 13, 2025

libp2p yamux errors drp-tech/ts-drp#260

Closed

achingbrain mentioned this issue Feb 19, 2025

fix: build as ESM not bundled CJS dozyio/gossipsub-simulator#2

Merged

achingbrain mentioned this issue Feb 20, 2025

deps: update libp2p/helia to v2 and v5 respectively Permissionless-Software-Foundation/ipfs-service-provider#168

Closed

achingbrain added need/author-input Needs input from the original author and removed need/analysis Needs further analysis before proceeding labels Feb 20, 2025

achingbrain mentioned this issue Feb 21, 2025

fix: update race-signal #2986

Merged


achingbrain closed this as completed in 2a3cec9 Feb 21, 2025

achingbrain closed this as completed in #2986 Feb 21, 2025

achingbrain mentioned this issue Feb 21, 2025

chore: release main #2987

Merged

This was referenced Feb 23, 2025

Updating dependencies and removing debugging logs. Permissionless-Software-Foundation/helia-coord#57

Merged

Helia v5 Permissionless-Software-Foundation/ipfs-service-provider#170

Merged

achingbrain mentioned this issue Feb 27, 2025

Where is this error being thrown? #2315

Closed

silkroadnomad mentioned this issue Mar 13, 2025

Voyager App Crash After Prolonged Runtime orbitdb/voyager#78

Closed
