
Kubo stopped listing pin after some times #10596

Closed · Mayeu opened this issue Nov 25, 2024 · 39 comments · Fixed by #10746
Labels: kind/bug (A bug in existing code, including security flaws), need/maintainer-input (Needs input from the current maintainer(s))

Mayeu commented Nov 25, 2024

Checklist

Installation method

built from source

Version

Kubo version: 0.32.1
Repo version: 16
System version: amd64/linux
Golang version: go1.23.3

Config

{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": [],
    "AppendAnnounce": [],
    "Gateway": "/ip4/127.0.0.1/tcp/8080",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic-v1",
      "/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
      "/ip4/0.0.0.0/udp/4001/webrtc-direct",
      "/ip6/::/udp/4001/quic-v1",
      "/ip6/::/udp/4001/quic-v1/webtransport",
      "/ip6/::/udp/4001/webrtc-direct"
    ]
  },
  "AutoNAT": {},
  "AutoTLS": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic-v1/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/3",
            "sync": false,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "31TB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false
    }
  },
  "Experimental": {
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "OptimisticProvide": false,
    "OptimisticProvideJobsPoolSize": 0,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "DeserializedResponses": null,
    "DisableHTMLErrors": null,
    "ExposeRoutingAPI": null,
    "HTTPHeaders": {},
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": {
      "cfg": {
        "DeserializedResponses": null,
        "InlineDNSLink": null,
        "NoDNSLink": false,
        "Paths": null,
        "UseSubdomains": false
      },
      "localhost": {
        "DeserializedResponses": null,
        "InlineDNSLink": null,
        "NoDNSLink": false,
        "Paths": [
          "/ipfs"
        ],
        "UseSubdomains": false
      }
    },
    "RootRedirect": ""
  },
  "Identity": {
    "PeerID": "12D3KooWAJJJwXsB5b68cbq69KpXiKqQAgTKssg76heHkg6mo2qB"
  },
  "Import": {
    "CidVersion": null,
    "HashFunction": null,
    "UnixFSChunker": null,
    "UnixFSRawLeaves": null
  },
  "Internal": {
    "Bitswap": {
      "EngineBlockstoreWorkerCount": 128,
      "EngineTaskWorkerCount": 8,
      "MaxOutstandingBytesPerPeer": null,
      "ProviderSearchDelay": null,
      "TaskWorkerCount": 8
    }
  },
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": "removed for brevity"
  },
  "Pinning": {},
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "24h0m0s",
    "Strategy": "all"
  },
  "Routing": {
    "AcceleratedDHTClient": true,
    "Methods": null,
    "Routers": null,
    "Type": "auto"
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "30s",
      "HighWater": 4096,
      "LowWater": 1024
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "RelayClient": {
      "Enabled": false
    },
    "RelayService": {
      "Enabled": false
    },
    "ResourceMgr": {
      "Limits": {},
      "MaxMemory": "24GB"
    },
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  },
  "Version": {}
}

Description

Hello,

We started to experience an issue with Kubo in our 2-node cluster where Kubo no longer lists pins.

We have 2 nodes that each pin the full pinset we keep track of, which is around 16.39 million pins right now.

In recent weeks (while we were still using 0.29), Kubo stopped responding to the /pin/ls queries sent by the cluster; those requests were hanging "indefinitely" (as in, when using curl I stopped the command after ~16h without a response). Our ipfs-cluster process logs the following when this happens:

Nov 24 22:16:34 ipfs-01 ipfs-cluster-service[3875697]: 2024-11-24T22:16:34.328Z        ERROR        pintracker        stateless/stateless.go:540        Post "http://127.0.0.1:5001/api/v0/pin/ls?stream=true&arg=QmcPpbdw8k8fyhas77WjzPMqSAgVYPonbVve3xTPtxL8ab&type=recursive": context canceled
Nov 24 22:16:34 ipfs-01 ipfs-cluster-service[3875697]: 2024-11-24T22:16:34.328Z        ERROR        cluster        ipfs-cluster/cluster.go:2022        12D3KooTheOtherClusterNodeHash: error in broadcast response from 12D3KooTheOtherClusterNodeHash: context canceled
Nov 24 22:16:34 ipfs-01 ipfs-cluster-service[3875697]: 2024-11-24T22:16:34.447Z        ERROR        ipfshttp        ipfshttp/ipfshttp.go:1276        error posting to IPFS:Post "http://127.0.0.1:5001/api/v0/pin/ls?stream=true&arg=QmRSGNoqrbTsKCgHbi8xCeQEn4sQXFDjfNEUgXWG5wAg6U&type=recursive": context canceled
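
For reference, the call the cluster is making can be reproduced outside of ipfs-cluster with a plain HTTP client and an explicit timeout, so a hung listing fails fast instead of hanging for hours. A minimal Go sketch (not part of the original report; the endpoint and parameters are taken from the log lines above):

package main

import (
	"bufio"
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Same RPC endpoint the cluster posts to; streams the full recursive pinset.
	url := "http://127.0.0.1:5001/api/v0/pin/ls?stream=true&type=recursive"

	// Fail after 20 minutes instead of hanging indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err) // a hang before any response surfaces here as "context deadline exceeded"
	}
	defer resp.Body.Close()

	// The stream is one JSON object per line; counting lines approximates the pin count.
	count := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		count++
	}
	if err := scanner.Err(); err != nil {
		fmt.Println("stream aborted:", err) // a hang mid-stream shows up here instead
	}
	fmt.Println("pins listed:", count)
}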

This started out of the blue, there was no change on the server. The issue remained after upgrading to 0.32.1.

At that time, we had the bloom filter activated; deactivating it did improve the situation for a while (maybe 24h), and then the issue started to show up again. In retrospect, I think it may not be related to the bloom filter at all.

These are the typical metrics reported by ipfs-cluster, which show when Kubo stops responding to /pin/ls:

Screenshot 2024-11-25 at 10 31 12

The graph on top is the number of pins the cluster is keeping track of, and the one on the bottom is the number of pins reported by Kubo. When restarting Kubo it generally jumps back to the expected amount, and after a while it drops to 0. At that point any attempt to list pins from Kubo fails.

We only have the metrics reported by ipfs-cluster because of this Kubo bug.

The server's CPU, RAM, and disk utilization are fairly low when this issue shows up, so it doesn't look like a performance issue. The only metric that goes out of bounds is the number of open file descriptors, which grows until it reaches the 128k limit we set. I bumped it to 1.28 million, but it still reaches it (with or without the bloom filter):
Screenshot 2024-11-25 at 10 31 40

The FD limit is set both at the systemd unit level and via IPFS_FD_MAX.

Restarting Kubo makes it work again most of the time, but sometimes it doesn't change anything and it instantly starts to fail again.

Here is some profiling data from one of our nodes:

More info about the system:

  • NixOS, with the current version of Nixpkgs being the PR that updated Kubo to 0.32.1
  • AMD Ryzen 5 Pro 3600 - 6c/12t - 3.6 GHz/4.2 GHz
  • 128 GB ECC 2666 MHz
  • 2×512 GB SSD NVMe
    • one with the system
    • the other is split and is used as logs and cache for ZFS
  • one ZFS ZRAID-0 pool with 4×6 TB HDD SATA

Kubo also emits a lot of:

INFO        net/identify        identify/id.go:457        failed negotiate identify protocol with peer        {"peer": "12D3KooWMyK8arvRjtC33rxTRZfKDcyQgZTC9yWpfMFHRbpngXwK", "error": "Application error 0x1 (remote): conn-31160279: system: cannot reserve inbound connection: resource limit exceeded"}

But ipfs swarm resources doesn't return anything above 5-15%, so I think this error is actually on the remote node side and not related to our issue, right?

Anything else we could gather to help solve this issue?

Right now I'm out of ideas to get our cluster back into a working state (besides restarting Kubo every 2h, but that's not a solution since it prevents us from reproviding the pins to the rest of the network).

Edit with additional info:

  • Kubo is launched without the --enable-gc flag, as prescribed by the ipfs-cluster docs.
Mayeu added the kind/bug and need/triage labels on Nov 25, 2024
11qu1d commented Nov 25, 2024

I can confirm we are seeing the same behaviour while testing the ipfs/kubo:v0.32.1 Docker image in k8s paired with ipfs-cluster (smaller test cluster, ~17k pins).

Logs from ipfs-cluster:

2024-11-25T15:01:17.201Z	ERROR	ipfshttp	ipfshttp/ipfshttp.go:1276	error posting to IPFS:Post "http://localhost:5001/api/v0/pin/ls?stream=true&type=recursive": context deadline exceeded
2024-11-25T15:01:17.201Z	ERROR	pintracker	stateless/stateless.go:364	could not get pinset from IPFS: Post "http://localhost:5001/api/v0/pin/ls?stream=true&type=recursive": context deadline exceeded
2024-11-25T15:01:17.201Z	ERROR	pintracker	stateless/stateless.go:570	could not get pinset from IPFS: Post "http://localhost:5001/api/v0/pin/ls?stream=true&type=recursive": context deadline exceeded

No errors or other logs observed in Kubo.

Also, calling the API endpoint or running ipfs pin ls --stream in Kubo just hangs until the pod is restarted. After a while it stops working again.

hsanjuan self-assigned this Nov 25, 2024

hsanjuan (Contributor) commented:

Hello, per the details you provided, I think leveldb exploded in some way. How many files are there in the .ipfs/leveldb folder? Chances are, it is compacting... or was compacting and it was stopped.

Do you have monitoring for disk-reads? Is it trying to read/write a lot from disk even when nothing is being added?

I would recommend switching leveldb to pebble (so, flatfs + pebble). You will lose the pinset (not the blocks, just the list of things you have pinned), but cluster will add the pins again in time.

gammazero added the need/author-input label and removed the need/triage label on Dec 10, 2024

Mayeu (Author) commented Dec 12, 2024

How many files are there in the .ipfs/leveldb folder?

$ ls -l /ipfs-data/node/datastore/*.ldb | wc -l
1917

Do you have monitoring for disk-reads? Is it trying to read/write a lot from disk even when nothing is being added?

We are running Netdata on the node so if you know of a particular metric that could be interesting I can take a look specifically at those.

In the meantime, this is the reported number of pins for the Kubo node between the 5th and 9th of December:
Screenshot 2024-12-12 at 12 27 28

Disk bandwidth at that time:
Screenshot 2024-12-12 at 12 25 01

and Disk utilization per systemd service:
Screenshot 2024-12-12 at 12 26 20

So less data is read when Kubo is stuck, but the total number of read & write operations doesn't seem to be impacted.

I would recommend switching leveldb to pebble (so, flatfs + pebble). You will lose the pinset (not the blocks, just the list of things you have pinned), but cluster will add the pins again in time.

I was thinking of that since I saw that pebble was added to Kubo in a previous version.

Just to confirm the process, that would go like so:

  1. Stop Kubo
  2. Update the datastore configuration to replace the levelds store type with a pebble store (see the sketch after this list)
  3. Delete the previous datastore data
  4. Restart the node
  5. ipfs-cluster will then ask Kubo to repin everything, which won't download anything since the data is still on disk.
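
A sketch of what the Datastore.Spec from step 2 might look like after the switch, keeping the flatfs mount for blocks and replacing the leveldb mount; the "pebbleds" type and the prefix/path names are assumptions to double-check against Kubo's datastore documentation, and the repo's datastore_spec file has to be updated to match:

{
  "mounts": [
    {
      "child": {
        "path": "blocks",
        "shardFunc": "/repo/flatfs/shard/v1/next-to-last/3",
        "sync": false,
        "type": "flatfs"
      },
      "mountpoint": "/blocks",
      "prefix": "flatfs.datastore",
      "type": "measure"
    },
    {
      "child": {
        "path": "pebbleds",
        "type": "pebbleds"
      },
      "mountpoint": "/",
      "prefix": "pebble.datastore",
      "type": "measure"
    }
  ],
  "type": "mount"
}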

hsanjuan (Contributor) commented:

Yes, when switching you will need to edit datastore_spec. You can also set pebble to live in a different folder (not the datastore folder used by leveldb). I would keep that folder around just in case you need to go back.

Do you use MFS for anything?

Also, regarding the ipfs-pins graph, it goes to 0 because of ipfs-cluster/ipfs-cluster#2122. From now on it will stay at the last reported amount from when pin ls worked.

Even if it won't need to download data, it will need to add 16M items to the datastore, and pinning will make it traverse everything it has.

Mayeu (Author) commented Dec 13, 2024

Yes, when switching you will need to edit datastore_spec. You can also set pebble to live in a different folder (not the datastore folder used by leveldb). I would keep that folder around just in case you need to go back.

Thank you, I'm going to try that today on one of our node.

Do you use MFS for anything?

No, we don't use MFS; we only add new pins via the cluster API, and when needed we access our data via Kubo's gateway using the CIDs. As far as I understand, this doesn't involve the MFS subsystem.

[...] the ipfs-pins graph, it goes to 0 because of ipfs-cluster/ipfs-cluster#2122. [...]

Good to know, thank you 👍

Mayeu (Author) commented Dec 13, 2024

I have switched one of our nodes to the pebble datastore; right now it is slowly adding the whole pinset back into pebble.

ehsan6sha commented:

Hi, did changing it to pebble solve the issue?

lidel added and removed the need/author-input label on Dec 17, 2024

Mayeu (Author) commented Dec 18, 2024

@ehsan6sha still too soon to tell. Our node is still adding the data back into the pebble store. It has only caught up on 50% of the previous pins right now.

github-actions bot commented:

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days.

lidel removed the need/author-input and kind/stale labels on Dec 25, 2024

lidel (Member) commented Dec 25, 2024

@Mayeu how does it look now? (assuming it finished re-pinning)

lidel added the need/author-input label on Dec 25, 2024

github-actions bot commented Jan 1, 2025

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days.

Mayeu (Author) commented Jan 6, 2025

Hello, I'm just getting back to work after my end of the year vacation, so first, happy new year to you all :)

On the matter, it seems that something unexpected happened during my break:

Screenshot 2025-01-06 at 13 46 16

Around the 22nd of December, the pinning stopped at ~14.18M (out of 16.51M at that point) and then dropped to 0 until yesterday, when it spiked back to the previous point and continued to pin. Right now there are still 0.7M pins that need to be added to the DB.

I'm currently going through our logs and I'll report back with what I can find there.

Mayeu (Author) commented Jan 7, 2025

Sadly, I can't find anything, because the amount of logs produced meant that the logs from before the spike were purged 🤦🏻‍♀️

I'm updating our log retention config for that machine to keep much (much) more logs, and "hope" to see that drop again.

On the bright side, this node is now fully caught up with the cluster state, so we'll see if this issue shows up again. But since the 4th of December, our first node (which is still using LevelDB) hasn't experienced that issue.

hsanjuan (Contributor) commented Jan 7, 2025

As mentioned, you cannot trust that graph much due to the bug I pointed out above... you are better off tracking the "pending" items (queued, pinning, error) and comparing that to the total items in the cluster pinset, rather than using this metric right now.

hsanjuan (Contributor) commented Jan 7, 2025

(and happy new year!)

gammazero (Contributor) commented:

@Mayeu We will wait for another week, and assume the issue is resolved if we do not hear from you.

Mayeu (Author) commented Jan 8, 2025

As mentioned, you cannot trust that graph much due to the bug I pointed out above... you are better off tracking the "pending" items (queued, pinning, error) and comparing that to the total items in the cluster pinset, rather than using this metric right now.

@hsanjuan right, I forgot about that. We do gather those as well.

Here they are for the past 60 days:

Pin queued:

Screenshot 2025-01-08 at 11 58 56

Error:

Screenshot 2025-01-08 at 11 48 59

Pinning:

Screenshot 2025-01-08 at 11 49 09

Just a reminder of the timeline:

  • We realized that we were encountering the issue around the 15th of November, and started to change the node configurations and thus restart the nodes regularly (one of those changes being switching the number of concurrent pins from 100 to 50).
  • I switched our second node to using the pebble datastore on the 13th of December.

For comparison, here are the graphs for our first node (still using LevelDB), which didn't experience as many issues (it still encountered the listing issue, but for some reason it stabilized pretty quickly after we realized there was a problem):

Error and Pinning:

Screenshot 2025-01-08 at 11 29 45

Queued:

Screenshot 2025-01-08 at 11 30 14

hsanjuan (Contributor) commented:

Hello, @Mayeu .

The logs should print info messages of the sort "Full pinset listing finished". How long does it take to list the pinset when it works?

The ipfs-cluster config has an ipfshttp.ipfs_request_timeout option. If your errors come from cluster logs, it may be that the pin ls requests are timing out. Given the number of pins that you have, it will likely take more than the default of 5 minutes.
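
For reference, that option lives in the cluster's service.json under the IPFS connector section; a trimmed sketch with an illustrative 30-minute value (field placement per ipfs-cluster's configuration reference, other fields omitted):

{
  "ipfs_connector": {
    "ipfshttp": {
      "node_multiaddress": "/ip4/127.0.0.1/tcp/5001",
      "ipfs_request_timeout": "30m0s"
    }
  }
}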

If it is not that, and you can reproduce this by calling ipfs pin ls --type recursive directly, then we need the kubo profiles to see why it is deadlocked.

@Mayeu
Copy link
Author

Mayeu commented Jan 23, 2025

Sorry for the delay here, I finally got to check the data that was gathered last week.

The ipfs-cluster config has an ipfshttp.ipfs_request_timeout option. [...]

We have set this timeout to 20 minutes in our case.

Here is one profile that was triggered while a manual pin/ls call was hanging, taken ~1 min 50 s after the error appeared in the log (so after the 20 min timeout): https://drive.google.com/file/d/1eFb1rHmPKDHvdP-1HEjyUvdweN-JVIvt/view?usp=sharing

At the longest, some requests took 14 minutes (after the log entry) to respond. But at that point the partition was full of profiles, so I don't have a profile matching those requests. I'll try again if you want.

The logs should print info messages of the sort "Full pinset listing finished". How long does it take to list the pinset when it works?

There isn't any log at the start of the listing process, right? The earliest I can find is when the process is around 500k pins.

On our node using pebble it seems to take around 1m:

Jan 22 00:48:04 ipfs-02 ipfs-cluster-service[2716182]: 2025-01-22T00:48:04.502Z        INFO        dsstate        dsstate/datastore.go:196        Full pinset listing in progress: 500000 pins so far
Jan 22 00:48:06 ipfs-02 ipfs-cluster-service[2716182]: 2025-01-22T00:48:06.554Z        INFO        dsstate        dsstate/datastore.go:196        Full pinset listing in progress: 1000000 pins so far
[...]    
Jan 22 00:49:01 ipfs-02 ipfs-cluster-service[2716182]: 2025-01-22T00:49:01.603Z        INFO        dsstate        dsstate/datastore.go:201        Full pinset listing finished: 16630186 pins

On our node still using leveldb, this takes around 45s:

Jan 23 10:20:42 ipfs-cluster-service[3941117]: 2025-01-23T10:20:42.921Z        INFO        dsstate        dsstate/datastore.go:196        Full pinset listing in progress: 500000 pins so far
Jan 23 10:20:44 ipfs-01 ipfs-cluster-service[3941117]: 2025-01-23T10:20:44.117Z        INFO        dsstate        dsstate/datastore.go:196        Full pinset listing in progress: 1000000 pins so far
[...]
Jan 23 10:21:24 ipfs-01 ipfs-cluster-service[3941117]: 2025-01-23T10:21:24.420Z        INFO        dsstate        dsstate/datastore.go:201        Full pinset listing finished: 16633864 pins

hsanjuan (Contributor) commented:

I requested access to the profile. If you manage to get another manual ipfs pin ls --stream true --type=recursive hanging for like 10 minutes, that would be good too.

Mayeu (Author) commented Jan 28, 2025

Ah sorry, I thought I had shared a public link; I updated the sharing rights so anyone with the link can access it.

I'll start the script again to try to get a profile for a request that hangs for longer.

hsanjuan (Contributor) commented:

@Mayeu the goroutine.stacks file shows 218 goroutines calling PinLs(cid?); they have all been waiting for 2 minutes for read access.

There is a goroutine listing pins at the same time.

And there are 5 waiting for a write Lock to pin something, which they cannot get while the listing is ongoing.

I don't see anything weird at the pebble level; it seems to be reading as requested by pinLs. Do you think the ongoing pin/ls request is returning any data, or just hanging? Any logs showing on the cluster?


github-actions bot commented Feb 4, 2025

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days.

Mayeu (Author) commented Feb 5, 2025

they have all been waiting for 2 minutes for read access.

That's surprising: by the time a log entry is triggered, the cluster has already been waiting for its 20 min timeout. Or maybe I'm missing something?

Do you think the ongoing pin/ls request is returning any data, or just hanging?

The few times I witnessed the issue directly, no data was returned for a while; it was just hanging, waiting for a response from Kubo.

"a while" spanning between 30s to a few hour with the Pebble store, and more than 24h when we initially had the issue with LevelDB.

Any logs showing on the cluster?

At the time this happens, the cluster only shows the "context canceled" error:

error posting to IPFS:Post "http://127.0.0.1:5001/api/v0/pin/ls?stream=true&arg=QmRSGNoqrbTsKCgHbi8xCeQEn4sQXFDjfNEUgXWG5wAg6U&type=recursive": context canceled

I restarted the script to get a profile when this error shows up in the log. I improved it a bit, because previously multiple profiles could have been running at the same time, and maybe that could have influenced the running time of pin/ls. I'll check tomorrow and report back.

Side note: both nodes have been upgraded to 0.33.0; I'm waiting for 0.33.1 to hit nixpkgs to upgrade to that version.

hsanjuan (Contributor) commented:

Any news @Mayeu ?

Mayeu (Author) commented Feb 14, 2025

Hello,

For the past few days the lock-ups were shorter than during my previous data-gathering attempt; the longest I got was this profile, during which a manual call to pin/ls was hanging for a bit more than 9 minutes based on the output of the script.

guillaumemichel added the need/maintainer-input label and removed the need/author-input label on Feb 18, 2025

hsanjuan (Contributor) commented:

I think I have something... The reprovider is running, and apparently very slowly, since it has been waiting to send pins for 284 minutes.

It is holding a read lock on the dspinner. However, there are also some writers waiting to write on the same lock (doPinRecursive). This prevents more read locks from being granted (judging by what it says here https://stackoverflow.com/questions/46786900/the-implementions-of-rwmutex-lock-in-golang, which matches what the stacks show).

As a result, everything hangs. I will study how we can address this, given that you already have the accelerated DHT client enabled (I assume you need DHT providing for every block; otherwise it is best to disable it or change the strategy to "roots"). At your pinset size, DHT providing IS a bottleneck.
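
This is standard sync.RWMutex behaviour: once a writer is queued, new readers block until that writer has acquired and released the lock. A small self-contained Go sketch (illustrative only, not the Kubo code) that reproduces the starvation pattern described above:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex // stands in for the dspinner pinset lock

	// "Reprovider": grabs a read lock and holds it for a long time.
	go func() {
		mu.RLock()
		time.Sleep(3 * time.Second) // simulates a slow pinset walk
		mu.RUnlock()
	}()
	time.Sleep(100 * time.Millisecond) // let the slow reader acquire the lock

	// "Pin add" (doPinRecursive): a writer arrives and starts waiting.
	go func() {
		mu.Lock() // blocks until the slow reader releases
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond) // let the writer queue up

	// "pin ls": a new reader now blocks too, even though only a reader
	// currently holds the lock, because a writer is queued ahead of it.
	start := time.Now()
	mu.RLock()
	fmt.Printf("second reader waited %v\n", time.Since(start)) // ~3s, not ~0
	mu.RUnlock()
}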

hsanjuan (Contributor) commented:

Also explained here: https://pkg.go.dev/sync#RWMutex

Mayeu (Author) commented Feb 25, 2025

Thank you for spending your time on this 🙏

I assume you need dht providing for every block, otherwise best to disable or change strategy to "roots"

Good insight, I don't think we wish to do the reproviding at the block level.

For CIDs that point to a file, we definitely don't care about reproviding at the block level; the roots will be enough.

But how would "roots" behave for CIDs that point to a folder? If somebody requests only one file in that folder, will our server still answer that request or will it only answer requests to get the whole folder?

hsanjuan (Contributor) commented Mar 3, 2025

But how would "roots" behave for CIDs that point to a folder? If somebody requests only one file in that folder, will our server still answer that request or will it only answer requests to get the whole folder?

It could be that, if you are only announcing the folder's CID and people request a file by the file's own CID, no routing to that file is discovered. If people request <folder_cid>/filename.jpg, things would likely work.

If the reprovider does not announce a particular CID because it is set to roots, things may still work if discovery happens via bitswap. For that to happen, the clients looking for the content need to be connected to the server in advance so that they ask it directly for the content.
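
For context, the strategy being discussed is the Reprovider.Strategy field already visible in the config at the top of this issue; switching it from "all" to "roots" (for example with ipfs config Reprovider.Strategy roots followed by a daemon restart) would look like this, shown only as an illustration since whether it is appropriate depends on the access patterns described above:

{
  "Reprovider": {
    "Interval": "24h0m0s",
    "Strategy": "roots"
  }
}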

hsanjuan added a commit to ipfs/boxo that referenced this issue Mar 5, 2025
This provider helper allows buffering all the results from a keyChanFunc in memory.

The purpose is to fix issues with slow re-providing. In the case of
ipfs/kubo#10596, the slow re-providing process
causes starvation of any operations trying to read/write to the pinset.

With the new buffered KeyChanFunc, we can read everything we need to announce
into memory first, and free any locks as soon as possible.

Given the compact size of CIDs (<50 bytes), I don't think any more complex approach (batch reading and lock/unlock handling) is justified. People with pinsets of 20 million items who want to announce everything can well spare an extra GB of RAM. For smaller repos the impact becomes negligible.

The test targets precisely the use-case and ensures we don't starve operations while interacting with dspinner.
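
In rough terms, the helper behaves like the wrapper below — a hedged sketch rather than the actual boxo code, assuming the KeyChanFunc signature func(context.Context) (<-chan cid.Cid, error) used by boxo's provider package; the NewBufferedProvider name here is illustrative:

package bufferedprovider

import (
	"context"

	"github.com/ipfs/go-cid"
)

// KeyChanFunc mirrors the signature used by boxo's provider package.
type KeyChanFunc func(context.Context) (<-chan cid.Cid, error)

// NewBufferedProvider wraps a KeyChanFunc so that the underlying source
// (e.g. the pinset walk that holds the dspinner read lock) is drained into
// memory up front. The lock can then be released quickly, and the reprovide
// proceeds at its own pace from the in-memory copy.
func NewBufferedProvider(inner KeyChanFunc) KeyChanFunc {
	return func(ctx context.Context) (<-chan cid.Cid, error) {
		src, err := inner(ctx)
		if err != nil {
			return nil, err
		}

		// Drain the source as fast as possible: at <50 bytes per CID this is
		// roughly 1 GiB per 20M pins, as noted in the commit message.
		var buf []cid.Cid
		for c := range src {
			buf = append(buf, c)
		}

		// Replay the buffered CIDs on a fresh channel.
		out := make(chan cid.Cid)
		go func() {
			defer close(out)
			for _, c := range buf {
				select {
				case out <- c:
				case <-ctx.Done():
					return
				}
			}
		}()
		return out, nil
	}
}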
hsanjuan added a commit that referenced this issue Mar 6, 2025
Fixes #10596.

The reproviding process can take a long time. Currently, each CID to be provided is
obtained by making a query to the pinner and reading one by one as the CIDs
get provided.

While this query is ongoing, the pinner holds a Read mutex on the pinset.

If a pin-add-request arrives, a goroutine will start waiting for a Write mutex
on the pinset. From that point, no new Read mutexes can be taken until the writer
can proceed and finishes.

However, no one can proceed because the read mutex is still held while the
reproviding is ongoing.

The fix is mostly in Boxo, where we add a "buffered" provider which reads the
CIDs into memory so that they can be provided at their own pace without making
everyone wait.

The consequence is that we will need more RAM. The rule of thumb is 1 GiB extra per 20M CIDs to be reprovided.
hsanjuan (Contributor) commented Mar 7, 2025

There's a fix at #10746, and it would be good if someone could confirm improvements. I am testing myself as well on our clusters.

hsanjuan (Contributor) commented:

Fix now on master

Mayeu (Author) commented Mar 12, 2025

@hsanjuan Hello, sorry I was on vacation last week and I'm only checking this issue right now. I'll try the master branch on our second server and report back any improvement in the number of reported errors 👍

lidel pushed a commit that referenced this issue Mar 14, 2025

(cherry picked from commit ba22102)