Too many dead peers when querying for hash, should eventually prune after a certain time #19

Open
t-lin opened this issue May 15, 2020 · 4 comments
Assignees: t-lin
Labels: enhancement (New feature or request), investigate (This doesn't seem right (triage and analyze)), Low Priority

Comments

t-lin commented May 15, 2020

Similar issue affecting the ping monitor (PhysarumSM/monitoring#3).
Will fix it there and propagate the fix here later.

Does not seem to be related to #3. See further comments below.

t-lin self-assigned this May 15, 2020
t-lin added the enhancement (New feature or request) label May 15, 2020
t-lin added the question (Further information is requested) and investigate (This doesn't seem right (triage and analyze)) labels and removed the question label May 17, 2020

t-lin commented May 17, 2020

Upon closer look, this does not seem to be related to PhysarumSM/monitoring#3.
LCAManager's FindService() calls FindPeers(), which in turn calls FindProvidersAsync() from IpfsDHT.

There's a note in the implementation (https://github.com/libp2p/go-libp2p-kad-dht/blob/master/routing.go#L503) warning that not reading from the returned channel may block the query from progressing, so we must make sure we read from the channel ASAP. We could also time-limit the query by using a context with a timeout.
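
A minimal sketch of both points, assuming names like `kadDHT` and `serviceCID` for whatever the LCAManager actually holds (these identifiers and the package name are placeholders, not existing project code):

```go
package lcautil // illustrative package name

import (
	"context"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p-core/peer"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// findProviders bounds the whole query with a context timeout and drains the
// channel as soon as results arrive, so the underlying query is never blocked
// on an unread channel.
func findProviders(kadDHT *dht.IpfsDHT, serviceCID cid.Cid, timeout time.Duration) []peer.AddrInfo {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var providers []peer.AddrInfo
	// The channel closes when the query finishes or the context expires.
	for p := range kadDHT.FindProvidersAsync(ctx, serviceCID, 0) {
		providers = append(providers, p)
	}
	return providers
}
```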

Will do more investigating to see if this is still really an issue and, if so, where the root cause is.

t-lin commented Jul 11, 2020

When many peers have come and gone (dozens to hundreds), DHT searches appear to slow down over time. Likely the root cause is within the implementation of libp2p's kad DHT. Could try updating to the latest version of the kad DHT and see if it solves the problem, but need to develop a static test to replicate this issue first... perhaps a simple program that creates many nodes, performs searches, and times how the latency grows over time (rough sketch below).
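
A rough sketch of what such a test could look like, assuming a 2020-era go-libp2p where `libp2p.New` and `dht.New` take a context; the package and function names, connection topology, and timeouts are all illustrative, not existing project code:

```go
package dhtchurntest // illustrative

import (
	"context"
	"fmt"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/peer"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/multiformats/go-multihash"
)

// runChurnTest spins up n short-lived hosts that all provide the same CID and
// then shut down, leaving dead provider records behind. It then times repeated
// provider searches to see whether latency grows with the number of dead peers.
func runChurnTest(ctx context.Context, n int) error {
	mh, _ := multihash.Sum([]byte("hello-world"), multihash.SHA2_256, -1)
	key := cid.NewCidV1(cid.Raw, mh)

	searcher, err := libp2p.New(ctx)
	if err != nil {
		return err
	}
	searcherDHT, err := dht.New(ctx, searcher)
	if err != nil {
		return err
	}

	for i := 0; i < n; i++ {
		h, err := libp2p.New(ctx)
		if err != nil {
			return err
		}
		d, err := dht.New(ctx, h)
		if err != nil {
			return err
		}
		// Connect to the searcher so the provider record reaches it.
		if err := h.Connect(ctx, peer.AddrInfo{ID: searcher.ID(), Addrs: searcher.Addrs()}); err != nil {
			return err
		}
		if err := d.Provide(ctx, key, true); err != nil {
			return err
		}
		h.Close() // peer goes away; its provider record does not
	}

	// Time searches over several runs to see whether they slow down.
	for run := 0; run < 10; run++ {
		start := time.Now()
		qctx, cancel := context.WithTimeout(ctx, 30*time.Second)
		for range searcherDHT.FindProvidersAsync(qctx, key, 0) {
		}
		cancel()
		fmt.Printf("run %d: search took %v\n", run, time.Since(start))
	}
	return nil
}
```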

t-lin commented Jul 12, 2020

Created a separate VM with an isolated network to test various scenarios, and ran many tests for each. Each test consists of curl'ing the local proxy (in client mode) and trying to get the results from the hello-world example.

In the following, "fail" means the search wasn't able to find an existing instance in the DHT before the proxy times out and returns. The proxy gives up after 3 attempts at deploying and then searching for the newly spawned service.

  1. Fresh deployment of everything: searches fail occasionally at first, then fail consistently starting on run 116+
  2. Restarting bootstrap (after scenario 1): Does nothing, still fails consistently
  3. Restarting the LCA (after scenario 1): Does nothing, still fails consistently
  4. Restarting both bootstrap and allocator together (after scenario 1): Does nothing
  5. Restarting the proxy (after scenario 1):
    • Starts succeeding again, but fails consistently starting on run 119+
    • Same test (but with CPU profiling): starts succeeding again, but fails consistently starting on run 112+

t-lin commented Jul 12, 2020

Profiling indicates it spends most of its time in SortPeers() performing the Ping() operation, which requires dialing each peer. Refactoring to run the pings in parallel didn't yield any improvement (a rough sketch of that kind of parallel ping pass is below).
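
A sketch of what a parallelized ping pass could look like, assuming the standard go-libp2p ping protocol; the function name, per-peer timeout, and surrounding code are illustrative and not the actual SortPeers() refactor:

```go
package lcautil // illustrative

import (
	"context"
	"sync"
	"time"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
	"github.com/libp2p/go-libp2p/p2p/protocol/ping"
)

// pingAll pings every peer concurrently with a short per-peer timeout and
// returns the RTTs of the peers that answered; dead peers simply time out
// instead of serializing the whole pass.
func pingAll(ctx context.Context, h host.Host, peers []peer.ID, timeout time.Duration) map[peer.ID]time.Duration {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		rtts = make(map[peer.ID]time.Duration)
	)

	for _, p := range peers {
		wg.Add(1)
		go func(p peer.ID) {
			defer wg.Done()
			pctx, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()

			select {
			case res, ok := <-ping.Ping(pctx, h, p):
				if ok && res.Error == nil {
					mu.Lock()
					rtts[p] = res.RTT
					mu.Unlock()
				}
			case <-pctx.Done():
				// No answer within the timeout; treat the peer as dead.
			}
		}(p)
	}
	wg.Wait()
	return rtts
}
```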

Suspect the root cause is likely the DHT itself when too many instances have provided the same hash and all but one of them are dead. But if that were the case, wouldn't the time-to-result increase gradually over time? Instead it always seems to start failing consistently only around run 110-120. Tried setting the DHT's MaxRecordAge and RoutingTableRefreshPeriod options to 1 minute, but the problem persists.
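
A minimal sketch of how those two options can be set at DHT construction time, assuming a 2020-era go-libp2p-kad-dht where both options live in the dht package; the wrapper function is illustrative:

```go
package lcautil // illustrative

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p-core/host"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// newShortLivedDHT builds a DHT whose stored records expire after a minute and
// whose routing table is refreshed every minute, in the hope that stale
// provider records and dead peers age out quickly.
func newShortLivedDHT(ctx context.Context, h host.Host) (*dht.IpfsDHT, error) {
	return dht.New(ctx, h,
		dht.MaxRecordAge(time.Minute),              // records stored by this node expire after 1 min
		dht.RoutingTableRefreshPeriod(time.Minute), // refresh the routing table every minute
	)
}
```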
