Too many dead peers when querying for hash, should eventually prune after a certain time #19

Open
t-lin opened this issue May 15, 2020 · 4 comments
Assignees: t-lin
Labels: enhancement (New feature or request), investigate (This doesn't seem right (triage and analyze)), Low Priority

Comments

t-lin commented May 15, 2020

Similar issue affecting the ping monitor (PhysarumSM/monitoring#3).
Will fix it there and propagate the fix here later.

Does not seem to be related to #3. See further comments below.

t-lin self-assigned this May 15, 2020
t-lin added the enhancement (New feature or request) label May 15, 2020
t-lin added the question (Further information is requested) and investigate (This doesn't seem right (triage and analyze)) labels and removed the question label May 17, 2020

t-lin commented May 17, 2020

Upon closer look, this does not seem to be related to PhysarumSM/monitoring#3.
LCAManager's FindService() calls FindPeers(), which in turn calls FindProvidersAsync() from IpfsDHT.

There's a note in the implementation (https://github.com/libp2p/go-libp2p-kad-dht/blob/master/routing.go#L503) warning that not reading from the returned channel may block the query from progressing, so we must make sure we read from the channel ASAP. We could also time-limit the query by using a context with a timeout.
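
A minimal sketch of both points, assuming names like `kadDHT` and `serviceCID` for whatever the LCAManager actually holds (these identifiers and the package name are placeholders, not existing project code):

```go
package lcautil // illustrative package name

import (
	"context"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p-core/peer"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// findProviders bounds the whole query with a context timeout and drains the
// channel as soon as results arrive, so the underlying query is never blocked
// on an unread channel.
func findProviders(kadDHT *dht.IpfsDHT, serviceCID cid.Cid, timeout time.Duration) []peer.AddrInfo {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var providers []peer.AddrInfo
	// The channel closes when the query finishes or the context expires.
	for p := range kadDHT.FindProvidersAsync(ctx, serviceCID, 0) {
		providers = append(providers, p)
	}
	return providers
}
```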

Will do more investigating to see if this is still really an issue and, if so, where the root cause is.

t-lin commented Jul 11, 2020

When many peers have come and gone (dozens to hundreds), DHT searches appear to slow down over time. Likely the root cause is within the implementation of libp2p's kad DHT. Could try updating to the latest version of the kad DHT and see if it solves the problem, but need to develop a static test to replicate this issue first... perhaps a simple program that creates many nodes, performs searches, and times how the latency grows over time (rough sketch below).
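
A rough sketch of what such a test could look like, assuming a 2020-era go-libp2p where `libp2p.New` and `dht.New` take a context; the package and function names, connection topology, and timeouts are all illustrative, not existing project code:

```go
package dhtchurntest // illustrative

import (
	"context"
	"fmt"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/peer"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/multiformats/go-multihash"
)

// runChurnTest spins up n short-lived hosts that all provide the same CID and
// then shut down, leaving dead provider records behind. It then times repeated
// provider searches to see whether latency grows with the number of dead peers.
func runChurnTest(ctx context.Context, n int) error {
	mh, _ := multihash.Sum([]byte("hello-world"), multihash.SHA2_256, -1)
	key := cid.NewCidV1(cid.Raw, mh)

	searcher, err := libp2p.New(ctx)
	if err != nil {
		return err
	}
	searcherDHT, err := dht.New(ctx, searcher)
	if err != nil {
		return err
	}

	for i := 0; i < n; i++ {
		h, err := libp2p.New(ctx)
		if err != nil {
			return err
		}
		d, err := dht.New(ctx, h)
		if err != nil {
			return err
		}
		// Connect to the searcher so the provider record reaches it.
		if err := h.Connect(ctx, peer.AddrInfo{ID: searcher.ID(), Addrs: searcher.Addrs()}); err != nil {
			return err
		}
		if err := d.Provide(ctx, key, true); err != nil {
			return err
		}
		h.Close() // peer goes away; its provider record does not
	}

	// Time searches over several runs to see whether they slow down.
	for run := 0; run < 10; run++ {
		start := time.Now()
		qctx, cancel := context.WithTimeout(ctx, 30*time.Second)
		for range searcherDHT.FindProvidersAsync(qctx, key, 0) {
		}
		cancel()
		fmt.Printf("run %d: search took %v\n", run, time.Since(start))
	}
	return nil
}
```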

t-lin commented Jul 12, 2020

Created a separate VM with an isolated network to test various scenarios, and ran many tests for each. Each test consists of curl'ing the local proxy (in client mode) and trying to get the results from the hello-world example.

In the following, "fail" means the search wasn't able to find an existing instance in the DHT before the proxy times out and returns. The proxy gives up after 3 attempts at deploying and then searching for the newly spawned service.

  1. Fresh deployment of everything: searches fail occasionally at first, then fail consistently starting on run 116+
  2. Restarting bootstrap (after scenario 1): Does nothing, still fails consistently
  3. Restarting the LCA (after scenario 1): Does nothing, still fails consistently
  4. Restarting both bootstrap and allocator together (after scenario 1): Does nothing
  5. Restarting the proxy (after scenario 1):
    • Starts succeeding again, but fails consistently starting on run 119+
    • Same test (but with CPU profiling): starts succeeding again, but fails consistently starting on run 112+

t-lin commented Jul 12, 2020

Profiling indicates it spends most of its time in SortPeers() performing the Ping() operation, which requires dialing each peer. Refactoring to run the pings in parallel didn't yield any improvement (a rough sketch of that kind of parallel ping pass is below).
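
A sketch of what a parallelized ping pass could look like, assuming the standard go-libp2p ping protocol; the function name, per-peer timeout, and surrounding code are illustrative and not the actual SortPeers() refactor:

```go
package lcautil // illustrative

import (
	"context"
	"sync"
	"time"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
	"github.com/libp2p/go-libp2p/p2p/protocol/ping"
)

// pingAll pings every peer concurrently with a short per-peer timeout and
// returns the RTTs of the peers that answered; dead peers simply time out
// instead of serializing the whole pass.
func pingAll(ctx context.Context, h host.Host, peers []peer.ID, timeout time.Duration) map[peer.ID]time.Duration {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		rtts = make(map[peer.ID]time.Duration)
	)

	for _, p := range peers {
		wg.Add(1)
		go func(p peer.ID) {
			defer wg.Done()
			pctx, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()

			select {
			case res, ok := <-ping.Ping(pctx, h, p):
				if ok && res.Error == nil {
					mu.Lock()
					rtts[p] = res.RTT
					mu.Unlock()
				}
			case <-pctx.Done():
				// No answer within the timeout; treat the peer as dead.
			}
		}(p)
	}
	wg.Wait()
	return rtts
}
```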

Suspect the root cause is likely the DHT itself when too many instances have provided the same hash and all but one of them are dead. But if that were the case, wouldn't the time-to-result increase gradually over time? Instead it always seems to start failing consistently only around run 110-120. Tried setting the DHT's MaxRecordAge and RoutingTableRefreshPeriod options to 1 minute, but the problem persists.
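
A minimal sketch of how those two options can be set at DHT construction time, assuming a 2020-era go-libp2p-kad-dht where both options live in the dht package; the wrapper function is illustrative:

```go
package lcautil // illustrative

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p-core/host"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// newShortLivedDHT builds a DHT whose stored records expire after a minute and
// whose routing table is refreshed every minute, in the hope that stale
// provider records and dead peers age out quickly.
func newShortLivedDHT(ctx context.Context, h host.Host) (*dht.IpfsDHT, error) {
	return dht.New(ctx, h,
		dht.MaxRecordAge(time.Minute),              // records stored by this node expire after 1 min
		dht.RoutingTableRefreshPeriod(time.Minute), // refresh the routing table every minute
	)
}
```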
