Too many dead peers when querying for hash, should eventually prune after a certain time #19
Upon closer look, this does not seem related to PhysarumSM/monitoring#3. There's a note in the implementation (https://github.com/libp2p/go-libp2p-kad-dht/blob/master/routing.go#L503) that says not reading from the channel may block the query from progressing, so we must make sure we read from the channel ASAP. We could also time-limit the query by using a context with a timeout. Will do more investigating to see if this is still really an issue and, if so, where the root cause is.
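A minimal sketch of both ideas, assuming a `*dht.IpfsDHT` handle and a `cid.Cid` key; the function name, the 30-second timeout, and the provider count of 10 are illustrative, not taken from the proxy code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// findProviders bounds the whole lookup with a timeout and drains the async
// provider channel as soon as results arrive, so the query is never left
// blocked on an unread channel.
func findProviders(d *dht.IpfsDHT, serviceCID cid.Cid) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Per the note in routing.go, failing to read from this channel can
	// block the query from progressing, so consume it immediately.
	for p := range d.FindProvidersAsync(ctx, serviceCID, 10) {
		fmt.Println("found provider:", p.ID)
	}
}
```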
When many peers have come and gone (dozens to hundreds), the DHT search appears to slow down over time. The likely root cause is within libp2p's kad DHT implementation. We could try updating to the latest version of kad DHT and see if that solves the problem, but we first need a static test that replicates the issue: perhaps a simple program that creates many nodes, performs a search, and times how the latency grows over time (sketched below).
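A rough sketch of such a test, assuming current go-libp2p / go-libp2p-kad-dht APIs. Peer connection wiring is omitted for brevity (as written, the nodes are not actually joined into one network), and the key, loop count, and names are made up for illustration:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/multiformats/go-multihash"
)

// One long-lived searcher plus a churn of short-lived providers for the same
// hash. If lookup latency is driven by dead peers accumulating in the DHT,
// the elapsed time should grow with the iteration count.
func main() {
	ctx := context.Background()

	searcher, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	searcherDHT, err := dht.New(ctx, searcher)
	if err != nil {
		log.Fatal(err)
	}

	// A fixed key that every churning node will advertise.
	mh, _ := multihash.Sum([]byte("test-service"), multihash.SHA2_256, -1)
	key := cid.NewCidV1(cid.Raw, mh)

	for i := 0; i < 200; i++ {
		// Spawn a provider, announce the key, then tear the node down so
		// the network accumulates dead peers that once provided the hash.
		prov, err := libp2p.New()
		if err != nil {
			log.Fatal(err)
		}
		provDHT, err := dht.New(ctx, prov)
		if err != nil {
			log.Fatal(err)
		}
		// (The searcher and providers would need to be connected to each
		// other here for this to exercise a real network; omitted.)
		_ = provDHT.Provide(ctx, key, true)
		prov.Close()

		// Time the search from the long-lived node.
		start := time.Now()
		_, _ = searcherDHT.FindProviders(ctx, key)
		fmt.Printf("run %d: lookup took %v\n", i, time.Since(start))
	}
}
```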
Created a separate VM with an isolated network to test various scenarios, and ran many tests for each. Each test consists of curl'ing the local proxy (in client mode) and trying to get results back. In the following, "fail" means the proxy wasn't able to find an existing instance via the DHT before timing out and returning. The proxy times out after 3 deployment attempts and the subsequent searches for the newly spawned service.
Profiling seems to indicate most of the time is spent in the DHT query itself. Suspect the root cause is the DHT having too many instances offering the same hash while all but one of them are down. But if that were the case, wouldn't the time to a result increase gradually over time? Instead it consistently starts failing only around run 110-120. Tried setting the DHT's MaxRecordAge and RoutingTableRefreshPeriod options to 1 minute (see the sketch below), but the problem persists.
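For reference, a sketch of how those two options can be wired in, assuming the option-style constructor from current go-libp2p-kad-dht (older releases expose the same knobs through the dhtopts package); the host setup is illustrative:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func main() {
	ctx := context.Background()
	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	// The two knobs tried above, both set to one minute: provider records
	// expire faster, and the routing table is refreshed more often.
	// (Neither resolved the failures observed in these tests.)
	kadDHT, err := dht.New(ctx, h,
		dht.MaxRecordAge(time.Minute),
		dht.RoutingTableRefreshPeriod(time.Minute),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer kadDHT.Close()
}
```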
Similar issue affecting the ping monitor (PhysarumSM/monitoring#3). Will fix it there and propagate the fix here later.
Does not seem to be related to #3. See further comments below.