GH-5120: Fix SPARQL federation deadlock #5571
Open
GitHub issue resolved: #5120
This is an issue reported ~a year ago where RDF4J would deadlock while processing federated joins with large intermediate result sets.
I found that the problem is in BatchingServiceIteration, which is used within a federated join. The constructor of this iteration eagerly sends the HTTP requests for the join to the other endpoint. Normally, this succeeds and the iteration results are consumed. Note that the consumer of the iteration runs on the same thread as the code that starts the HTTP requests.
The assumption here (I think) is that we can send all of the requests and forget about them. However, the HTTP pool has a limited size (25 by default), and once it is exhausted, the iteration blocks waiting for some request to finish. But a request can only finish once its results are consumed by the parent iteration, and that can't happen, because the consumer is on the same thread that is now blocked.
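To make the mechanism concrete, here is a minimal single-threaded sketch of the stall. It is not RDF4J code: the `Semaphore` stands in for the HTTP connection pool, and the class and method names are hypothetical. It uses a non-blocking `tryAcquire()` so the example reports where the real code would block instead of actually hanging.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

// Hypothetical model of the problem, not the actual BatchingServiceIteration.
class EagerSubmitStall {

    /**
     * Eagerly "sends" the given number of requests through a pool with
     * poolSize slots, and only afterwards starts consuming results on the
     * same thread. Returns how many requests were sent before the sender
     * would have blocked (equal to `requests` if it never blocks).
     */
    static int sendEagerly(int poolSize, int requests) {
        Semaphore pool = new Semaphore(poolSize);      // stand-in for the HTTP pool
        Queue<Integer> inFlight = new ArrayDeque<>();  // requests holding a slot

        for (int i = 0; i < requests; i++) {
            // The real code would block here. Nothing else can release a slot,
            // because the consumer below runs on this very thread.
            if (!pool.tryAcquire()) {
                return i;
            }
            inFlight.add(i);
        }

        // Consumption starts only after all requests were sent.
        while (!inFlight.isEmpty()) {
            inFlight.poll();
            pool.release(); // consuming a result frees its slot
        }
        return requests;
    }

    public static void main(String[] args) {
        System.out.println("small join sent: " + sendEagerly(25, 10));
        System.out.println("large join sent before stalling: " + sendEagerly(25, 100));
    }
}
```

With fewer requests than pool slots the loop completes and the results are consumed; with more, the sender stops at the pool size and never reaches the consuming loop.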
I don’t think any other operator in RDF4J does eager evaluation, so it is understandable that this case went unnoticed and breaks.
To fix this, I created a secondary queue for federated requests, and then a thread that reads from this queue and sends the request to the HTTP pool. The queue is unbounded, so this may materialize the other side of the join in memory pretty aggressively. However, (1) this was the original behavior to start with, so I'm not making this worse; (2) I don't see an easy way around this without refactoring the federated join code in a major way to allow async processing. Maybe someone has a better idea.
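The shape of the fix can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this PR: the `Semaphore` again stands in for the HTTP pool, a second queue stands in for responses, and all names are hypothetical. The point it illustrates is that only the dispatcher thread ever blocks on the pool, so the calling thread stays free to consume results and release slots.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the queue-plus-dispatcher approach, not RDF4J code.
class QueuedDispatchSketch {

    /** Returns the number of results the calling thread managed to consume. */
    static int runJoin(int poolSize, int requests) {
        Semaphore pool = new Semaphore(poolSize);                        // stand-in for the HTTP pool
        BlockingQueue<Integer> pending = new LinkedBlockingQueue<>();    // unbounded request queue
        BlockingQueue<Integer> responses = new LinkedBlockingQueue<>();  // stand-in for open responses

        // Calling thread: enqueueing into an unbounded queue never blocks.
        for (int i = 0; i < requests; i++) {
            pending.add(i);
        }

        // Dispatcher thread: the only place that waits for a free pool slot.
        Thread dispatcher = new Thread(() -> {
            try {
                for (int sent = 0; sent < requests; sent++) {
                    int id = pending.take();
                    pool.acquire();    // may block, but not on the consumer's thread
                    responses.put(id); // the "response" holds its slot until consumed
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        dispatcher.setDaemon(true);
        dispatcher.start();

        int consumed = 0;
        try {
            while (consumed < requests) {
                Integer id = responses.poll(5, TimeUnit.SECONDS);
                if (id == null) {
                    break; // would indicate a stall
                }
                pool.release(); // consuming a result frees its slot
                consumed++;
            }
            dispatcher.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println("consumed: " + runJoin(25, 100));
    }
}
```

As in the description above, the pending queue is unbounded, so memory use grows with the number of queued requests; the sketch makes no attempt to bound it.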
In any case, this does work. I ran it with the repro test case (https://github.com/tkuhn/rdf4j-timeout-test) and now the queries complete without issues, even if I issue a lot of them in parallel.
PR Author Checklist (see the contributor guidelines for more details):
mvn process-resources to format from the command line)