Conversation

@Ostrzyciel
Contributor

GitHub issue resolved: #5120

This is an issue reported ~a year ago where RDF4J would deadlock while processing federated joins with large intermediate result sets.

I found that the issue is with BatchingServiceIteration, which is used within a federated join. When the constructor for this iteration is run, it eagerly sends out HTTP requests for the join to the other endpoint. Normally, this succeeds and the iteration results are consumed. Note that the consumer of the iteration is on the same thread as the code that starts the HTTP requests.

The assumption here (I think) is that we can send all of the requests and forget about them. However, the HTTP pool has a limited size (25 by default); once it is exhausted, the iteration is stuck waiting for some request to finish. A request can only finish once its results are consumed by the parent iteration, and that can’t happen, because the consumer is on the same thread.
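
To make the pattern concrete, here is a minimal Java sketch of the deadlock shape described above. The class and method names are hypothetical (this is not the RDF4J code); the semaphore stands in for the bounded HTTP connection pool, where a connection is only returned once its response has been consumed.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the deadlock pattern, not the actual RDF4J classes.
public class EagerJoinDeadlockSketch {

    static final Semaphore connections = new Semaphore(25); // default HTTP pool size
    static final Queue<String> pendingResponses = new ArrayDeque<>();

    public static void main(String[] args) throws InterruptedException {
        int batches = 100; // large intermediate result set -> many batches

        // "Constructor" phase: eagerly send every request on the current thread.
        for (int i = 0; i < batches; i++) {
            connections.acquire();            // blocks forever once 25 are in flight
            pendingResponses.add(sendRequest(i));
        }

        // Consuming phase: would release connections, but it is only reached
        // after ALL requests were sent -- the loop above never gets past #25.
        while (!pendingResponses.isEmpty()) {
            consume(pendingResponses.poll());
            connections.release();            // connection returned on consumption
        }
    }

    static String sendRequest(int batch) {
        return "response for batch " + batch; // pretend HTTP call
    }

    static void consume(String response) {
        System.out.println(response);
    }
}
```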

As far as I can tell, no other operator in RDF4J does eager evaluation like this, so it is not surprising that this is broken.

To fix this, I created a secondary queue for federated requests, and a thread that reads from this queue and sends the requests to the HTTP pool. The queue is unbounded, so this may materialize the other side of the join in memory pretty aggressively. However, (1) this was the original behavior to start with, so I'm not making this worse; (2) I don't see an easy way around this without refactoring the federated join code in a major way to allow async processing. Maybe someone has a better idea.
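
A rough sketch of that first approach, again with hypothetical names: the iteration only enqueues requests, and a dedicated dispatcher thread drains the unbounded queue and blocks on the HTTP pool, so the consuming thread stays free to drain results.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the "secondary queue + dispatcher thread" fix.
public class QueuedRequestDispatcher {

    private final BlockingQueue<Runnable> requestQueue = new LinkedBlockingQueue<>(); // unbounded
    private final Thread dispatcher;

    public QueuedRequestDispatcher() {
        dispatcher = new Thread(() -> {
            try {
                while (true) {
                    // May block waiting for a free HTTP connection -- but that
                    // only stalls this thread, not the thread consuming results.
                    requestQueue.take().run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "federated-request-dispatcher");
        dispatcher.setDaemon(true);
        dispatcher.start();
    }

    // Called from the iteration instead of sending the request directly.
    public void enqueue(Runnable sendRequest) {
        requestQueue.add(sendRequest); // never blocks: the queue is unbounded
    }
}
```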

In any case, this does work. I ran it with the repro test case (https://github.com/tkuhn/rdf4j-timeout-test) and now the queries complete without issues, even if I issue a lot of them in parallel.


PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

@Ostrzyciel
Contributor Author

@hmottestad could you have a look again? I think all issues are solved, but I left them open just in case.

@Ostrzyciel
Contributor Author

We've found a bug in this... converting the PR to a draft while we investigate. I will also work on adding tests.

@Ostrzyciel force-pushed the GH-5120-fix-federation-deadlock branch from 18dc62e to b0cd9a6 on December 12, 2025 at 20:10
@Ostrzyciel marked this pull request as ready for review on December 12, 2025 at 20:11
@Ostrzyciel
Contributor Author

Ostrzyciel commented Dec 12, 2025

@hmottestad we've found a bug in the previous solution: it terminated the iteration too early, so we were getting incomplete results... that's completely my fault for not checking the output carefully.

Now, when running the repro case (https://github.com/tkuhn/rdf4j-timeout-test) we are getting the full results (30 rows for query 1, 1000 for query 2).

This solution is even simpler, because we just tell the engine to execute this iteration in a parallel thread. This lets us keep a blocking implementation of the iteration without affecting other parts of query evaluation. It also uses limited memory, because the HTTP pool is limited in size.

I THINK this might have been the original way this code was supposed to work, with run() being called on a new thread. The base class (JoinExecutorBase) is built exactly for this approach, but it wasn't being used that way here. I have no idea why, maybe something was lost when FedX was upstreamed?
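
For illustration, here is a simplified sketch of that pattern (hypothetical names, not the RDF4J implementation): the blocking join logic runs on its own thread and pushes results into a bounded buffer, while the calling thread just iterates over the buffer. Memory stays bounded because the producer blocks once the buffer (and, in the real code, the HTTP pool) is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of running the blocking iteration on a parallel thread.
public class ParallelIterationSketch {

    private static final String DONE = "<done>";
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(25);

    public ParallelIterationSketch() {
        Thread producer = new Thread(this::run, "parallel-join-executor");
        producer.setDaemon(true);
        producer.start();
    }

    // The formerly-eager logic: send requests and push results, blocking freely.
    private void run() {
        try {
            for (int batch = 0; batch < 100; batch++) {
                buffer.put("response for batch " + batch); // pretend HTTP round-trip
            }
            buffer.put(DONE); // signal end of results
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // The consumer side, running on the query-evaluation thread.
    public void drain() throws InterruptedException {
        for (String next = buffer.take(); !next.equals(DONE); next = buffer.take()) {
            System.out.println(next);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        new ParallelIterationSketch().drain();
    }
}
```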

Anyway, I could not figure out how to write automated tests for this, sorry. But I made sure that this time it really does work without deadlocks or incomplete results.
