Conversation

@Ostrzyciel
Contributor

GitHub issue resolved: #5120

This is an issue reported ~a year ago where RDF4J would deadlock while processing federated joins with large intermediate result sets.

I found that the issue is with BatchingServiceIteration, which is used within a federated join. When the constructor for this iteration is run, it eagerly sends out HTTP requests for the join to the other endpoint. Normally, this succeeds and the iteration results are consumed. Note that the consumer of the iteration is on the same thread as the code that starts the HTTP requests.

The assumption here (I think) is that we can send all of the requests and forget about them. However, the HTTP pool has a limited size (25 by default); once it is exhausted, the iteration is stuck waiting for some request to finish. A request can only finish once its results are consumed by the parent iteration, and that can’t happen, because the consumer is on the same thread.
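
To make the pattern concrete, here is a minimal Java sketch of the deadlock shape described above. The class and method names are hypothetical (this is not the RDF4J code); the semaphore stands in for the bounded HTTP connection pool, where a connection is only returned once its response has been consumed.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the deadlock pattern, not the actual RDF4J classes.
public class EagerJoinDeadlockSketch {

    static final Semaphore connections = new Semaphore(25); // default HTTP pool size
    static final Queue<String> pendingResponses = new ArrayDeque<>();

    public static void main(String[] args) throws InterruptedException {
        int batches = 100; // large intermediate result set -> many batches

        // "Constructor" phase: eagerly send every request on the current thread.
        for (int i = 0; i < batches; i++) {
            connections.acquire();            // blocks forever once 25 are in flight
            pendingResponses.add(sendRequest(i));
        }

        // Consuming phase: would release connections, but it is only reached
        // after ALL requests were sent -- the loop above never gets past #25.
        while (!pendingResponses.isEmpty()) {
            consume(pendingResponses.poll());
            connections.release();            // connection returned on consumption
        }
    }

    static String sendRequest(int batch) {
        return "response for batch " + batch; // pretend HTTP call
    }

    static void consume(String response) {
        System.out.println(response);
    }
}
```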

As far as I can tell, no other operator in RDF4J does eager evaluation like this, so it is not surprising that this is broken.

To fix this, I created a secondary queue for federated requests, and a thread that reads from this queue and sends the requests to the HTTP pool. The queue is unbounded, so this may materialize the other side of the join in memory pretty aggressively. However, (1) this was the original behavior to start with, so I'm not making this worse; (2) I don't see an easy way around this without refactoring the federated join code in a major way to allow async processing. Maybe someone has a better idea.
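
A rough sketch of that first approach, again with hypothetical names: the iteration only enqueues requests, and a dedicated dispatcher thread drains the unbounded queue and blocks on the HTTP pool, so the consuming thread stays free to drain results.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the "secondary queue + dispatcher thread" fix.
public class QueuedRequestDispatcher {

    private final BlockingQueue<Runnable> requestQueue = new LinkedBlockingQueue<>(); // unbounded
    private final Thread dispatcher;

    public QueuedRequestDispatcher() {
        dispatcher = new Thread(() -> {
            try {
                while (true) {
                    // May block waiting for a free HTTP connection -- but that
                    // only stalls this thread, not the thread consuming results.
                    requestQueue.take().run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "federated-request-dispatcher");
        dispatcher.setDaemon(true);
        dispatcher.start();
    }

    // Called from the iteration instead of sending the request directly.
    public void enqueue(Runnable sendRequest) {
        requestQueue.add(sendRequest); // never blocks: the queue is unbounded
    }
}
```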

In any case, this does work. I ran it with the repro test case (https://github.com/tkuhn/rdf4j-timeout-test) and now the queries complete without issues, even if I issue a lot of them in parallel.


PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

@Ostrzyciel
Contributor Author

@hmottestad could you have a look again? I think all issues are solved, but I left them open just in case.

@Ostrzyciel
Contributor Author

We've found a bug in this... converting the PR to a draft while we investigate. I will also work on adding tests.

@Ostrzyciel force-pushed the GH-5120-fix-federation-deadlock branch from 18dc62e to b0cd9a6 on December 12, 2025 at 20:10
@Ostrzyciel marked this pull request as ready for review on December 12, 2025 at 20:11
@Ostrzyciel
Contributor Author

Ostrzyciel commented Dec 12, 2025

@hmottestad we've found a bug in the previous solution: it terminated the iteration too early, so we were getting incomplete results... that's completely my fault for not checking the output carefully.

Now, when running the repro case (https://github.com/tkuhn/rdf4j-timeout-test) we are getting the full results (30 rows for query 1, 1000 for query 2).

This solution is even simpler, because we just tell the engine to execute this iteration in a parallel thread. This lets us keep a blocking implementation of the iteration without affecting other parts of query evaluation. It also uses limited memory, because the HTTP pool is limited in size.

I THINK this might have been the original way this code was supposed to work, with run() being called on a new thread. The base class (JoinExecutorBase) is built exactly for this approach, but it wasn't being used that way here. I have no idea why, maybe something was lost when FedX was upstreamed?
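
For illustration, here is a simplified sketch of that pattern (hypothetical names, not the RDF4J implementation): the blocking join logic runs on its own thread and pushes results into a bounded buffer, while the calling thread just iterates over the buffer. Memory stays bounded because the producer blocks once the buffer (and, in the real code, the HTTP pool) is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of running the blocking iteration on a parallel thread.
public class ParallelIterationSketch {

    private static final String DONE = "<done>";
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(25);

    public ParallelIterationSketch() {
        Thread producer = new Thread(this::run, "parallel-join-executor");
        producer.setDaemon(true);
        producer.start();
    }

    // The formerly-eager logic: send requests and push results, blocking freely.
    private void run() {
        try {
            for (int batch = 0; batch < 100; batch++) {
                buffer.put("response for batch " + batch); // pretend HTTP round-trip
            }
            buffer.put(DONE); // signal end of results
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // The consumer side, running on the query-evaluation thread.
    public void drain() throws InterruptedException {
        for (String next = buffer.take(); !next.equals(DONE); next = buffer.take()) {
            System.out.println(next);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        new ParallelIterationSketch().drain();
    }
}
```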

Anyway, I could not figure out how to write automated tests for this, sorry. But I made sure that this time it really does work without deadlocks or incomplete results.
