Skip to content

GH-3026: spatial index per graph and kryo serialization #3027

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Aklakan
Copy link
Contributor

@Aklakan Aklakan commented Feb 21, 2025

GitHub issue resolved #3026

Pull request Description: Adds support for spatial index per graph and kryo serialization.

  • Updated jena-geosparql with spatial index-per-graph (STRtreePerGraph.java) structure and kryo serialization.

  • The upgrade should be backward compatible: If reading of a spatial index file fails then the old (slow) deserialization method will be attempted.

  • Added jena-fuseki-mod-geosparql module under jena-fuseki2. This module comes with a simple HTML page where users can pick which graphs to update in the index:
    image

  • Added jena-fuseki-mod-geosparql to jena-fuseki-server. This bundles the fmod (and geosparql) with Fuseki, but --modules=true must still be set to enable it!


  • Additional tests for the GeoSPARQL module are included.
  • Tests for the spatial-indexer fuseki module and its UI are included. Note, that the test uses Selenium to click a couple of buttons in the browser which requires a local Chrome. Therefore the test is set to ignore by default.
  • Benchmark setup is included.
  • Documentation change and updates are provided for the Apache Jena website
  • Commits have been squashed to remove intermediate development commit messages.
  • Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

@Aklakan Aklakan marked this pull request as draft February 21, 2025 20:46
@afs
Copy link
Member

afs commented Feb 26, 2025

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 3 times, most recently from 0829aa5 to 01bad5d Compare February 26, 2025 12:51
@Aklakan
Copy link
Contributor Author

Aklakan commented Feb 26, 2025

It should build now. Its about 90% done. One issue is, that @LorenzBuehmann used ShapeSerde from sedona-spark which actually only does JTS geometry de-/serialization. Probably it is faster than other approaches (otherwise it wouldn't exist) - but it causes enforcer conflicts and draws in quite a lot of dependencies for what it does.

Right now I switched to JTS native WKB serialization which makes the build work normally (in principle, the geometry-serializer classname could be made part of a spatial index metadata header - such a header does not exist yet).
I am waiting for input from Lorenz about the performance differences between the serializers.

I also added a little compatibility layer for the old index IO which used java serialization. So old index files should continue to load. Saving indexes will use the new serializer.

@afs
Copy link
Member

afs commented Feb 26, 2025

dependencies

The dependencies will need to be checked - kryo is 3-clause BSD so that has implications.

(I haven't looked at the new dependencies, just kyro.)

Apache Sedona has some entries in its LICENSE
https://github.com/apache/sedona/blob/ebd6f67c5029554055db6b81ffba5d3ecaad3b62/LICENSE

enforcer conflicts

These are a nuisance because they are not guaranteed to be maintained by dependency updates. 😞

@Aklakan
Copy link
Contributor Author

Aklakan commented Feb 26, 2025

About performance differences, @SimonBin found this article: https://kontinuation.one/posts/2022/12/improving-geometry-serde-performance-of-apache-sedona/

image

This suggests that Sedona's ShapeSerde is generally a lot faster. Not sure if the current implementation is perhaps already even one of the optimized variants.

@Aklakan
Copy link
Contributor Author

Aklakan commented Feb 26, 2025

Some notes about the remaining issues I am working on to resolve.

I am a bit worried that a hard dependency on sedona-spark may cause too many issues, so I am trying to decouple the geometry serde (serializer/deserializer) from the rest of the spatial-index serde.

For this, I added a plain text JSON header field to the spatial index format with a field for the geometry serde. This field can be read and used to configure the remaining kryo serde. The index reader makes a lookup with Class.forName() and tries to create an instance via the default constructor.

With this approach it would be possible to use the slower JTS-based geometry serialization for now and have a compatible upgrade path to a faster approach at a later point.

Perhaps the geometrySerde header field is not needed, because I see that in Sedona the implementation is serializable and apparently their approach is to just read/write the geometry serde instance out using plain Java serialization during the kryo serialization process. So I need to check what is the best way, but something along these lines should work to support the same spatial-index serde with different geometry serdes in a compatible way.

This snippet is an example for how the header could be set up:

// SpatialIndexIoKryo.java

    public static void writeToOutputStream(OutputStream os, SpatialIndexPerGraph index) {
        // geometrySerde could later be switched to sedona's ShapeSerde.
        // (Need to check whether I can readily reuse an interface
        // from JTS/Sedona instead of my own GeometrySerdeAdapter)
        GeometrySerdeAdapter geometrySerde = new GeometrySerdeAdapterJtsWkb(); 

        JsonObject header = new JsonObject();
        header.addProperty("type", "jena-spatial-index");
        header.addProperty("version", "2.0.0");
        header.addProperty("geometrySerde", geometrySerde.getClass().getName());

        Kryo kryo = new Kryo();
        JtsKryoRegistrator.registerClasses(kryo, geometrySerde);
        try (Output output = new Output(os)) {
            writeJson(output, header);
            writeIndex(kryo, output, index);
            output.flush();
        }

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 2 times, most recently from b0a9b8b to ae4ba13 Compare February 27, 2025 21:00
Comment on lines 31 to 32
/* This file is an adapted copy of org.locationtech.jts.index.strtree.IndexSerde. */
/* The change is, that geometry read/write is delegated to a GeometrySerdeAdapter instance. */
Copy link
Contributor Author

@Aklakan Aklakan Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is copy&adapt from sedona-common 1.7.0 (latest)- they actually used this package hack to access non-visible stuff. in the long run, jts should fix this design.

@Aklakan
Copy link
Contributor Author

Aklakan commented Feb 27, 2025

So I had to copy and adapt a total of 3 1 classes from sedona-common:

They all have an Apache V2 header - so it should be fine?
I also excluded nearly all dependencies from sedona-commons and the tests work.

@Aklakan
Copy link
Contributor Author

Aklakan commented Feb 28, 2025

The spatial index does not store any geometries - just their envelopes and feature nodes - so geometry serialization performance is not an issue in the first place 🤦‍♂️ - but good having talked about it 😀

  • I managed to remove sedona as a dependency alltogether - only sedona's copy for the STRtree serializer (from JTS) is needed.
  • Kryo is needed.
  • I added tests for the corner case where the same feature exists with different geometries in different graphs.
    For the original spatial index implementation, some results are incorrect - for the new one they are correct.
  • The kryo serializer for RDF1.2 Nodes is now also tested.

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 2 times, most recently from 52317c9 to fcae47a Compare March 2, 2025 02:08
@Aklakan
Copy link
Contributor Author

Aklakan commented Mar 2, 2025

I now also added the benchmark that evaluates loading a spatial index from disk - so w.r.t. index loading during server startup, the kryo version is (still) significantly faster.

Benchmark                   (param0_jenaVersion)  (param1_geometryMixes)  Mode  Cnt  Score   Error  Units
BenchmarkSpatialIndex.load               current                    1000  avgt    5  0.002 ± 0.001   s/op
BenchmarkSpatialIndex.load               current                   10000  avgt    5  0.006 ± 0.001   s/op
BenchmarkSpatialIndex.load               current                  100000  avgt    5  0.064 ± 0.004   s/op

BenchmarkSpatialIndex.load                 5.1.0                    1000  avgt    5  0.062 ± 0.009   s/op
BenchmarkSpatialIndex.load                 5.1.0                   10000  avgt    5  0.565 ± 0.471   s/op
BenchmarkSpatialIndex.load                 5.1.0                  100000  avgt    5  7.439 ± 0.442   s/op

A geometry mix comprises 8 different geometry types (point, linestring, linearring, polygon, multipoint, multilinestring, multipolygon, geometrycollection). So the actual number of envelopes in the spatial index is 8 x geometryMixes.

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 2 times, most recently from 16570ee to fe69be3 Compare March 2, 2025 14:13
@Aklakan
Copy link
Contributor Author

Aklakan commented Mar 3, 2025

@afs What would be the best way to add a Fuseki module for re-building the index for selected graphs? In an ideal world, the spatial index would be synchronized with the dataset, but for the time being we created a little Fuseki-Mod, where one can select the graphs for which to rebuild the index. Back then we forked your afs/fuseki-mods git repo - but this no longer exists. At first glance, it seems they were moved to jena-fuseki2/jena-fuseki-main/src/main/java/org/apache/jena/fuseki/mod/*. I suppose the mod could be added to jena-geosparql, but I think the geosparql support should be just a fuseki mod (IIRC jena-geosparql is not a mod) - perhaps with a bundle release for convenience, but not as a fork - so perhaps there is a better way to handle such a mod.

This is what the spatial index rebuilder looks like. The html is currently a single self-contained file (ai generated) - so no extra frontend build is needed.

image

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 3 times, most recently from 6c202b9 to 164b9f7 Compare April 6, 2025 08:25
@Aklakan
Copy link
Contributor Author

Aklakan commented Apr 29, 2025

Right, I'll look into cleaning up those warnings; especially I haven't checked the javadoc much yet. As for deprecation, I recall that adding forRemoval=true caused even more warnings, so I removed it again. As for the deprecation warnings related to SpatialIndexV1, I wanted to keep the implementation for compatibility and also comparison reasons, although it should be considered deprecated.

Not sure what do best do there; for the time being perhaps disable deprecation warnings for the module (if possible) or remove the deprecations for now.

@afs
Copy link
Member

afs commented Apr 29, 2025

No need to use forRemoval=true if it does not work for you.
As long as if something is planned for removal eventually, it has @deprecated of some kind.

(I find it a bit erratic as in javadoc across Eclispe and the maven javadoc plugin - it seems to be considered a different warning, sometimes)

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 4 times, most recently from adc16e6 to 3aa8be2 Compare May 1, 2025 13:27
@afs
Copy link
Member

afs commented May 3, 2025

Here are the Java warnings I am seeing in Eclipse:

Eclipse Warnings
The method load(File) from the type SpatialIndexV1 is deprecated
SpatialIndexIo.java
/jena-geosparql/src/main/java/org/apache/jena/geosparql/spatial/index/compat
line 51

The type SpatialIndexV1 is deprecated
SpatialIndexIo.java
/jena-geosparql/src/main/java/org/apache/jena/geosparql/spatial/index/compat
line 51

The type SpatialIndexV1 is deprecated
SpatialIndexIo.java
/jena-geosparql/src/main/java/org/apache/jena/geosparql/spatial/index/compat
line 51

Type safety: Unchecked cast from Object to Map<Node,STRtree>
STRtreePerGraphSerializer.java
/jena-geosparql/src/main/java/org/apache/jena/geosparql/spatial/index/v2
line 41

The method write(Kryo, Output, STRtree) of type STRtreeSerializer should be tagged with @Override since it actually overrides a superclass method
STRtreeSerializer.java
/jena-geosparql/src/main/java/org/locationtech/jts/index/strtree
line 68

Javadoc: Parameter includeWgs84 is not declared
QueryRewriteTestData.java
/jena-geosparql/src/test/java/org/apache/jena/geosparql/geo/topological
line 60

The import org.apache.jena.rdf.model.ModelFactory is never used
QueryRewriteTestData.java
/jena-geosparql/src/test/java/org/apache/jena/geosparql/geo/topological
line 27

The method buildSpatialIndex(Dataset, String) from the type SpatialIndexV1 is deprecated
TestSpatialIndexGraphLookupV1.java
/jena-geosparql/src/test/java/org/apache/jena/geosparql/spatial
line 32

The type SpatialIndexV1 is deprecated
TestSpatialIndexGraphLookupV1.java
/jena-geosparql/src/test/java/org/apache/jena/geosparql/spatial
line 32

The type SpatialIndexV1 is deprecated
TestSpatialIndexGraphLookupV1.java
/jena-geosparql/src/test/java/org/apache/jena/geosparql/spatial
line 32

The method buildSpatialIndex(Dataset, File) from the type SpatialIndexV1 is deprecated
SpatialIndexTest.java
/jena-geosparql/src/test/java/org/apache/jena/geosparql/spatial/index/v2
line 130

The method createTripleNode(Triple) from the type NodeFactory is deprecated
TestNodeSerializer.java
/jena-geosparql/src/test/java/org/apache/jena/geosparql/spatial/index/v2
line 74

The method save(File, Collection<SpatialIndexItem>, String) from the type SpatialIndexV1 is deprecated
SpatialIndexTest.java
/jena-geosparql/src/test/java/org/apache/jena/geosparql/spatial/index/v2
line 58

The type SpatialIndexV1 is deprecated
SpatialIndexTest.java
/jena-geosparql/src/test/java/org/apache/jena/geosparql/spatial/index/v2
line 58

The type SpatialIndexV1 is deprecated
SpatialIndexTest.java
/jena-geosparql/src/test/java/org/apache/jena/geosparql/spatial/index/v2
line 130

@afs
Copy link
Member

afs commented May 3, 2025

jena-fuseki-mod-geosparql is shaded.

Making a dependency of jena-fuseki-server produces a lot (!!) of overlap warnings.

Either:

  1. jena-fuseki-mod-geosparql could go in as module in lib/ in the apache-jena-fuseki packaging
  2. Not be shaded.

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 3 times, most recently from 1d1c564 to eb50b4f Compare May 5, 2025 13:53
@Aklakan
Copy link
Contributor Author

Aklakan commented May 5, 2025

No need to use forRemoval=true if it does not work for you.

I realized that when forRemoval=true is used then suppression requires @SuppressWarnings("removal") rather than @SuppressWarnings("deprecation").

All your reported warnings (and some more) should now be cleaned up - I updated my Eclipse error/warning level configuration accordingly.

@Aklakan
Copy link
Contributor Author

Aklakan commented May 5, 2025

jena-fuseki-mod-geosparql is shaded.

Making a dependency of jena-fuseki-server produces a lot (!!) of overlap warnings.

Either:

1. jena-fuseki-mod-geosparql could go in as module in `lib/` in the apache-jena-fuseki packaging

2. Not be shaded.

My intention was to go with 2 - but I had overlooked the shade configuration for the package goal. I moved the shading into a bundle profile akin to the setup in the service enhancer pom.xml.

The bundle produced by the jena-fuseki-mod-geosparql module should be suitable for the /lib folder (if that's desired) - but I yet need to test that.

@@ -116,6 +116,7 @@
<ver.org.openjdk.jmh>1.37</ver.org.openjdk.jmh>

<!-- GeoSPARQL related -->
<ver.kryo>4.0.3</ver.kryo>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need an entry in jena-fuseki2/apache-jena-fuseki/dist/LICENSE
(Anywhere Jena is actually shipping kryo needs to checked)

@@ -42,7 +42,7 @@ public static void init() {
if ( initialized )
return ;
synchronized (initLock) {
if ( initialized ) {
if ( initialized || System.getProperty("jena.geosparql.skip", "false").equalsIgnoreCase("true") ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's going to need documenting!

Copy link
Contributor Author

@Aklakan Aklakan May 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LorenzBuehmann Do you recall the intention of this line?

One reason I can think of is that that jena-geosparql takes long to initialize. IIRC due to setting up a derby database with SIS data. So any command line jar that uses jena-geosparql takes long to load.
The loading time of own java cli tool increased from 200ms to 1.000ms just by adding jena-geosparql.

So a future issue is to change parts of jena-geosparql's eager init to lazy init on first use.
From this persective, the feature to disable jena-geosparql could be removed from this PR and handled in a separate PR. I think the option would not be needed if the init time was improved.

@afs
Copy link
Member

afs commented May 7, 2025

My intention was to go with 2
(jena-fuseki-mod-geosparql is not shared)

I agree this is the way to go.

@afs
Copy link
Member

afs commented May 7, 2025

Where does this leave jena-fuseki-geosparql long term? If jena-fuseki-mod-geosparql integrated into jena-fuseki-server is a replacement, we could drop jena-fuseki-geosparql at Jena6 (late this year, early next).

That loses the command line argument processing (is the geosparql assembler sufficient?) - we could ask users about this.

@Aklakan
Copy link
Contributor Author

Aklakan commented May 9, 2025

According to the jena-fuseki-geosparql documentation it is a bit of a jena-geo toolkit:

  • it adds its own cli options to fuseki, such as setting a spatial index file or whether to enable geosparql query rewriting. There the question is whether it would be possible to write plugins that extend fuseki's CLI. I think the clean/reliable way would be to use the assembler, but having a quick way to get a fuseki running with these features is still useful.

  • some commands add data processing that could also be accomplished with sparql queries. These could be perhaps moved to a geo cli tool akin to arq or riot. Also, there could be an fmod that adds a little geo workbench html site with buttons to carry out these tasks (Validate Geometry Literals, Convert Geo predicates, ...)

  • it adds support for loading data based on a file-to-graph mapping file ("Load Tabular file into dataset"). This could be of interest for the arq/sparql (or similar) tool which executes it as a set of LOAD statements or a general fuseki-client-cli which may use fuseki specific APIs for performance/security reasons. Not sure right now whether e.g. sparql LOAD without a custom Locator could access file:// URLs.

So I think in the long run there should be only one Fuseki, and the features of jena-fuseki-geosparql would be either moved to mods or separate cli tools. But before that happens I guess things just stay as they are(?)

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 4 times, most recently from 782f293 to 21b1b5a Compare May 10, 2025 18:15
@afs afs self-requested a review May 11, 2025 20:50
@afs
Copy link
Member

afs commented May 15, 2025

So I think in the long run there should be only one Fuseki, and the features of jena-fuseki-geosparql would be either moved to mods or separate cli tools. But before that happens I guess things just stay as they are(?)

Fair enough.

Is the geosparql assembler sufficient?

@afs afs added the GeoSPARQL label May 15, 2025
@afs afs self-assigned this May 15, 2025
@afs
Copy link
Member

afs commented May 15, 2025

  • MainPlaygroundFusekiModGeoSparql - this should be removed?
  • Please rebase the PR to current main (the rebase is clean)
  • Documentation (create an issue for it if that helps) and some sentences for the release announcement at some point

There is:

  1. This will need an entry in jena-fuseki2/apache-jena-fuseki/dist/LICENSE
    That is a task on me. It's not a huge task.
    To make sure it does not get lost, I'd rather wait until there is a PR for that.

  2. Noted - System property - InitGeoSPARQL.java

  3. SpatialIndexComputeService - "TODO Raise some HTTP error"
    Probably only would affect the reporting of a problem.

If that is all OK then, with those last items, this is good then from my side - this is good to go.
Morally, approved.
Assigned to me for the license task.

@Aklakan
Copy link
Contributor Author

Aklakan commented May 15, 2025

Two aspects I came across that I should still address:

  • A security issue: "Remove absent graphs from index" (and also "Replace index with selected graphs"): This needs to consider DynamicDatasets . An authenticated user must not be able to remove graphs from the index which for which he is not authorized. However, right now if a user clicks on this button it will remove all the index entries for any graph which he can't see. I think this can be solved by unwrapping the DynamicDataset to get the underlying complete graph list and use it to prevent removal of any graphs present in there.

  • A small one for which I should write a test case: JTS STRtree has an 'isBuilt' flag; and for this reason my STRtreePerGraph structure has such a flag too: If STRtreePerGraph.isBuilt is true then any added STRtree should be built on insert. I should check whether serialization w.r.t. flag works properly; would be ugly if the index format broke for this reason.

  • "TODO Raise some HTTP error"
    I also had a look into the existing Fuseki code (e.g. SPARQL_QueryGeneral) to see how to do the error handling properly - I should update this part to something like ServletOps.errorBadRequest("Failed to load URL " + uri);

So I should be able to resolve this issues this weekend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spatial Index improvements (index-per-graph + kryo)
2 participants