Description
Note: this is an idea that I'm not very optimistic about, but I'm putting it here regardless for future reference.
Investigate whether, and how, to introduce RDF term compression based on look-back references.
Something along these lines:
Triple(Iri1, Iri2, Literal1)
Triple(Iri3, Iri4, Iri5)
Triple(Iri6, Iri7, Back(-6)) # translates to Triple(Iri6, Iri7, Literal1)
This would be very useful for dealing with blank nodes and literals (see below). Unfortunately, the implementation would not be trivial.
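To make the example above concrete, here is a minimal Python sketch of how a decoder might resolve such references. It assumes that `Back(-n)` counts `n` terms back in the flattened stream of subject, predicate and object terms, which is what the example implies (the third term, `Literal1`, is six positions before the ninth). `Back`, `decode` and `max_lookback` are hypothetical names for illustration, not part of Jelly.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Back:
    """Hypothetical look-back reference; offset is negative (e.g. -6)."""
    offset: int


def decode(triples, max_lookback=1024):
    """Resolve Back references against a rolling window of recent terms."""
    # The window holds the last `max_lookback` terms, already resolved.
    window = deque(maxlen=max_lookback)
    decoded = []
    for triple in triples:
        row = []
        for term in triple:
            if isinstance(term, Back):
                # The reference must point at a term still in the window.
                if not -len(window) <= term.offset <= -1:
                    raise ValueError(f"reference out of range: {term}")
                term = window[term.offset]
            # A resolved reference occupies a position like any other term.
            window.append(term)
            row.append(term)
        decoded.append(tuple(row))
    return decoded


stream = [
    ("Iri1", "Iri2", "Literal1"),
    ("Iri3", "Iri4", "Iri5"),
    ("Iri6", "Iri7", Back(-6)),
]
assert decode(stream)[2] == ("Iri6", "Iri7", "Literal1")
```

Whether offsets should count terms, triples, or only terms of one kind is an open design question; this sketch just picks the interpretation that matches the example.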
Motivation
Blank nodes
In practice, we currently pass the RDF store's internal blank node identifiers straight through to the serialization, without any further processing.
This works, and it has the advantage that the serializer doesn't have to worry at all about blank node semantics or identifiers. However, it also takes up a lot of space, because these identifiers are usually fairly long hashes (often base16-encoded, which makes it even worse).
One alternative would be to use numerical IDs for blank nodes, but that would force the serializer to actively "think" about what the blank nodes mean, store them in some dedicated hashtable, etc.
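For reference, the numerical-ID alternative could look roughly like the Python sketch below. The `make_bnode_renamer` helper is hypothetical, and the sketch ignores the question of when the table can be reset or shrunk, which is exactly the extra bookkeeping mentioned above.

```python
def make_bnode_renamer():
    """Return a function mapping store-internal blank node IDs to small integers."""
    ids = {}  # store ID -> compact numeric ID, grows with every new blank node

    def rename(store_id):
        # First occurrence gets the next free number; later ones reuse it.
        return ids.setdefault(store_id, len(ids))

    return rename


rename = make_bnode_renamer()
assert rename("b4f1c09e7a") == 0
assert rename("77d2e5a013") == 1
assert rename("b4f1c09e7a") == 0
```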
Literals – there is currently no way to compress repeated literals unless they occur in consecutive triples, in which case repeated term compression applies. Repeated literals are common, e.g., dates, labels, frequent numerical values, etc.
Quoted triples – RDF 1.2 will completely change this, but in RDF-star the same applies to quoted triples: the only way to compress them is with repeated term compression.
Implementation
Of course, we'd have to decide how to encode these back-references in an efficient manner.
There would also have to be a limit on how far back into the stream these references could reach, probably specified in the stream options. Still, it would add a new responsibility for the decoder, which would have to keep a rolling table of RDF terms. On the encoder side this should not be too hard, if the encoder already uses node caches (Jelly-JVM does). But it would also require a way to track when each term last occurred, which is non-trivial.
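As a rough illustration of the encoder-side bookkeeping, the Python sketch below tracks the last position of each term and forgets terms that have fallen outside the look-back limit. It assumes offsets count individual terms in the flattened stream and that terms are hashable; `Back`, `encode`, `max_lookback` and `eligible` are hypothetical names, and this is not how Jelly-JVM's node caches work.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class Back:
    """Hypothetical look-back reference; offset is negative."""
    offset: int


def encode(triples, max_lookback=1024, eligible=lambda term: True):
    """Replace repeated terms with Back references within a bounded window."""
    # term -> position of its last occurrence, ordered oldest first
    last_seen = OrderedDict()
    pos = 0  # position of the current term in the flattened stream
    encoded = []
    for triple in triples:
        row = []
        for term in triple:
            # Evict terms the decoder can no longer reach.
            while last_seen:
                oldest = next(iter(last_seen))
                if pos - last_seen[oldest] <= max_lookback:
                    break
                del last_seen[oldest]
            prev = last_seen.get(term) if eligible(term) else None
            row.append(term if prev is None else Back(prev - pos))
            if eligible(term):
                # Record the new last occurrence and keep the ordering.
                last_seen[term] = pos
                last_seen.move_to_end(term)
            pos += 1
        encoded.append(tuple(row))
    return encoded


triples = [
    ("Iri1", "Iri2", "Literal1"),
    ("Iri3", "Iri4", "Iri5"),
    ("Iri6", "Iri7", "Literal1"),
]
assert encode(triples)[2] == ("Iri6", "Iri7", Back(-6))
# With a shorter window the literal is out of reach and is written in full.
assert encode(triples, max_lookback=5)[2] == ("Iri6", "Iri7", "Literal1")
```

The `eligible` predicate is there so that back-references can be restricted to, say, literals and blank nodes, leaving IRIs to the existing compression.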
Encoding IRIs in this way would not lead to large savings. It's entirely possible that this will actually make the compression efficiency worse, because IRIs are so well-compressed right now.
Finally, it would require deep changes in the proto, probably introducing new messages for RdfTriple and RdfQuad. But I guess we can live with that.
Summary
This would be really complicated. Seriously consider a better encoding scheme for blank nodes as an alternative solution.