Investigation: look-back term encoding

***Note:** this is an idea that I'm not very optimistic about, but I'm putting it here regardless for future reference.*

Investigate whether, and how, to introduce RDF term compression based on look-back references.

Something along these lines, where `Back(-6)` points at the term six positions earlier in the stream:

```
Triple(Iri1, Iri2, Literal1)
Triple(Iri3, Iri4, Iri5)
Triple(Iri6, Iri7, Back(-6)) # translates to Triple(Iri6, Iri7, Literal1)
```

This would be very useful for dealing with blank nodes and literals (see below). Unfortunately, the implementation would not be trivial.
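To make the example above concrete, here is a minimal Python sketch of how a decoder could expand such markers. The `Back` class and the `resolve` function are hypothetical names, not part of any existing implementation. The sketch assumes that offsets count individual terms (not triples) and that a resolved back-reference itself occupies one position in the stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Back:
    """Hypothetical marker: the term `offset` positions earlier (offset < 0)."""
    offset: int


def resolve(triples):
    """Expand Back(...) markers, counting every term as one stream position."""
    seen = []  # all terms decoded so far, in stream order
    out = []
    for triple in triples:
        decoded = []
        for term in triple:
            if isinstance(term, Back):
                # A valid reference must point at a term that was already seen.
                if not -len(seen) <= term.offset < 0:
                    raise ValueError(f"invalid back-reference: {term.offset}")
                term = seen[len(seen) + term.offset]
            seen.append(term)
            decoded.append(term)
        out.append(tuple(decoded))
    return out


stream = [
    ("Iri1", "Iri2", "Literal1"),
    ("Iri3", "Iri4", "Iri5"),
    ("Iri6", "Iri7", Back(-6)),
]
decoded = resolve(stream)
```

This version keeps every term ever seen, which is only acceptable for illustration. A real decoder would need a bounded table.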

### Motivation

**Blank nodes**

Currently, we (in practice) simply pass the RDF store's internal blank node identifiers through to the serialization without any further processing.

This works, and it has the advantage that the serializer doesn't have to worry *at all* about blank node semantics or identifiers. However, it also takes up a lot of space, because these identifiers are usually fairly long hashes (often base16-encoded as well, which makes it even worse).

One alternative would be to use numerical IDs for blank nodes, but that would force the serializer to actively "think" about what the blank nodes mean, store them in a dedicated hash table, etc.

**Literals** – there is currently no way to deal with repeating literals unless they occur in consecutive triples, in which case we can use repeated term compression. Repeated literals are common in, e.g., dates, labels, and frequent numerical values.

**Quoted triples** – RDF 1.2 will completely change that, but in RDF-star the same applies to quoted triples: the only way to compress them is with repeated term compression.

### Implementation

Of course, we'd have to decide how to encode these back-references in an efficient manner.
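As a rough way to reason about the cost, the sketch below assumes that a back-reference distance would be sent as an unsigned protobuf varint (7 payload bits per byte). Nothing about the wire format has been decided, so this is only one possible option. The helper `varint_size` is a hypothetical name.

```python
def varint_size(value: int) -> int:
    """Number of bytes an unsigned protobuf varint needs for `value`."""
    if value < 0:
        raise ValueError("only non-negative values are supported")
    size = 1
    while value >= 0x80:  # each byte carries 7 bits of payload
        value >>= 7
        size += 1
    return size


# Cost of the distance alone, excluding the field tag.
sizes = {distance: varint_size(distance) for distance in (6, 127, 128, 16383, 16384)}
```

Under this assumption, any distance below 128 costs a single byte and any distance below 16384 costs two, in addition to the field tag.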

There would also have to be a limit on how far back into the stream these references could reach, probably specified in the stream options. Still, it would add a new responsibility for the decoder, which would have to keep a rolling table of RDF terms. On the encoder side this should not be *too* hard if the encoder already uses node caches (Jelly-JVM does). However, the encoder would also need a way to track when each term last occurred, which is non-trivial.

Encoding IRIs in this way would not lead to large savings. It's entirely possible that it would actually make the compression efficiency worse, because IRIs are already compressed so well.

Finally, it would require deep changes in the proto, probably introducing new messages for `RdfTriple` and `RdfQuad`. But I guess we can live with that.

### Summary

This would be really complicated. Seriously consider a better encoding scheme for blank nodes as an alternative solution.
