Investigation: statement pattern compression #38

@Ostrzyciel


Note: This is yet another feature that may not be a great idea.

Investigate how and if to implement pattern-based compression.

In such a compression scheme, we'd have a dictionary of patterns, where each pattern is a sequence of RDF statements. These statements would have some terms defined, and some omitted (replaced with a variable). For example:

DefinePattern(
  id = 5,
  triples = (
    Triple(Iri1, Iri2, VAR),
    Triple(Iri3, VAR, VAR)
  )
)

Then, we could materialize the pattern in the stream like this, with the listed terms filling the VAR slots in order:

UsePattern(
  id = 5,
  terms = (Literal1, Iri4, Literal2)
)

Which would be equivalent to:

Triple(Iri1, Iri2, Literal1)
Triple(Iri3, Iri4, Literal2)
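
To make the expansion step concrete, here is a minimal Python sketch of how a decoder might materialize a pattern. Everything in it is an assumption for illustration: the `Pattern` class, the `expand` function, and the use of `None` to stand for VAR are hypothetical names, not part of the Jelly protocol, and terms are modeled as plain strings.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Hypothetical model: a term is a plain string, and None marks a VAR slot.
Slot = Optional[str]
Template = Tuple[Slot, Slot, Slot]
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Pattern:
    id: int
    triples: Tuple[Template, ...]


def expand(pattern: Pattern, terms: Sequence[str]) -> List[Triple]:
    """Fill the VAR slots of the pattern, left to right, with the given terms."""
    n_vars = sum(slot is None for template in pattern.triples for slot in template)
    if len(terms) != n_vars:
        raise ValueError(f"pattern {pattern.id} needs {n_vars} terms, got {len(terms)}")
    remaining = iter(terms)
    return [
        tuple(next(remaining) if slot is None else slot for slot in template)
        for template in pattern.triples
    ]


# The example from this issue: pattern 5 with three VAR slots.
pattern_5 = Pattern(id=5, triples=(("Iri1", "Iri2", None), ("Iri3", None, None)))
expanded = expand(pattern_5, ["Literal1", "Iri4", "Literal2"])
```

With the inputs above, `expanded` holds the same two triples as the equivalent form shown earlier. Checking the term count up front keeps a malformed `UsePattern` from silently producing partial triples.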

This would be especially useful in scenarios where there are regular statement patterns that repeat often, like in IoT messages.
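
On the encoder side, the regular case could be handled by checking whether a run of outgoing triples fits a known pattern and, if so, emitting only the variable terms. The following is a rough sketch under the same assumptions as before (string terms, `None` for VAR); `match_pattern` is a hypothetical helper name, and a real encoder would also need a strategy for choosing which patterns to try.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical model: a term is a plain string, and None marks a VAR slot.
Slot = Optional[str]
Template = Tuple[Slot, Slot, Slot]
Triple = Tuple[str, str, str]


def match_pattern(
    templates: Sequence[Template], triples: Sequence[Triple]
) -> Optional[List[str]]:
    """Return the terms for the VAR slots if the triples fit the pattern, else None."""
    if len(templates) != len(triples):
        return None
    terms: List[str] = []
    for template, triple in zip(templates, triples):
        for slot, term in zip(template, triple):
            if slot is None:
                terms.append(term)   # variable slot: transmit this term
            elif slot != term:
                return None          # fixed slot disagrees: pattern does not apply
    return terms


templates = (("Iri1", "Iri2", None), ("Iri3", None, None))
message = [("Iri1", "Iri2", "Literal1"), ("Iri3", "Iri4", "Literal2")]
terms = match_pattern(templates, message)
other = match_pattern(templates, [("IriX", "Iri2", "Literal1"), ("Iri3", "Iri4", "Literal2")])
```

Callers should test the result with `is None`, since a pattern with no VAR slots matches with an empty term list.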

We could employ both a static (pre-shared) and a dynamic dictionary of these patterns, using an approach like in #36. Pre-shared pattern dictionaries together with pre-shared string dictionaries would be very powerful and would improve compression efficiency greatly in IoT scenarios.
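
One way the static and dynamic dictionaries might coexist is sketched below. This is only an illustration: the `PatternDictionary` class and its methods are hypothetical, and the rule that stream-defined patterns may not overwrite pre-shared IDs is an assumption made here, not something taken from #36.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical model: a term is a plain string, and None marks a VAR slot.
Template = Tuple[Optional[str], Optional[str], Optional[str]]
Templates = Tuple[Template, ...]


class PatternDictionary:
    """Pre-shared (static) patterns plus patterns defined within the stream (dynamic)."""

    def __init__(self, static: Mapping[int, Templates]):
        self._static: Dict[int, Templates] = dict(static)
        self._dynamic: Dict[int, Templates] = {}

    def define(self, pattern_id: int, templates: Templates) -> None:
        # Assumption: pre-shared IDs are reserved, so the stream cannot redefine them.
        if pattern_id in self._static:
            raise ValueError(f"pattern id {pattern_id} is reserved by the static dictionary")
        self._dynamic[pattern_id] = templates

    def lookup(self, pattern_id: int) -> Templates:
        if pattern_id in self._dynamic:
            return self._dynamic[pattern_id]
        return self._static[pattern_id]  # raises KeyError for unknown IDs


# Pattern 1 is pre-shared out of band; pattern 5 arrives via DefinePattern.
patterns = PatternDictionary(static={1: (("Sensor", "hasValue", None),)})
patterns.define(5, (("Iri1", "Iri2", None), ("Iri3", None, None)))
```

With a fully pre-shared dictionary, the stream would never need to carry a `DefinePattern` at all, which is where the savings for IoT messages would come from.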

The pre-shared patterns could possibly also contain stream options, so that we can reduce the amount of data transmitted to the bare minimum.

Downsides and alternatives

Employing such aggressive compression in all streams would be prohibitively expensive, slowing down serialization. Realistically, the only scenarios where this would be useful are those where we know the structure of the triples up-front, or can spend the time to analyze the data in-depth, like in some IoT applications.

I see two main alternatives:

  • For general applications, diff-based compression (e.g., based on #11, "Proposal: Jelly extension for RDF Patch") would be way simpler to implement and more general.
  • For optimized IoT scenarios, we could seriously look into making a spin-off version of Jelly that is specifically optimized for pre-shared dictionaries, pre-shared patterns, and super-efficient literal encoding. Something like S-HDT or RDF EXI (though I am not aware of any public implementation of either...). The question is: do we need this? Should Jelly try to serve this use case, or is there a better way? This seems to start overlapping a bit with the stated goals of CBOR-LD, and that's a very, very different beast.


Labels: new protocol feature (Discussion about a new feature in the Jelly protocol)
