Proposal draft: pre-shared dictionaries #36

@Ostrzyciel

Compression methods like Brotli rely on a large, pre-shared dictionary of commonly occurring text patterns. This lets them achieve much better compression than methods that have to communicate the dictionary as part of the stream.

This is also apparently the approach of CBOR-LD: https://docs.google.com/presentation/d/1ksh-gUdjJJwDpdleasvs9aRXEmeRvqhkVWqeitx5ZAE/

Jelly has a much broader range of intended use cases, so copying the CBOR-LD design as-is makes little sense. In general, I think it would be sensible to consider two main use cases:

  • General-purpose RDF processing (like we have on the open Web), where we'd use a single, pre-shared dictionary that is a part of the Jelly spec.
  • Specialized use cases (e.g., industrial systems), where the pre-shared dictionary can be fine-tuned to the use case and would be defined by the end user.

I'd suggest something along these lines:

  • Include a field in the header that identifies a custom pre-shared dictionary by an integer. If the field is not set, the default dictionary would be assumed.
    • The default dictionary version would be strictly tied to the protocol version, removing the need to specify it separately.
    • If the dynamic dictionary and the pre-shared dictionary are to use the same ID space, I guess it would make sense to allow disabling the pre-shared dictionary entirely, so that the dynamic IDs start from lower values.
  • Add new fields to RdfIri and RdfDatatype for referencing the pre-shared dictionary, OR map the dictionary to a part of the lookup ID space (probably the lowest values?).
    • The latter approach will result in pretty big IDs for dynamic entries, which isn't great. The main drawback of the first approach is the effect it will have on 0-compression. I'll have to consider this carefully.
    • We could also consider one-integer references to the most common IRIs. This can be done easily with a custom dictionary that has only the name table. I'm not sure if it's worth introducing an additional ID space for such references in the global dictionary. It would add A LOT of complexity.
  • The pre-shared dictionaries would cover all standardized prefixes and datatypes, along with the most common names, custom datatypes, and prefixes.
    • In Brotli they did that by mining a large dump of web data. Could we do something like this with LOD? I think someone recently did a large-scale LOD dump and even ran some analytics on it. TODO: dig it up.
  • We could have a proto for sending pre-shared dictionaries to interested parties. See: Proposal: proto for stream "presets" #17
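
To make the ID-space trade-off above a bit more concrete, here is a minimal Python sketch of the "shared ID space" variant, where the pre-shared entries take the lowest IDs and dynamic entries come after them. Everything in it is an assumption for illustration only: the names `resolve_preshared`, `SharedSpaceLookup`, `DEFAULT_DICT` and `CUSTOM_DICTS` are hypothetical, the dictionary contents are placeholders, IDs are assumed to start at 1 (leaving 0 free for 0-compression), IDs are assumed to be encoded as protobuf varints, and the dynamic part is append-only, unlike a real fixed-size lookup.

```python
from typing import Optional

def varint_len(n: int) -> int:
    """Number of bytes a non-negative integer takes as a protobuf varint."""
    if n < 0:
        raise ValueError("varint_len expects a non-negative integer")
    length = 1
    while n >= 0x80:  # 7 payload bits per byte
        n >>= 7
        length += 1
    return length

# Hypothetical dictionaries. The default one would be fixed by the protocol
# version; custom ones would be defined by the end user and identified by
# an integer in the header.
DEFAULT_DICT = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://www.w3.org/2001/XMLSchema#",
]
CUSTOM_DICTS = {1: ["https://example.org/factory/"]}

def resolve_preshared(dict_id: Optional[int], disabled: bool = False) -> list:
    """Pick the pre-shared dictionary based on (assumed) header fields."""
    if disabled:
        return []                 # dynamic IDs get the whole ID space
    if dict_id is None:
        return DEFAULT_DICT       # no ID in the header: use the default
    return CUSTOM_DICTS[dict_id]  # user-defined dictionary

class SharedSpaceLookup:
    """Pre-shared entries occupy IDs 1..N, dynamic entries start at N + 1."""

    def __init__(self, preshared: list):
        self._entries = list(preshared)
        self.first_dynamic_id = len(preshared) + 1

    def add_dynamic(self, value: str) -> int:
        # Simplification: append-only, no eviction or fixed lookup size.
        self._entries.append(value)
        return len(self._entries)

    def get(self, entry_id: int) -> str:
        if entry_id < 1:
            raise KeyError("ID 0 is reserved in this sketch")
        return self._entries[entry_id - 1]
```

Under these assumptions, a pre-shared dictionary with 127 or more entries pushes every dynamic ID to at least two varint bytes (`varint_len(128) == 2`), while a stream with the pre-shared dictionary disabled starts its dynamic IDs at 1. That is the cost the "disable it entirely" option would avoid.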

Labels: new protocol feature (discussion about a new feature in the Jelly protocol)