IPIP-305: CIDv2 - Tagged Pointers #305
@@ -0,0 +1,130 @@
# IPIP 0000: CIDv2 - Tagged Pointers

<!-- IPIP number will be assigned by an editor. When opening a pull request to
submit your IPIP, please use number 0000 and an abbreviated title in the filename,
`0000-draft-title-abbrev.md`. -->

- Start Date: 2022-08-05
- Related Issues:
  - https://github.com/multiformats/cid/pull/49
  - [Content Addressing Lurk Data](https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7)

## Summary

<!--One paragraph explanation of the IPIP.-->
Create a new [CID](https://github.com/multiformats/cid) version (CIDv2, informally "Tagged Content Identifiers") which combines a data Multicodec-Multihash pair (the pointer) and a metadata Multicodec-Multihash pair (the tag) to create content-addresses with expressive contexts.

## Motivation

Currently, CIDv1 data is described by a multicodec content type. However, this is meant to describe the overall format of the serialized data e.g. the `dag-cbor` IPLD encoding, and not more specific information such as a data schema or type. For example, it can be useful to have raw IPLD data contextualized by its [IPLD schema](https://ipld.io/docs/schemas/intro/). Since multicodecs are limited to 9 bytes by the [unsigned-varint spec](https://github.com/multiformats/unsigned-varint#practical-maximum-of-9-bytes-for-security), the available codec space is generally too small to encode such metadata.

## Detailed design

Our solution is a new CID version which contains two multicodec-multihash pairs, one pair for data and another for metadata. The metadata multicodec would be able to concisely describe a space of metadata tags where the specific tag would then be further specified by the multihash. This could be implemented as follows in Rust:

```rust
pub struct CidV2<const S: usize, const M: usize> {
    /// The data multicodec
    data_codec: u64,
    /// The data multihash
    data_hash: Multihash<S>,
    /// The metadata codec
    meta_codec: u64,
    /// The metadata multihash
    meta_hash: Multihash<M>
}
```

It would serialize as follows:

```
<cidv2> ::= <multicodec-cidv2><multicodec-data-content-type><multihash-data><multicodec-metadata-content-type><multihash-metadata>
```

with a multibase prefix when represented in text.

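As a rough illustration of the layout above, the sketch below concatenates the five fields as unsigned varints and length-prefixed digests. The `Multihash` and `CidV2` types here are simplified stand-ins written for this example (they are not the rust-cid or multihash crate types), and the sketch assumes that `<multicodec-cidv2>` is the single varint `0x02`. Multibase text encoding is left out.

```rust
/// Append `value` as an unsigned varint: 7 bits per byte, least significant
/// group first, with the high bit set on every byte except the last.
pub fn put_uvarint(value: u64, out: &mut Vec<u8>) {
    let mut v = value;
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

/// Simplified multihash (hypothetical type for this sketch):
/// a hash-function code plus the raw digest bytes.
pub struct Multihash {
    pub code: u64,
    pub digest: Vec<u8>,
}

impl Multihash {
    /// Write `<code><digest length><digest>`.
    fn write(&self, out: &mut Vec<u8>) {
        put_uvarint(self.code, out);
        put_uvarint(self.digest.len() as u64, out);
        out.extend_from_slice(&self.digest);
    }
}

/// Simplified CIDv2 (hypothetical type for this sketch).
pub struct CidV2 {
    pub data_codec: u64,
    pub data_hash: Multihash,
    pub meta_codec: u64,
    pub meta_hash: Multihash,
}

impl CidV2 {
    /// Binary form: version, data codec, data multihash,
    /// metadata codec, metadata multihash.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        put_uvarint(0x02, &mut out); // assumed CIDv2 version code
        put_uvarint(self.data_codec, &mut out);
        self.data_hash.write(&mut out);
        put_uvarint(self.meta_codec, &mut out);
        self.meta_hash.write(&mut out);
        out
    }
}
```

Under these assumptions, a `CidV2` with `data_codec: 0x71`, a 32-byte sha2-256 digest (code `0x12`) and a 4-byte identity multihash (code `0x00`) as metadata serializes to 43 bytes.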
For example, suppose you want a CID which points to a piece of IPLD data and its [IPLD schema](https://ipld.io/docs/schemas/). Let's say you have the schema `Trit`, with a particular integer representation:

```
type Trit enum {
  | True ("1")
  | False ("2")
  | Unknown ("0")
} representation int
```

which corresponds to the IPLD data: `Ipld::Num(1)`, `Ipld::Num(2)`, `Ipld::Num(0)`.

While you could in principle propose a new multicodec for `Trit`, this might not be suitable if `Trit` is a temporary or ephemeral structure, or if you have a large number of different schemas (for instance, in Lurk-lang's content-addressing we would need to reserve 16 bits of the multicodec table, or 2^16 distinct multicodecs).

However, since IPLD schemas can be [represented as JSON](https://ipld.io/specs/schemas/#dsl-vs-dmt) and hashed, with a CIDv2 we could reserve a single IPLD schema multicodec, along with the codec for the data representation (such as dag-cbor).
We could then use the above CIDv2 definition to create a pointer to any Schema+Data pair:

```
CidV2 {
  data_codec: 0x71,
  data_hash: <data_multihash>,
  meta_codec: 0x3e7ada7a,
  meta_hash: <schema-multihash>
}
```

Comment on lines +67 to +68

This looks great, but what happens if my schema starts becoming large? For example, say I have a 3MiB schema. Now this schema exceeds the 2MiB block limit imposed by many IPFS implementations and that 3MiB schema won't be transferrable. Maybe a 3MiB schema seems excessive, but people may go down this road for other reasons (e.g. I want my tag to be [...]). I could start playing around with a few levels of workarounds here such as:

This seems to indicate that putting type information in the CID this way is going to be problematic because types may themselves have types and so we may want to deal with them the same way we deal with the data itself (e.g. allowing the metadata to be a CIDv2 as well, or one of the other proposals).

It doesn't sound like this is a CIDv2 specific issue. I can store a 3MiB dag-cbor IPLD object on IPFS and generate a CIDv1 with its sha256 multihash, right? In terms of transport, I think CIDv2 will just behave like a pair of CIDv1s.

No, it is a specific issue with this CIDv2 proposal. Nowhere else do we use codec identifiers as data types, we use them as deserialization types. As a result there is no notion of a data type changing or becoming too big, that becomes an application layer concern. For example, an object can represent a UnixFS directory whether it is a single directory block or the root of a sharded HAMT. By using the code as a nominative type rather than a description of how to deserialize the data you've navigated into a position where there's nowhere to identify both the type of the data and how to get it as a multiblock data structure without another level of indirection. However, that level of indirection could similarly be used instead of CIDv2 entirely (see alternative proposal).

Using a codec identifier as a datatype isn't an essential (or actually even intended) part of this proposal, so I'm absolutely happy to make any changes you suggest to better align with how multicodecs should be used. For example, replacing the [...]

And thus we could then create an unambiguous hash to `Trit::True` with

```
CidV2 {
  data_codec: 0x71,
  data_hash: Ipld::Num(1).hash(),
  meta_hash: trit_schema.hash(),
  meta_codec: 0x3e7ada7a,
}
```
without having to reserve anything new on the multicodec table.

Modified spec file contains the following changes:
- [Added a definition for CIDv2](https://github.com/yatima-inc/cid/blob/master/README.md)
- [Added an implementation for CIDv2 to rust-cid](https://github.com/yatima-inc/rust-cid/tree/cid-v2)

## Test fixtures

| version | data multicodec | data multihash | metadata multicodec | metadata multihash | base32lower CIDv2 |
|-|-|-|-|-|-|
| cidv2 | raw | sha2-256-256-f3a6eb0790f39ac87c94f3856b2dd2c5d110e6811602261a9a923d3bb23adc8b7 | raw | sha2-256-256-fea3bd73e2b506e00527232b3ed743c066da83a8e3066f62a71e75eb9b4aa1db6 | bajkreib2n2yhsdzzvsd4stzyk2zn2lc5cehgqelaejq2tkjd2o5shloiw5kreihkhplt4k2qnyafe4rswpwxipagnwudvdrqm33cu4phl243jkq5wy |
| cidv2 | raw | sha2-256-256-f3a6eb0790f39ac87c94f3856b2dd2c5d110e6811602261a9a923d3bb23adc8b7 | identity | identity-4-6d657461 | bajkreib2n2yhsdzzvsd4stzyk2zn2lc5cehgqelaejq2tkjd2o5shloiw4aaabdnmv2gc |
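The fixtures above can be taken apart by hand with a small decoder. The following is a minimal sketch, not the rust-cid implementation: the names `base32_decode`, `read_uvarint`, `read_multihash` and `parse_cidv2` are hypothetical helpers written for this example, and it assumes the text form is the multibase prefix `b` followed by unpadded lowercase RFC 4648 base32, with a leading version varint of `0x02`.

```rust
/// Decode unpadded, lowercase RFC 4648 base32. Trailing partial bits are dropped.
fn base32_decode(text: &str) -> Option<Vec<u8>> {
    const ALPHABET: &[u8; 32] = b"abcdefghijklmnopqrstuvwxyz234567";
    let mut out = Vec::new();
    let mut buffer: u32 = 0;
    let mut bits: u32 = 0;
    for c in text.bytes() {
        let value = ALPHABET.iter().position(|&a| a == c)? as u32;
        buffer = (buffer << 5) | value;
        bits += 5;
        if bits >= 8 {
            bits -= 8;
            out.push((buffer >> bits) as u8);
            buffer &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Some(out)
}

/// Read one unsigned varint (at most 9 bytes) starting at `*pos`.
fn read_uvarint(bytes: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    for shift in (0..63u32).step_by(7) {
        let byte = *bytes.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
    }
    None
}

/// Read `<code><length><digest>` and return (code, digest).
fn read_multihash(bytes: &[u8], pos: &mut usize) -> Option<(u64, Vec<u8>)> {
    let code = read_uvarint(bytes, pos)?;
    let len = read_uvarint(bytes, pos)? as usize;
    let end = (*pos).checked_add(len)?;
    let digest = bytes.get(*pos..end)?.to_vec();
    *pos = end;
    Some((code, digest))
}

/// Parsed CIDv2 fields; each multihash is (hash function code, digest).
pub struct CidV2 {
    pub data_codec: u64,
    pub data_hash: (u64, Vec<u8>),
    pub meta_codec: u64,
    pub meta_hash: (u64, Vec<u8>),
}

/// Parse a base32lower CIDv2 string; returns `None` on any malformed input.
pub fn parse_cidv2(text: &str) -> Option<CidV2> {
    let bytes = base32_decode(text.strip_prefix('b')?)?;
    let mut pos = 0;
    if read_uvarint(&bytes, &mut pos)? != 0x02 {
        return None; // not a version-2 CID
    }
    let data_codec = read_uvarint(&bytes, &mut pos)?;
    let data_hash = read_multihash(&bytes, &mut pos)?;
    let meta_codec = read_uvarint(&bytes, &mut pos)?;
    let meta_hash = read_multihash(&bytes, &mut pos)?;
    if pos != bytes.len() {
        return None; // trailing bytes
    }
    Some(CidV2 { data_codec, data_hash, meta_codec, meta_hash })
}
```

Applied to the second fixture, this yields data codec `0x55` (raw), a 32-byte sha2-256 digest (code `0x12`), metadata codec `0x00` (identity), and an identity multihash whose 4-byte digest is the ASCII text `meta`.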
## Design rationale

This design was motivated by the desire to encode additional metadata into CIDs from a number of projects, such as [Yatima-lang](https://github.com/yatima-inc/yatima-lang), [Lurk-lang](https://github.com/lurk-lang/lurk-rs), DAG House, and IPNS-Link (see https://github.com/multiformats/cid/pull/49).

In the case of Lurk, a tagged hash-pointer called `ScalarPtr` contains a 16-bit tag describing the type of node in the scalar graph of language terms. This tag must be included in the CID somehow in order to retrieve individual nodes without re-traversing the entire graph, so unless Lurk reserves each multicodec table entry beginning with a given 16-bit prefix (e.g. `0xC0DE`) it would be difficult if not impossible to have a CID containing both the Lurk data and its associated tag. If we then think about every other protocol which needs to include similar tags, types, or pointers in addition to their data, the multicodec table quickly becomes saturated with hundreds of entries for each application and runs out of 9-byte space.

### User benefit

Having arbitrary-length CID metadata allows the data to be fully self-describing and abstracts application-specific interpretation away into the metadata CID.

### Compatibility

I don't think this indicates the scope of what in the ecosystem will be affected by this change. The current text makes it appear as though introducing this new version of CID will be fairly trivial when that's not quite the case. Some example ramifications:

Many of these are just the cost of doing upgrades in general, or the cost of adding metadata to links, but we should accumulate these and know what we're getting into here.

Yes, but they would error cleanly, since afaiu parsers already have to match on the version varint. But I don't think it's a particularly complicated change to add a case for version 2, as I did in multiformats/rust-cid#123

The most common CIDv2 sizes will probably be pairs of 256-bit or 512-bit hashes, which are roughly the same sizes as a 512-bit or 1024-bit CIDv1, which should be nearly universally supported.

Unfortunately not. The base36 encoding of a SHA2-512 raw CID is too long to fit into a URL subdomain. e.g. https://cid.ipfs.tech/#kf1siqqaod24wzk1b0jwakpjxj8z9xaqxwh56nnc267oznfqrm8cc0w0f36g6ir7zb1tuso6ch7kg3at9o6bnr8lm34hty32o1l0ljycu is 105 characters which is greater than the 63 character DNS label limit.

For backwards compatibility, the existing CIDv2 codec `0x02` could be used to allow interpretation by legacy CIDv1 application logic, e.g.
```
CidV1 { multicodec: 0x02, hash: <identity-multihash-of-cidv2-serialization> }
```

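One possible reading of this wrapping, sketched in Rust: the serialized CIDv2 bytes are carried verbatim as the digest of an identity multihash (code `0x00`) inside a CIDv1 whose codec is `0x02`. The function name `wrap_in_cidv1` is hypothetical, and the exact byte layout is an assumption drawn from the one-line description above, not from a finished spec.

```rust
/// Append `value` as an unsigned varint (7 bits per byte, low group first).
fn put_uvarint(value: u64, out: &mut Vec<u8>) {
    let mut v = value;
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

/// Wrap already-serialized CIDv2 bytes in a CIDv1 (hypothetical helper):
/// `<0x01><0x02><identity code 0x00><length><cidv2 bytes>`.
pub fn wrap_in_cidv1(cidv2_bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    put_uvarint(0x01, &mut out); // CID version 1
    put_uvarint(0x02, &mut out); // multicodec: cidv2
    put_uvarint(0x00, &mut out); // multihash function: identity
    put_uvarint(cidv2_bytes.len() as u64, &mut out); // digest length
    out.extend_from_slice(cidv2_bytes); // the "digest" is the CIDv2 itself
    out
}
```

Because the identity multihash stores its input unchanged, a CIDv2-aware reader can recover the original bytes by skipping the header, while a legacy reader only sees an ordinary CIDv1 with an unfamiliar codec.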
In the canonical CIDv2 form, the data comes before the metadata because a legacy CIDv1 parser can choose to keep only the former and discard the latter.

### Security

There is likely some increased memory overhead from supporting double-wide CIDs, but this should not be significant when comparing CIDv2s with two 256-bit multihashes against CIDv1s with a 512-bit multihash.

The proposal is also designed to be purely opt-in and backwards compatible with existing implementations. That said, some work may be required to ensure that implementations that do not wish to support CIDv2 can either read a CIDv2 as if it were a CIDv1 (and discard the trailing metadata), or error on the CIDv2 entirely.

### Alternatives

What would happen if instead of a CIDv2 you just used a CIDv1 that looked something like [...] This could be encoded as a CIDv1 in DAG-CBOR, or using any other format you wanted. Some advantages:

Some disadvantages:

You mean creating IPLD lists or objects and then hashing them? This works fine for some cases, but not for others since it requires that you have to traverse the hash. In the write-up I did for @vmx I go into some detail about why for Lurk we need to have the metadata tags in the pointers themselves: https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7

For large metadata, I think having a hash of the metadata is unavoidable. The advantage of this CIDv2 proposal though is that since a CIDv2 is isomorphic to a pair of CIDv1s, you can store your metadata and data in the same content-addressed store with self-describing keys. We do this in Yatima where we have large data and metadata trees for program ASTs: https://github.com/yatima-inc/yatima-lang/blob/35f868ab05a4059690e6da9db2e5c4419537fcd0/Yatima/Datatypes/Cid.lean#L23 So this proposal supports both large metadata (like Yatima's full metadata CIDs) and small metadata (like Lurk's 16-bit tags).

I think what would make sense if this proposal is adopted is to allocate a single metadata multicodec for each application, whether that's IPLD Schema, Lurk, Yatima, etc., and then each application would have its own logic of what its own metadata means. E.g.

This has a similar effect as allocating ranges in the multicodec table, but without the centralized overhead.

I read through the writeup, but still don't understand. What's the problem that you run into if instead of something like

you had

It seems like the bytes would be almost the same, and any code working with lurk data would already know how to do the conversion of the CIDv1 into two different objects and the use of identity multihashes saves you from doing any repeated hashing.

The key difference is that the lurk tags are not legible from [...] @porcuquine, @vmx and I had a long discussion on the Lurk discord about why this is necessary: https://discord.com/channels/908460868176596992/913200327547822110/964156408490754058

My worry about just pushing as much as we can into CIDv1 is that we end up losing the utility of the CID because it just becomes a way to squish in arbitrary data to a point in a block. One of the main purposes of a CID in IPLD is to provide clear linking semantics between blocks. If we overloaded CIDv1 and hid the actual content address of the link in an inline portion of it then even though the blocks might load fine in existing systems, the DAG disappears because the links aren't links anymore. We end up at the same place as a CIDv2 of having to update all our systems to interpret this new thing, and while it may be less painful and give us more time to adjust, it also gives us lots of space to not upgrade at all, or to give edges of our ecosystem space to not upgrade, turning DAGs into collections of arbitrary blocks. The choice would be something like: would you rather push your DAG to a pinning service where you don't know if they support the new inline CIDv1-with-embedded-link, and therefore, just in case, you have to push them each block one by one and get them to pin each block individually? Or, have the pinning service error with [...]

I think I'm on team just accept the pain and upgrade all the things even though it's going to take time. I also think I'd prefer to not have a CIDv1 variant in the spec because having an easy way out might leave us in a half way state that sucks more than just biting the bullet.

I think that makes a lot of sense. While my initial thinking was that CIDv2 would be an optional extension that would live alongside CIDv1, I think that there's certainly a way to modify CIDv2 to have it work as a CIDv1 replacement. Specifically I think what I would want to do is

And then we would need a bit to switch on whether the cid has metadata or not:

or

where everything in the parenthesis is present if [...] If we don't want to add a whole extra varint for a single bit though, we could actually switch on the version varint, where [...]

+1 on the struct but you're right about the optionality - I don't know if I have an opinion yet on having an additional bit vs making metadata mandatory for v2 and therefore requiring a v1 where there is no metadata. A third-way would be to make it mandatory if you're using a v2 but allow for the metadata to be 3 zero-bytes.

One thing that continues to bother me about this (I mentioned this in the other thread) is that I lose the ability to inspect initial bytes to see what's coming. Currently we can do this with just enough bytes to read 3 varints: https://github.com/multiformats/js-multiformats/blob/dcfdac59df3570b85e633afae5ac8f6caf0a4441/src/cid.js#L312-L324 Arguably the utility of this isn't as great as it seems, but I'd probably have to remove that function, or make it throw, or something else in the case of a CIDv2. Its main use is in [...]

Another alternative is if instead of redefining CID we redefined what Link means in the IPLD Data Model. From what I can tell CIDs are used in primarily two places:

Given that the object descriptions always have their own custom meaning anyway (e.g. [...] Adding metadata inside of the DAG is interesting, however, changing the CID spec isn't necessary for this. You could also change what links mean in the IPLD Data Model and get the same result. Historically it appears that this was intentional, for example in https://github.com/ipld/ipld/blob/835d010583accf0dbec7f3ddbd4b6a66f86e2fa2/_legacy/specs/FOUNDATIONS.md#linked it's indicated that Links were intended to eventually allow for referring to data inside of blocks. Similar logic could extend to allowing for other kinds of type information there as well.

IPLD already feels a little like a second-class citizen in a lot of IPFS implementations, and I worry that breaking the identity between CID and IPLD::Link would just exacerbate that.

How can that both be true and this CIDv2 proposal be relevant? If you take the position that non-IPLD things are second class then what you're left with is basically UnixFS and then what are these tags going to do for UnixFS data? In order for the tags to be useful the IPLD tooling is going to need to expose it anyhow.

That's not my position. I was observing that in e.g. the IPFS http api we have two parallel sets of calls for [...] Changing the IPLD data model to make an IPLD::Link not a CID would probably result in a lot of implementations just not supporting IPLD.

How are these implementations benefiting from the tag information inside the CID if the IPLD tooling doesn't support exposing or working with that tag information? In your example how would you expect either of kubo's [...]

I wouldn't really expect the kubo commands to change much, the additional information in a CIDv2 is primarily intended to be used at the application level.

Specific IPLD libraries like [...]

- [CIDv2 with arbitrary-precision multicodec size](https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7#appendix-a-cidv2-and-arbitrary-precision-multicodec)
- CIDv2 with nested hashes

Could you detail this a bit more? Is this just allowing the CIDs inside the CIDv2 to also be CIDv2's rather than restricting them to CIDv1s?

In this proposal the contents of CIDv2s are not CIDv1s, but rather the broken apart multicodec-multihash pairs. This is specifically to mitigate the issues with nesting raised in the previous discussion multiformats/cid#49. The other idea of arbitrary-precision multicodec is to figure out how to safely remove the 9-byte limit on multicodec-varints (such as by adding a size field), and then managing larger metadata tags by allocating ranges on the now infinite multicodec table. However, that solution requires both technical changes to implementations, as well as process changes to how multicodec is managed, whereas the current IPIP should largely only require the former.

### Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).

Having a single code for "IPLD schema" followed by a multihash seems off. IPLD schemas can be represented in multiple different formats including dag-json and dag-cbor. Is this a codec for `ipld-schema-dag-json`? It seems quite bizarre that we'd need to define multiple codes for `ipld-schema-<some ipld codec>` for any codec we might want to use to encode a schema. Basically what's happened here is we've glued back together the structure of the data and the serialized form of the data when describing type information. While sometimes users might be fine with that I suspect other times they may not, just as is the case with regular data.

There are a lot of things in multicodec for which this is also true (e.g. the ethereum codecs: https://github.com/multiformats/multicodec/blob/master/table.csv#L55) and my understanding of how it works there is that the format is just described in the description (such as https://ethereum.org/en/developers/docs/data-structures-and-encoding/rlp/, even though you could in principle encode any RLP data as dag-cbor if you wanted).

IPLD codecs tell you how to decode serialized representations of data (into the IPLD data model), not necessarily what the data is or what it's for. The ethereum codecs, like the Git ones, are tied to a particular serialized data format; if you wanted to transcode the data into something like dag-cbor, tagging the data with the prior codec would result in a deserialization error.
Many existing hash linked data structures have more fixed representations than, say, the FBL ADL, which is defined over arbitrary serialized forms as long as they can be decoded into a compatible IPLD Data Model layout. As a result it can appear as though the codecs are types even though they're deserialization mechanisms.
This means really what you'd need to express the type correctly is a second code to say "this is dag-json" next to the code saying it was an ipld-schema.

The key word is necessarily there. A multicodec can absolutely tell you what the data is for. For example, if you have a CIDv1 that points to an Ethereum Block you could equally choose to encode using
Likewise we have a codec for all `cbor` and a more specific codec for `dag-cbor`. And ofc `raw` supersets everything.

So there's nothing strange if the IPLD Schema team wanted to set a default format
or with dag-json. Yes we could have `ipld-schema-dag-json`, `ipld-schema-dag-cbor` to disambiguate, but that seems like it should be an application level decision whether or not they'd want to ask for multiple multicodecs to do that.

If doing nominative typing this way was reasonable then CIDv2 wouldn't be necessary in almost any application since you could just register every type as a different codec. IIUC this kind of thing would in theory work for the UCAN case as well, it's just that using the global code table for nominative typing like this seems bad. Applications can end up with many different named data types, sometimes it's 10s or 100s, or the many more that Lurk would require reserving codes in the table this way.
Some links around not using multicodecs for nominative types:
Sure, but how can I do a non-disambiguated `ipld-schema` that just works like IPLD Schemas do on any IPLD Data Model data? This code field has provided nominative typing, but without enough parameters to be useful for parameterized nominative types like IPLD Schemas (or dealing with multiblock data structures as in (#305 (comment))).

Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kind of glad this discussion is happening here, although I feel it might be a diversion from the main discussion—which is why it's probably good that we get this on the table now. This specific point is why I was hoping to have @vmx chime in. I worry that the Lurk specifics embedded in the doc here might be a distraction from the main goal. Even after reading all of this I don't really understand why, with the second CID-ish for metadata Lurk just couldn't encode a dag-cbor, dag-json, or even raw custom format bytes with the tag they want. Specifically:
meta_codec
could be dag-cbor (0x71
), andmeta_hash
be inline (0x0
) with whatever you like for your tag—you could even embed the mega-int here that the 9-byte varints are getting in the way of currently.Perhaps that's essentially what you're aiming for through the use of a new "codec" to identify a "schema", just keeping it more efficient.
But my point again is that I think this is a distraction for the purpose of this spec. If Lurk wants to abuse the multicodec spec then that's their choice. It would be best for everyone if they want to register a new codec for this purpose to identify a "schema" in the multicodec table and we could continue this discussion there. For now, I think
0x3e7ada7a
is in the way. It stands apart from the commonly understood purpose of a CID and as this discussion is suggesting there's a weirdness about it that leads us into a deep hole (the multicodec repo has many of these deep holes, covering very similar territory, I even had this discussion specifically about rlp and the eth codecs just a couple of months ago). We accept that there are squishy edges to the concept of a "multicodec", but always work to try and keep things toward the well understood and agreed-upon center where possible.So my suggestion is to remove
0x3e7ada7a
, shunt this distraction to the multicodec repo in due course, and go with something more commonly understood - maybe just an inline dag-json blob. Then we can at least start reasoning about the basics.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I see this just as an example of how people might want to use it. I think the purpose of the proposal should be about, whether we want those CIDs with two pointers to provide additional context or not.
In regards to Lurk, I also don't think the
0x3e7ada7a
codec is needed. It could just be e.g. DAG-CBOR and you could encode your schema as such.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, we absolutely can, and this is part of the intent of having the second CID. The
0x3e7ada7a
was only meant as an illustrative example foripld-schemas
, but the rest of the proposal is the same if we just replaced it in the doc with0x71
fordag-cbor
, as @vmx suggested.Regarding the prior topic about nominative types, I don't think either Lurk or Yatima need or want to add typing to multicodec, really. To be super concrete about what we need: For Lurk, we want to add a 16-bit metadata field to our CIDs, and for Yatima we want a 256-bit metadata multihash. In terms of multicodecs, I don't think it matters that much to either of those use cases whether we get a single application codec, multiple application codecs, or no codecs (and we just use e.g.
dag-cbor
). As long as we have a flexible way to add metadata, if we do end up needing additional info in pointers, we can just put it in that variable metadata field (e.g. with the identity multihash)