IPIP-305: CIDv2 - Tagged Pointers #305

Closed
wants to merge 2 commits into from
130 changes: 130 additions & 0 deletions IPIP/0000-cidv2-tagged-content-identifiers.md
@@ -0,0 +1,130 @@
# IPIP 0000: CIDv2 - Tagged Pointers

<!-- IPIP number will be assigned by an editor. When opening a pull request to
submit your IPIP, please use number 0000 and an abbreviated title in the filename,
`0000-draft-title-abbrev.md`. -->

- Start Date: 2022-08-05
- Related Issues:
- https://github.com/multiformats/cid/pull/49
- [Content Addressing Lurk Data](https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7)

## Summary

<!--One paragraph explanation of the IPIP.-->
Create a new [CID](https://github.com/multiformats/cid) version (CIDv2, informally "Tagged Content Identifiers") which combines a data Multicodec-Multihash pair (the pointer) and a metadata Multicodec-Multihash pair (the tag) to create content-addresses with expressive contexts.

## Motivation

Currently, CIDv1 data is described by a multicodec content type. However, this is meant to describe the overall format of the serialized data e.g. the `dag-cbor` IPLD encoding, and not more specific information such as a data schema or type. For example, it can be useful to have raw IPLD data contextualized by its [IPLD schema](https://ipld.io/docs/schemas/intro/). Since multicodecs are limited to 9 bytes by the [unsigned-varint spec](https://github.com/multiformats/unsigned-varint#practical-maximum-of-9-bytes-for-security), the available codec space is generally too small to encode such metadata.

## Detailed design

Our solution is a new CID version which contains two multicodec-multihash pairs, one pair for data and another for metadata. The metadata multicodec would be able to concisely describe a space of metadata tags where the specific tag would then be further specified by the multihash. This could be implemented as follows in Rust:

```rust
pub struct CidV2<const S: usize, const M: usize> {
/// The data multicodec
data_codec: u64,
/// The data multihash
data_hash: Multihash<S>,
/// The metadata codec
meta_codec: u64,
/// The metadata multihash
meta_hash: Multihash<M>
}
```

It would serialize as follows:

```
<cidv2> ::= <multicodec-cidv2><multicodec-data-content-type><multihash-data><multicodec-metadata-content-type><multihash-metadata>
```

with a multibase prefix when represented in text.
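As a minimal sketch of the byte layout above (not the linked rust-cid implementation), the following uses only the standard library. The `Multihash` struct and the helper names `write_varint`, `write_multihash`, and `encode_cidv2` are hypothetical and exist only for illustration; the version value `0x02` is taken from the Compatibility section.

```rust
/// A multihash in wire order: hash-function code, digest length, digest bytes.
/// (Hypothetical stand-in for the `Multihash<S>` type used in the proposal.)
pub struct Multihash {
    pub code: u64,
    pub digest: Vec<u8>,
}

/// Append `value` as an unsigned varint: 7 bits per byte, least-significant
/// group first, high bit set on every byte except the last.
/// The unsigned-varint spec caps encodings at 9 bytes, i.e. values below 2^63.
pub fn write_varint(mut value: u64, out: &mut Vec<u8>) -> Result<(), String> {
    if value >= 1u64 << 63 {
        return Err("value does not fit in a 9-byte varint".to_string());
    }
    while value >= 0x80 {
        out.push(((value & 0x7f) as u8) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
    Ok(())
}

/// Append a multihash as <code varint><length varint><digest>.
fn write_multihash(mh: &Multihash, out: &mut Vec<u8>) -> Result<(), String> {
    write_varint(mh.code, out)?;
    write_varint(mh.digest.len() as u64, out)?;
    out.extend_from_slice(&mh.digest);
    Ok(())
}

/// Serialize the five fields of the `<cidv2>` grammar in order.
pub fn encode_cidv2(
    data_codec: u64,
    data_hash: &Multihash,
    meta_codec: u64,
    meta_hash: &Multihash,
) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    write_varint(0x02, &mut out)?; // <multicodec-cidv2>
    write_varint(data_codec, &mut out)?; // <multicodec-data-content-type>
    write_multihash(data_hash, &mut out)?; // <multihash-data>
    write_varint(meta_codec, &mut out)?; // <multicodec-metadata-content-type>
    write_multihash(meta_hash, &mut out)?; // <multihash-metadata>
    Ok(out)
}
```

The text form would then be these bytes passed through a multibase encoder, which is left out of this sketch.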

For example, suppose you want a CID which points to a piece of IPLD data and its [IPLD schema](https://ipld.io/docs/schemas/). Let's say you have the schema `Trit`, with a particular integer representation

```
type Trit union {
| True ("1")
| False ("2")
| Unknown ("0")
} representation int
```

which corresponds to the Ipld data: `Ipld::Num(1)`, `Ipld::Num(2)`, `Ipld::Num(0)`.

While you could in principle propose a new multicodec for `Trit`, this might not be suitable if `Trit` is a temporary or ephemeral structure, or if you have a large number of different schemas. (For instance, Lurk-lang's content addressing would need to reserve 16 bits of the multicodec table, or 2^16 distinct multicodecs.)

However, since IPLD schemas can be [represented as JSON](https://ipld.io/specs/schemas/#dsl-vs-dmt) and hashed, with a CIDv2 we could reserve a single IPLD schema multicodec, along with the codec for the data representation (such as dag-cbor).
We could then use the above CIDv2 definition to create a pointer to any Schema+Data pair:

```
CidV2 {
data_codec: 0x71,
data_hash: <data_multihash>,
meta_codec : 0x3e7ada7a,
Contributor

Having a single code for "IPLD schema" followed by a multihash seems off. IPLD schemas can be represented in multiple different formats including dag-json and dag-cbor. Is this a codec for ipld-schema-dag-json?

It seems quite bizarre that we'd need to define multiple codes for ipld-schema-<some ipld codec> for any codec we might want to use to encode a schema. Basically what's happened here is we've glued back together the structure of the data and the serialized form of the data when describing type information. While sometimes users might be fine with that I suspect other times they may not, just as is the case with regular data.

Author

Having a single code for "IPLD schema" followed by a multihash seems off. IPLD schemas can be represented in multiple different formats including dag-json and dag-cbor. Is this a codec for ipld-schema-dag-json?

There are a lot of things in multicodec for which this is also true (e.g. the ethereum codecs: https://github.com/multiformats/multicodec/blob/master/table.csv#L55) and my understanding of how it works there is that the format is just described in the description (such as https://ethereum.org/en/developers/docs/data-structures-and-encoding/rlp/, even though you could in principle encode any RLP data as dag-cbor if you wanted)

Contributor

There are a lot of things in multicodec for which this is also true ... the ethereum codecs ... even though you could in principle encode any RLP data as dag-cbor if you wanted

IPLD codecs tell you how to decode serialized representations of data (into the IPLD data model), not necessarily what the data is or what it's for. The ethereum codecs, like the Git ones, are tied to a particular serialized data format: if you transcoded the data into something like dag-cbor, tagging it with the prior codec would result in a deserialization error.

Many existing hash-linked data structures have more fixed representations than, say, the FBL ADL, which is defined over arbitrary serialized forms as long as they can be decoded into a compatible IPLD Data Model layout. As a result, it can appear as though the codecs are types even though they're deserialization mechanisms.

This means really what you'd need to express the type correctly is a second code to say "this is dag-json" next to the code saying it was an ipld-schema.

Author

@johnchandlerburnham Aug 8, 2022

IPLD codecs tell you how to decode serialized representations of data (into the IPLD data model), not necessarily what the data is or what it's for.

The key word is necessarily there. A multicodec can absolutely tell you what the data is for. For example, if you have a CIDv1 that points to an Ethereum Block you could equally choose to encode using

| name | tag | code | description |
|-|-|-|-|
| rlp | serialization | 0x60 | recursive length prefix |
| eth-block | ipld | 0x90 | Ethereum Header (RLP) |

Likewise we have a codec for all cbor and a more specific codec for dag-cbor. And ofc raw supersets everything.

So there's nothing strange if the IPLD Schema team wanted to set a default format

| name | tag | code | description |
|-|-|-|-|
| ipld-schema | ipld | 0xdead_beef | Ipld Schema (dag-cbor) |

or with dag-json. Yes we could have ipld-schema-dag-json, ipld-schema-dag-cbor to disambiguate, but that seems like it should be an application level decision whether or not they'd want to ask for multiple multicodecs to do that

Contributor

The key word is necessarily there. A multicodec can absolutely tell you what the data is for.

If doing nominative typing this way was reasonable then CIDv2 wouldn't be necessary in almost any application since you could just register every type as a different codec. IIUC this kind of thing would in theory work for the UCAN case as well, it's just that using the global code table for nominative typing like this seems bad. Applications can end up with many different named data types, sometimes 10s or 100s, or the many more that Lurk would require, and each one would need a code reserved in the table this way.

Some links around not using multicodecs for nominative types:

Yes we could have ipld-schema-dag-json, ipld-schema-dag-cbor to disambiguate, but that seems like it should be an application level decision

Sure, but how can I do a non-disambiguated ipld-schema that just works like IPLD Schemas do on any IPLD Data Model data? This code field has provided nominative typing, but without enough parameters to be useful for parameterized nominative types like IPLD Schemas, or for dealing with multiblock data structures as in #305 (comment).

Member

@rvagg Aug 11, 2022

I'm kind of glad this discussion is happening here, although I feel it might be a diversion from the main discussion—which is why it's probably good that we get this on the table now. This specific point is why I was hoping to have @vmx chime in. I worry that the Lurk specifics embedded in the doc here might be a distraction from the main goal. Even after reading all of this I don't really understand why, with the second CID-ish for metadata Lurk just couldn't encode a dag-cbor, dag-json, or even raw custom format bytes with the tag they want. Specifically: meta_codec could be dag-cbor (0x71), and meta_hash be inline (0x0) with whatever you like for your tag—you could even embed the mega-int here that the 9-byte varints are getting in the way of currently.
Perhaps that's essentially what you're aiming for through the use of a new "codec" to identify a "schema", just keeping it more efficient.

But my point again is that I think this is a distraction for the purpose of this spec. If Lurk wants to abuse the multicodec spec then that's their choice. It would be best for everyone if they want to register a new codec for this purpose to identify a "schema" in the multicodec table and we could continue this discussion there. For now, I think 0x3e7ada7a is in the way. It stands apart from the commonly understood purpose of a CID and as this discussion is suggesting there's a weirdness about it that leads us into a deep hole (the multicodec repo has many of these deep holes, covering very similar territory, I even had this discussion specifically about rlp and the eth codecs just a couple of months ago). We accept that there are squishy edges to the concept of a "multicodec", but always work to try and keep things toward the well understood and agreed-upon center where possible.

So my suggestion is to remove 0x3e7ada7a, shunt this distraction to the multicodec repo in due course, and go with something more commonly understood - maybe just an inline dag-json blob. Then we can at least start reasoning about the basics.

Member

But my point again is that I think this is a distraction for the purpose of this spec.

I agree. I see this just as an example of how people might want to use it. I think the proposal should be about whether or not we want those CIDs with two pointers to provide additional context.

In regards to Lurk, I also don't think the 0x3e7ada7a codec is needed. It could just be e.g. DAG-CBOR and you could encode your schema as such.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even after reading all of this I don't really understand why, with the second CID-ish for metadata Lurk just couldn't encode a dag-cbor, dag-json, or even raw custom format bytes with the tag they want.

Just to clarify, we absolutely can, and this is part of the intent of having the second CID. The 0x3e7ada7a was only meant as an illustrative example for ipld-schemas, but the rest of the proposal is the same if we just replaced it in the doc with 0x71 for dag-cbor, as @vmx suggested.

Regarding the prior topic about nominative types, I don't think either Lurk or Yatima need or want to add typing to multicodec, really. To be super concrete about what we need: For Lurk, we want to add a 16-bit metadata field to our CIDs, and for Yatima we want a 256-bit metadata multihash. In terms of multicodecs, I don't think it matters that much to either of those use cases whether we get a single application codec, multiple application codecs, or no codecs (and we just use e.g. dag-cbor). As long as we have a flexible way to add metadata, if we do end up needing additional info in pointers, we can just put it in that variable metadata field (e.g. with the identity multihash)

meta_hash: <schema-multihash>
Comment on lines +67 to +68
Contributor

This looks great, but what happens if my schema starts becoming large? For example, say I have a 3MiB schema. Now this schema exceeds the 2MiB block limit imposed by many IPFS implementations and that 3MiB schema won't be transferrable. Maybe a 3MiB schema seems excessive, but people may go down this road for other reasons (e.g. I want my tag to be wasm-module and my WASM code happens to be large).

I could start playing around with a few levels of workarounds here such as:

  1. Get a new code for unixfs-representation-of-schema
    • Sad because now I need to change my code to process schema and unixfs-representation-of-schema as schemas
    • Sad because I need a new code for every different system I use to encode my bytes (UnixFS, FBL, BitTorrent v1/v2, WNFS, etc.)
  2. Make a new type ipld-ADL-wrapper-dag-cbor that looks like { TypeData: <type-cid>, TypeADL: "unixfs" } encoded as dag-cbor, and add code so that when I encounter an ipld-ADL-wrapper-dag-cbor I recurse in a layer
    • Sad because it seems like we end up having to make our own type system anyhow despite the CIDv2 version bump

This seems to indicate that putting type information in the CID this way is going to be problematic because types may themselves have types and so we may want to deal with them the same way we deal with the data itself (e.g. allowing the metadata to be a CIDv2 as well, or one of the other proposals).

Author

This looks great, but what happens if my schema starts becoming large?

It doesn't sound like this is a CIDv2 specific issue. I can store a 3MiB dag-cbor IPLD object on IPFS and generate a CIDv1 with its sha256 multihash, right? In terms of transport, I think CIDv2 will just behave like a pair of CIDv1s

Contributor

It doesn't sound like this is a CIDv2 specific issue.

No, it is a specific issue with this CIDv2 proposal. Nowhere else do we use codec identifiers as data types, we use them as deserialization types. As a result there is no notion of a data type changing or becoming too big, that becomes an application layer concern. For example, an object can represent a UnixFS directory whether it is a single directory block or the root of a sharded HAMT.

By using the code as a nominative type rather than a description of how to deserialize the data you've navigated into a position where there's nowhere to identify both the type of the data and how to get it as a multiblock data structure without another level of indirection. However, that level of indirection could similarly be used instead of CIDv2 entirely (see alternative proposal).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nowhere else do we use codec identifiers as data types, we use them as deserialization types.

Using a codec identifier as a datatype isn't an essential (or actually even intended) part of this proposal, so I'm absolutely happy to make any changes you suggest to better align with how multicodecs should be used.

For example, replacing the 0x3e7ada7a example codec for ipld-schema with dag-cbor, and in general specifying that the metadata codec should refer to the metadata format would be totally fine for the intended use-cases

}
```

We could thus create an unambiguous pointer to `Trit::True` with

```
CidV2 {
data_codec: 0x71,
data_hash: Ipld::Num(1).hash(),
meta_codec: 0x3e7ada7a,
meta_hash: trit_schema.hash(),
}
```
without having to reserve anything new on the multicodec table.

Modified spec file contains the following changes:
- [Added a definition for CIDv2](https://github.com/yatima-inc/cid/blob/master/README.md)
- [Added an implementation for CIDv2 to rust-cid](https://github.com/yatima-inc/rust-cid/tree/cid-v2)

## Test fixtures

| version | data multicodec | data multihash | metadata multicodec | metadata multihash | base32lower CIDv2 |
|-|-|-|-|-|-|
| cidv2 | raw | sha2-256-256-f3a6eb0790f39ac87c94f3856b2dd2c5d110e6811602261a9a923d3bb23adc8b7 | raw | sha2-256-256-fea3bd73e2b506e00527232b3ed743c066da83a8e3066f62a71e75eb9b4aa1db6 | bajkreib2n2yhsdzzvsd4stzyk2zn2lc5cehgqelaejq2tkjd2o5shloiw5kreihkhplt4k2qnyafe4rswpwxipagnwudvdrqm33cu4phl243jkq5wy |
| cidv2 | raw | sha2-256-256-f3a6eb0790f39ac87c94f3856b2dd2c5d110e6811602261a9a923d3bb23adc8b7 | identity | identity-4-6d657461 | bajkreib2n2yhsdzzvsd4stzyk2zn2lc5cehgqelaejq2tkjd2o5shloiw4aaabdnmv2gc |
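The second fixture row can be rebuilt by hand as a sanity check. The sketch below assumes the leading `f` in the digest columns is the base16 multibase prefix, that `raw` is `0x55`, `sha2-256` is `0x12`, `identity` is `0x00` for both the metadata codec and the metadata multihash, and that every varint in this row fits in one byte. The function names are hypothetical.

```rust
/// RFC 4648 base32, lowercase, unpadded, prefixed with the multibase code `b`.
pub fn multibase_base32lower(bytes: &[u8]) -> String {
    const ALPHABET: &[u8; 32] = b"abcdefghijklmnopqrstuvwxyz234567";
    let mut out = String::from("b");
    let mut buffer: u32 = 0; // bit accumulator; only the low `bits` bits are meaningful
    let mut bits: u32 = 0;
    for &byte in bytes {
        buffer = (buffer << 8) | u32::from(byte);
        bits += 8;
        while bits >= 5 {
            bits -= 5;
            out.push(ALPHABET[((buffer >> bits) & 0x1f) as usize] as char);
        }
    }
    if bits > 0 {
        // Left-align the remaining bits in a final 5-bit group.
        out.push(ALPHABET[((buffer << (5 - bits)) & 0x1f) as usize] as char);
    }
    out
}

/// Decode an even-length hex string into bytes.
pub fn hex_to_bytes(hex: &str) -> Vec<u8> {
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).expect("valid hex"))
        .collect()
}

/// Rebuild the second fixture: raw data with a sha2-256 hash, tagged with the
/// four identity-hashed bytes 6d657461 (ASCII "meta").
pub fn second_fixture() -> String {
    let digest = hex_to_bytes("3a6eb0790f39ac87c94f3856b2dd2c5d110e6811602261a9a923d3bb23adc8b7");
    let mut cid = vec![0x02, 0x55, 0x12, 0x20]; // version 2, raw, sha2-256, 32-byte digest
    cid.extend_from_slice(&digest);
    cid.extend_from_slice(&[0x00, 0x00, 0x04]); // identity codec, identity multihash, length 4
    cid.extend_from_slice(b"meta");
    multibase_base32lower(&cid)
}
```

Under those assumptions `second_fixture()` yields the base32lower string in the table's second row.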



## Design rationale

This design was motivated by the desire of a number of projects to encode additional metadata into CIDs, such as [Yatima-lang](https://github.com/yatima-inc/yatima-lang), [Lurk-lang](https://github.com/lurk-lang/lurk-rs), DAG House, and IPNS-Link (see https://github.com/multiformats/cid/pull/49).

In the case of Lurk, a tagged hash-pointer called `ScalarPtr` contains a 16-bit tag describing the type of node in the scalar graph of language terms. This tag must be included in the CID somehow in order to retrieve individual nodes without re-traversing the entire graph, so unless Lurk reserves each multicodec table entry beginning with a given 16-bit prefix (e.g. `0xC0DE`) it would be difficult if not impossible to have a CID containing both the Lurk data and its associated tag. If we then think about every other protocol which needs to include similar tags, types, or pointers in addition to their data, the multicodec table quickly becomes saturated with hundreds of entries for each application and runs out of 9-byte space.
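One way a 16-bit tag like Lurk's could ride in the metadata multihash is as the payload of an identity multihash, so the tag is readable from the pointer itself. The following is a sketch under assumptions: the big-endian byte order and the function names are illustrative choices, not something Lurk or this proposal specifies.

```rust
/// Build an identity multihash: code 0x00, a one-byte length, then the
/// payload verbatim. This sketch only handles payloads shorter than 128 bytes.
pub fn identity_multihash(payload: &[u8]) -> Vec<u8> {
    assert!(payload.len() < 0x80, "sketch only handles single-byte lengths");
    let mut out = vec![0x00, payload.len() as u8];
    out.extend_from_slice(payload);
    out
}

/// Wrap a 16-bit tag as an identity multihash (big-endian is an assumption).
pub fn tag_multihash(tag: u16) -> Vec<u8> {
    identity_multihash(&tag.to_be_bytes())
}

/// Read the tag back without fetching or hashing anything.
/// Returns `None` unless the input is exactly a 2-byte identity multihash.
pub fn read_tag(multihash: &[u8]) -> Option<u16> {
    match multihash {
        [0x00, 0x02, hi, lo] => Some(u16::from_be_bytes([*hi, *lo])),
        _ => None,
    }
}
```

For example, `tag_multihash(0xC0DE)` produces the four bytes `00 02 c0 de`, which would sit in the `<multihash-metadata>` position of the CIDv2.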

### User benefit

Having arbitrary-length CID metadata allows the data to be fully self-describing and abstracts application-specific interpretation away into the metadata CID.

### Compatibility
Contributor

I don't think this indicates the scope of what in the ecosystem will be affected by this change. The current text makes it appear as though introducing this new version of CID will be fairly trivial when that's not quite the case.

Some example ramifications:

  • Existing CID parsers would need to be updated to support CIDv2, or else would error
  • Many CIDv2s will be too big to represent in subdomains which would effectively break how some tooling (e.g. HTTP gateways) work with CIDs today. Yes, the same is true of large CIDv1s but this is more likely with CIDv2s since they contain two CIDv1s.
  • Tooling that only supports CIDv1 could break if any node being accessed within the graph contains a CIDv2. This could provide a problematic UX for tools that say only take a root CID and assume they can operate on a graph
  • Existing IPLD tooling may need to be upgraded to support the new type of links and expose needed information to users

Many of these are just the cost of doing upgrades in general, or the cost of adding metadata to links, but we should accumulate these and know what we're getting into here.

Author

Existing CID parsers would need to be updated to support CIDv2, or else would error

Yes, but they would error cleanly, since afaiu parsers already have to match on the version varint. But I don't think it's a particularly complicated change to add a case for version 2, as I did in multiformats/rust-cid#123.

Many CIDv2s will be too big to represent in subdomains which would effectively break how some tooling (e.g. HTTP gateways) work with CIDs today. Yes, the same is true of large CIDv1s but this is more likely with CIDv2s since they contain two CIDv1s

The most common CIDv2 sizes will probably be pairs of 256-bit or 512-bit hashes, which are roughly the same sizes as a 512-bit or 1024-bit CIDv1, which should be nearly universally supported.

Contributor

The most common CIDv2 sizes will probably be pairs of 256-bit or 512-bit hashes, which are roughly the same sizes as a 512-bit or 1024-bit CIDv1, which should be nearly universally supported.

Unfortunately not. The base36 encoding of a SHA2-512 raw CID is too long to fit into a URL subdomain. e.g. https://cid.ipfs.tech/#kf1siqqaod24wzk1b0jwakpjxj8z9xaqxwh56nnc267oznfqrm8cc0w0f36g6ir7zb1tuso6ch7kg3at9o6bnr8lm34hty32o1l0ljycu is 105 characters which is greater than the 63 character DNS label limit.
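The size concern can be checked with a little arithmetic. The sketch below uses base32 rather than the base36 of the example above, and assumes every varint (version, codecs, hash codes, digest lengths) fits in one byte, which holds for raw or dag-cbor with sha2-256 or sha2-512. The helper names are hypothetical.

```rust
/// Maximum length of a single DNS label, as cited in the comment above.
pub const DNS_LABEL_MAX: usize = 63;

/// Characters in the unpadded base32 text of `byte_len` bytes, plus one
/// character for the multibase prefix.
pub fn base32_text_len(byte_len: usize) -> usize {
    1 + (byte_len * 8 + 4) / 5
}

/// Binary size of a CIDv1: version, codec, hash code, digest length, digest.
pub fn cidv1_len(digest_len: usize) -> usize {
    1 + 1 + 2 + digest_len
}

/// Binary size of a CIDv2 as proposed: version, then two codec-multihash pairs.
pub fn cidv2_len(data_digest_len: usize, meta_digest_len: usize) -> usize {
    1 + (1 + 2 + data_digest_len) + (1 + 2 + meta_digest_len)
}

/// Whether a CID of `byte_len` bytes fits in one DNS label as base32 text.
pub fn fits_dns_label(byte_len: usize) -> bool {
    base32_text_len(byte_len) <= DNS_LABEL_MAX
}
```

With these assumptions a sha2-256 CIDv1 comes to 59 characters and fits, while a sha2-512 CIDv1 (110 characters) and a CIDv2 carrying two sha2-256 digests (115 characters) both exceed the 63-character label limit.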


For backwards compatibility, the existing CIDv2 codec `0x02` could be used to allow interpretation by legacy CIDv1 application logic, e.g.
```
CidV1 { multicodec: 0x02, hash: <identity-multihash-of-cidv2-serialization> }
```

In the canonical CIDv2 form, the data comes before the metadata because a legacy CIDv1 parser can choose to keep only the former and discard the latter.
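A sketch of what "keep only the former and discard the latter" could look like: read the version, codec, and data multihash, then re-emit just that prefix under version 1. This assumes the binary layout from the Detailed design section; `read_varint`, `skip_multihash`, and `data_as_cidv1` are hypothetical helpers, and the sketch does not validate the discarded metadata or check for minimal varint encodings.

```rust
/// Read one unsigned varint (at most 9 bytes) from the front of `input`,
/// returning the value and the remaining bytes.
fn read_varint(input: &[u8]) -> Option<(u64, &[u8])> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate().take(9) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, &input[i + 1..]));
        }
    }
    None // empty input, truncated varint, or longer than 9 bytes
}

/// Skip one multihash (<code><length><digest>) and return what follows it.
fn skip_multihash(input: &[u8]) -> Option<&[u8]> {
    let (_code, rest) = read_varint(input)?;
    let (len, rest) = read_varint(rest)?;
    if len > rest.len() as u64 {
        return None; // digest is truncated
    }
    Some(&rest[len as usize..])
}

/// Re-express the data half of a binary CIDv2 as a binary CIDv1,
/// dropping the trailing metadata pair.
pub fn data_as_cidv1(cidv2: &[u8]) -> Option<Vec<u8>> {
    let (version, after_version) = read_varint(cidv2)?;
    if version != 2 {
        return None;
    }
    let (_codec, after_codec) = read_varint(after_version)?;
    let after_data = skip_multihash(after_codec)?;
    // Everything between the version and the metadata is the data pair.
    let data_len = after_version.len() - after_data.len();
    let mut cidv1 = vec![0x01];
    cidv1.extend_from_slice(&after_version[..data_len]);
    Some(cidv1)
}
```

Placing the data pair first is what makes this a prefix copy; if the metadata came first, the parser would have to decode and skip it before reaching the data.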

### Security

There is likely some increased memory overhead from supporting double-wide CIDs, but this should not be significant when comparing CIDv2s with 256-bit multihashes against CIDv1s with a 512-bit multihash.

The proposal is also designed to be purely opt-in and backwards compatible with existing implementations. That said, some work may be required to ensure that implementations that do not wish to support CIDv2 can either read a CIDv2 as if it were a CIDv1 (and discard the trailing metadata) or reject the CIDv2 entirely.

### Alternatives
Contributor

What would happen if instead of a CIDv2 you just used a CIDv1 that looked something like { Data: <data-cid>, Metadata: <metadata-cid> } or { Data: <data-cid>, Metadata: <metadata-cid>, Type: <whatever-type-info-you-want> }

This could be encoded as a CIDv1 in DAG-CBOR, or using any other format you wanted.

Some advantages:

  • It doesn't require bumping the CID version and as a result a lot of tooling can be left alone
  • Your type data can be more than a single block without requiring an extra level of indirection
  • You can specify what your data is without reserving a code in the table for every data type you could want.
    • Sure maybe "IPLD Schema" is a reasonable way of representing many types, but I could also see applications showing up with a list of 100 types they'd want codes for. Allocating codes like this isn't just a pain for table maintenance and taking up table space, but it also forces more of the data structure logic out of band which makes it harder for an application that doesn't know what to do with the unknown code number to figure out what to do.

Some disadvantages:

  • It takes up a couple more bytes
    • It's more than a few bytes if you want to be self-describing, but in theory an application could just have a tuple of CIDs which is fairly minimal overhead. This makes the data not self-describing, but it's not in the current proposal either
  • A given application or ecosystem needs to decide on how to encode their metadata/type information
    • This needs to happen in the current proposal anyhow, but in the current proposal developers don't have to think about how to disambiguate data from metadata just how to actually encode their metadata

Author

What would happen if instead of a CIDv2 you just used a CIDv1 that looked something like { Data: <data-cid>, Metadata: <metadata-cid> } or { Data: <data-cid>, Metadata: <metadata-cid>, Type: <whatever-type-info-you-want> }

You mean creating IPLD lists or objects and then hashing them? This works fine for some cases, but not for others since it requires that you have to traverse the hash. In the write-up I did for @vmx I go into some detail about why for Lurk we need to have the metadata tags in the pointers themselves: https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7

Your type data can be more than a single block without requiring an extra level of indirection

For large metadata, I think having a hash of the metadata is unavoidable. The advantage of this CIDv2 proposal though is that since a CIDv2 is isomorphic to a pair of CIDv1s, you can store your metadata and data in the same content-addressed store with self-describing keys. We do this in Yatima where we have large data and metadata trees for program ASTs: https://github.com/yatima-inc/yatima-lang/blob/35f868ab05a4059690e6da9db2e5c4419537fcd0/Yatima/Datatypes/Cid.lean#L23

So this proposal supports both large metadata (like Yatima's full metadata CIDs) and small metadata (like Lurk's 16-bit tags)

Sure maybe "IPLD Schema" is a reasonable way of representing many types, but I could also see applications showing up with a list of 100 types they'd want codes for. Allocating codes like this isn't just a pain for table maintenance and taking up table space, but it also forces more of the data structure logic out of band which makes it harder for an application that doesn't know what to do with the unknown code number to figure out what to do.

I think what would make sense, if this proposal is adopted, is to allocate a single metadata multicodec for each application, whether that's IPLD Schema, Lurk, Yatima, etc., and then each application would have its own logic for what its metadata means. E.g.

| name | tag | code | description |
|-|-|-|-|
| dag-cbor | ipld | 0x71 | MerkleDAG cbor |
| ... | ... | ... | ... |
| ipld-schema | ipld | 0x3e7a_da7a_0001 | an IPLD Schema DML in dag-cbor |
| lurk-metadata | lurk | 0x3e7a_da7a_0002 | A Lurk tag in the identity multihash |
| yatima-metadata | yatima | 0x3e7a_da7a_0003 | A hash of a Yatima metadata AST |

This has a similar effect as allocating ranges in the multicodec table, but without the centralized overhead.

Contributor

You mean creating IPLD lists or objects and then hashing them? This works fine for some cases, but not for others since it requires that you have to traverse the hash.

I read through the writeup, but still don't understand. What's the problem that you run into if instead of something like

<0x02><lurk-data-code><lurk-data-multihash><lurk-tag-code><lurk-tag-identity-multihash>

you had

taggedLink = EncodeDagCbor([<0x01><lurk-data-code><lurk-data-multihash>, <0x01><lurk-tag-code><lurk-tag-identity-multihash>])

<0x01><0x71><identity-multihash-of-taggedLink>

It seems like the bytes would be almost the same, and any code working with lurk data would already know how to do the conversion of the CIDv1 into two different objects and the use of identity multihashes saves you from doing any repeated hashing.

Author

The key difference is that the lurk tags are not legible from taggedLink without traversing the pointer. In the Lurk case, this might be impossible if we're pointing towards a private input.

@porcuquine, @vmx and I had a long discussion on the Lurk discord about why this is necessary: https://discord.com/channels/908460868176596992/913200327547822110/964156408490754058

Member

What would happen if instead of a CIDv2 you just used a CIDv1 that looked something like ...

My worry about just pushing as much as we can into CIDv1 is that we end up losing the utility of the CID because it just becomes a way to squish in arbitrary data to a point in a block. One of the main purposes of a CID in IPLD is to provide clear linking semantics between blocks. If we overloaded CIDv1 and hid the actual content address of the link in an inline portion of it then even though the blocks might load fine in existing systems, the DAG disappears because the links aren't links anymore. We end up at the same place as a CIDv2 of having to update all our systems to interpret this new thing, and while it may be less painful and give us more time to adjust, it also gives us lots of space to not upgrade at all—or to give edges of our ecosystem space to not upgrade. Turning DAGs into collections of arbitrary blocks.

The choice would be something like: would you rather push your DAG to a pinning service where you don't know if they support the new inline CIDv1-with-embedded-link and therefore, just in case, have to push each block one by one and get them to pin each block individually? Or have the pinning service error with "unknown CID version: 2" and move on to a different pinning service, knowing that you just want to pin a root and they'll take care of the DAG connectivity?

I think I'm on team just accept the pain and upgrade all the things even though it's going to take time. I also think I'd prefer to not have a CIDv1 variant in the spec because having an easy way out might leave us in a half way state that sucks more than just biting the bullet.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think I'd prefer to not have a CIDv1 variant in the spec because having an easy way out might leave us in a half way state that sucks more than just biting the bullet.

I think that makes a lot of sense. While my initial thinking was that CIDv2 would be an optional extension that would live alongside CIDv1, I think that there's certainly a way to modify CIDv2 to have it work as a CIDv1 replacement.

Specifically I think what I would want to do is

  pub struct Cid<const S: usize, const M: usize> {
    /// The version of CID.
    version: Version,
    /// The codec of CID.
    codec: u64,
    /// The multihash of CID.
    hash: Multihash<S>,
    /// The metadata multicodec, if any.
    meta_codec: Option<u64>,
    /// The metadata multihash, if any.
    meta_hash: Option<Multihash<M>>,
  }

And then we would need a bit to switch on whether the cid has metadata or not:

<cidv2> ::= <multicodec-cidv2><multicodec-data-content-type><multihash-data>(<multicodec-metadata-content-type><multihash-metadata>)

or

<cidv2> ::= <multicodec-cidv2><multicodec-data-content-type><multihash-data><has-metadata-varint>(<multicodec-metadata-content-type><multihash-metadata>)

where everything in the parentheses is present if has-metadata = 1 and absent if has-metadata = 0.

If we don't want to add a whole extra varint for a single bit, though, we could actually switch on the version varint, where Version::V1 has no trailing metadata and Version::V2 has mandatory trailing metadata. That's maybe more in the same vein as "CIDv2 as optional extension for CIDv1", though.
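As a rough illustration of the second grammar (the one with `<has-metadata-varint>`), here is a minimal round-trip sketch. It is not the rust-cid API: `TaggedCid`, `encode`, `decode` and the helper functions are hypothetical names, the version value `2` is an assumption, and multihashes are simplified to `code + length + digest` with a heap-allocated digest.

```rust
/// Hypothetical version value for a CIDv2 (an assumption for this sketch).
const CID_V2: u64 = 2;

/// Simplified multihash: hash function code plus raw digest bytes.
#[derive(Debug, Clone, PartialEq)]
struct Multihash {
    code: u64,
    digest: Vec<u8>,
}

/// Data pair plus an optional metadata pair (the "tag").
#[derive(Debug, Clone, PartialEq)]
struct TaggedCid {
    codec: u64,
    hash: Multihash,
    meta: Option<(u64, Multihash)>,
}

/// Append `value` as an unsigned varint (7 bits per byte, low bits first).
fn put_uvarint(out: &mut Vec<u8>, mut value: u64) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Read one varint of at most 9 bytes; returns the value and the remaining bytes.
/// Non-minimal encodings are not rejected in this sketch.
fn take_uvarint(input: &[u8]) -> Option<(u64, &[u8])> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate().take(9) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, &input[i + 1..]));
        }
    }
    None
}

fn put_multihash(out: &mut Vec<u8>, mh: &Multihash) {
    put_uvarint(out, mh.code);
    put_uvarint(out, mh.digest.len() as u64);
    out.extend_from_slice(&mh.digest);
}

fn take_multihash(input: &[u8]) -> Option<(Multihash, &[u8])> {
    let (code, rest) = take_uvarint(input)?;
    let (len, rest) = take_uvarint(rest)?;
    if len > rest.len() as u64 {
        return None; // truncated digest
    }
    let (digest, rest) = rest.split_at(len as usize);
    Some((Multihash { code, digest: digest.to_vec() }, rest))
}

/// <version><codec><multihash><has-metadata>(<meta-codec><meta-multihash>)
fn encode(cid: &TaggedCid) -> Vec<u8> {
    let mut out = Vec::new();
    put_uvarint(&mut out, CID_V2);
    put_uvarint(&mut out, cid.codec);
    put_multihash(&mut out, &cid.hash);
    match &cid.meta {
        None => put_uvarint(&mut out, 0),
        Some((meta_codec, meta_hash)) => {
            put_uvarint(&mut out, 1);
            put_uvarint(&mut out, *meta_codec);
            put_multihash(&mut out, meta_hash);
        }
    }
    out
}

/// Decode one CID from the front of `input`, returning it and the leftover bytes.
fn decode(input: &[u8]) -> Option<(TaggedCid, &[u8])> {
    let (version, rest) = take_uvarint(input)?;
    if version != CID_V2 {
        return None;
    }
    let (codec, rest) = take_uvarint(rest)?;
    let (hash, rest) = take_multihash(rest)?;
    let (flag, rest) = take_uvarint(rest)?;
    let (meta, rest) = match flag {
        0 => (None, rest),
        1 => {
            let (meta_codec, rest) = take_uvarint(rest)?;
            let (meta_hash, rest) = take_multihash(rest)?;
            (Some((meta_codec, meta_hash)), rest)
        }
        _ => return None, // only 0 and 1 are meaningful for the flag
    };
    Some((TaggedCid { codec, hash, meta }, rest))
}
```

With single-byte codes and a 32-byte digest, the flag costs one extra byte on an untagged CID; switching on the version varint instead would avoid that byte at the price of keeping two versions in circulation.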

Member

+1 on the struct, but you're right about the optionality. I don't know if I have an opinion yet on having an additional bit versus making metadata mandatory for v2 and therefore requiring a v1 where there is no metadata. A third way would be to make it mandatory if you're using a v2 but allow the metadata to be 3 zero bytes [0,0,0] (codec=0, hasher=identity/0, digest length=0), which would be equivalent to the v1 form: 3 wasted bytes instead of a single one for a flag, but you still get to choose whether you use a v1 to save those bytes.
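To make the "third way" concrete, here is a small sketch of how the empty tag could be mapped to and from an optional value. `EMPTY_TAG`, `tag_of` and `tag_bytes` are hypothetical names, and the functions assume the caller has already parsed the data pair and isolated the bytes that follow it.

```rust
/// The placeholder tag: codec = 0, hasher = identity (0), digest length = 0.
static EMPTY_TAG: [u8; 3] = [0x00, 0x00, 0x00];

/// Interpret the bytes that follow the data multihash in a hypothetical CIDv2.
/// Returns `None` for the placeholder tag and `Some(tag)` for anything else.
fn tag_of(trailing: &[u8]) -> Option<&[u8]> {
    if trailing == &EMPTY_TAG[..] {
        None
    } else {
        Some(trailing)
    }
}

/// Inverse direction: the bytes to write after the data multihash.
/// An absent tag is written as the three-byte placeholder.
fn tag_bytes<'a>(tag: Option<&'a [u8]>) -> &'a [u8] {
    tag.unwrap_or(&EMPTY_TAG[..])
}
```

Under this scheme an untagged CIDv2 always carries three trailing zero bytes, which is the overhead being weighed against the single-byte flag.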

One thing that continues to bother me about this (I mentioned this in the other thread) is that I lose the ability to inspect initial bytes to see what's coming. Currently we can do this with just enough bytes to read 3 varints: https://github.com/multiformats/js-multiformats/blob/dcfdac59df3570b85e633afae5ac8f6caf0a4441/src/cid.js#L312-L324

Arguably the utility of this isn't as great as it seems, but I'd probably have to remove that function, or make it throw, or something else in the case of a CIDv2. Its main use is in decodeFirst() (function defined just above) which is basically the same as: https://github.com/ipfs/go-cid/blob/802b45594e1aed5be3a5b99f00991e9fa8198bfa/cid.go#L691 - the use-case being - "here's a source of bytes I know starts with a CID, give me the CID and the remaining bytes". If there were a way to make it easier to do this initial-bytes-inspection then that'd be great, but it's not a blocker. e.g. if we must have a flag for these optional pieces, we could turn it into a "full length" varint and put it near the front; for common cases I think we'd still fit that in a single byte so it wouldn't be a massive waste. 🤷
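A sketch of the "full length varint near the front" idea, assuming a purely hypothetical layout `<version-varint><body-length-varint><body>`; nothing in the current proposal defines this layout, and `split_first` is an illustrative name. The point is that a reader can cut the CID off the front of a byte stream after reading two varints, without understanding what the body contains.

```rust
/// Read one unsigned varint of at most 9 bytes; returns the value and the rest.
fn take_uvarint(input: &[u8]) -> Option<(u64, &[u8])> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate().take(9) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, &input[i + 1..]));
        }
    }
    None
}

/// Split `input` into (CID bytes, remaining bytes) using only the two leading varints.
/// Assumed layout: <version-varint><body-length-varint><body...>
fn split_first(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let (_version, after_version) = take_uvarint(input)?;
    let (body_len, body) = take_uvarint(after_version)?;
    if body_len > body.len() as u64 {
        return None; // the stream ends before the CID does
    }
    // Header = version varint + length varint; both have been consumed by now.
    let header_len = input.len() - body.len();
    Some(input.split_at(header_len + body_len as usize))
}
```

For a body of up to 127 bytes the length fits in one byte, which covers two 32-byte multihashes with single-byte codes.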


Contributor

Another alternative: instead of redefining CID, we could redefine what Link means in the IPLD Data Model.

From what I can tell CIDs are used in primarily two places:

  1. As the descriptions of objects that users and applications pass around (e.g. ipfs://<cid>)
  2. As the internal links inside of DAGs

Given that the object descriptions always have their own custom meaning anyway (e.g. ipfs:// currently is approximately equal to "try seeing if the data is UnixFS", ipfs block get <cid> assumes the data is an independent block, v1 of the remote pinning API assumes the CID to pin is the root of a graph, ...) adding metadata here is not particularly interesting.

Adding metadata inside of the DAG is interesting; however, changing the CID spec isn't necessary for this. You could also change what links mean in the IPLD Data Model and get the same result. Historically, it appears that this was intentional: for example, https://github.com/ipld/ipld/blob/835d010583accf0dbec7f3ddbd4b6a66f86e2fa2/_legacy/specs/FOUNDATIONS.md#linked indicates that Links were intended to eventually allow for referring to data inside of blocks. Similar logic could extend to allowing for other kinds of type information there as well.

  • Advantages:
    • No need to bump the CID version and so a lot of existing tooling can be left in place
    • Type data can be more than a single block without requiring an extra level of indirection
    • It's not necessary to define codes in the table for your types if you don't want to
  • Disadvantages:
    • Shares disadvantages with the current proposal regarding the need to rework tooling.

Author

IPLD already feels a little like a second-class citizen in a lot of IPFS implementations, and I worry that breaking the identity between CID and IPLD::Link would just exacerbate that.

Contributor

How can that both be true and this CIDv2 proposal be relevant? If you take the position that non-IPLD things are second class then what you're left with is basically UnixFS and then what are these tags going to do for UnixFS data? In order for the tags to be useful the IPLD tooling is going to need to expose it anyhow.

Author

> If you take the position that non-IPLD things are second class

That's not my position. I was observing that in e.g. the IPFS http api we have two parallel sets of calls for ipfs block and ipfs dag: https://docs.ipfs.tech/reference/kubo/cli/#ipfs-dag, with the latter being generally less well supported.

Changing the IPLD data model to make an IPLD::Link not a CID would probably result in a lot of implementations just not supporting IPLD.

Contributor

> Changing the IPLD data model to make an IPLD::Link not a CID would probably result in a lot of implementations just not supporting IPLD

How are these implementations benefiting from the tag information inside the CID if the IPLD tooling doesn't support exposing or working with that tag information? In your example, how would you expect either of kubo's block or dag commands to change to benefit from CIDv2 without having IPLD tooling support?

Author

I wouldn't really expect the kubo commands to change much; the additional information in a CIDv2 is primarily intended to be used at the application level.

> How are these implementations benefiting from the tag information inside the CID if the IPLD tooling doesn't support exposing or working with that tag information?

Specific IPLD libraries like rust-cid will support extracting and manipulating the tag information, and that should be enough for the specific use cases of CIDv2 I'm aware of.

- [CIDv2 with arbitrary-precision multicodec size](https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7#appendix-a-cidv2-and-arbitrary-precision-multicodec)
- CIDv2 with nested hashes

Contributor

Could you detail this a bit more? Is this just allowing the CIDs inside the CIDv2 to also be CIDv2s rather than restricting them to CIDv1s?

Author

In this proposal, the contents of CIDv2s are not CIDv1s, but rather the broken-apart multicodec-multihash pairs. This is specifically to mitigate the issues with nesting raised in the previous discussion in multiformats/cid#49.

The other idea, arbitrary-precision multicodec, is to figure out how to safely remove the 9-byte limit on multicodec varints (such as by adding a size field), and then manage larger metadata tags by allocating ranges on the now-unbounded multicodec table. However, that solution requires both technical changes to implementations and process changes to how multicodec is managed, whereas the current IPIP should largely only require the former.


### Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).