Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Jcb/serde minimal #147

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

johnchandlerburnham
Copy link

@johnchandlerburnham johnchandlerburnham commented Apr 30, 2022

Currently we serialize structs and enums as maps with the field names included in the data:

  struct Person {
    name: String,
    age: u8,
    hobbies: Vec<String>,
    is_cool: bool,
    link: Cid,
  }

  impl Default for Person {
    fn default() -> Self {
      Self {
        name: "Hello World!".into(),
        age: 52,
        hobbies: vec!["geography".into(), "programming".into()],
        is_cool: true,
        link: Cid::try_from(
          "bafyreibvjvcv745gig4mvqs4hctx4zfkono4rjejm2ta6gtyzkqxfjeily",
        )
        .unwrap(),
      }
    }
  }

becomes

{"age": 52, "hobbies": ["geography", "programming"], "is_cool": true, "link": bafyreibvjvcv745gig4mvqs4hctx4zfkono4rjejm2ta6gtyzkqxfjeily, "name": "Hello World!"}

Which sorts the keys via lexicographic order of the strings, due to an interior use of BTreeMap. Order preserving codecs like dag-json will preserve this order, whereas codecs like dag-cbor will impose their own strict order. In this snippet https://gist.github.com/johnchandlerburnham/efcce4e52f820d10d5fcc96ef7b8cce6 we can see an example where by converting a Rust type to Ipld, then to cbor, then to json, we can obtain a different ordering of the fields, and thus a different hash.

Instead, we can encode structs as tuples by default (similar to https://github.com/kardeiz/serde_tuple) and enums variants according to the order of their source position.

This has the further benefit of allowing the unit type () and the unit struct to be embedded as Ipld.

Additionally, this PR alters the embedding of Rust map-like collections to use Ipld::List rather than Ipld::Map. This allows maps with non-string keys to be embedded.

By removing all occurrences of Ipld::Map from the default serde serialization, we can then solve #142 by using maps in the Serde data model, as a dynamic hint to indicate a Cid, similar to the dag-json {"/": "bafy..."} technique.

@johnchandlerburnham johnchandlerburnham changed the title Jcb/serde minimal WIP: Jcb/serde minimal Apr 30, 2022
@vmx
Copy link
Member

vmx commented May 2, 2022

Which sorts the keys via lexicographic order of the strings, due to an interior use of BTreeMap. Order preserving codecs like dag-json will preserve this order, whereas codecs like dag-cbor will impose their own strict order. In this snippet https://gist.github.com/johnchandlerburnham/efcce4e52f820d10d5fcc96ef7b8cce6 we can see an example where by converting a Rust type to Ipld, then to cbor, then to json, we can obtain a different ordering of the fields, and thus a different hash.

I don't understand that example. The IPLD Data Model doesn't impose an ordering. It depend on the codec you use. You'd also need to use the "DAG" version of those codecs, as the are targeted at generating the same hash, if the input data is the same (as good as possible). The hash is always built from the encoded data, not of the data model representation. So if you go Rust type -> Ipld -> DAG-CBOR -> DAG-JSON -> DAG-CBOR, the DAG-CBOR encoded data should be the same hash. The same if you flip DAG-CBOR and DAG-JSON. If not, then it's a bug.

Instead, we can encode structs as tuples by default (similar to https://github.com/kardeiz/serde_tuple) and enums variants according to the order of their source position.

Yes, using tuples is one way to preserve ordering across encodings if you need that.

There is a new library called serde_ipld_dag_cbor, that should work better with Serde. I think you can get the same effect with serde_tuple. I'll give it a try and see if it results in the same outcome. As we now have such a Serde based DAG-CBOR implementation, I'd hope that those kind of transformations can be done on the Serde level, rather then on the IPLD level.

@johnchandlerburnham
Copy link
Author

So if you go Rust type -> Ipld -> DAG-CBOR -> DAG-JSON -> DAG-CBOR, the DAG-CBOR encoded data should be the same hash. The same if you flip DAG-CBOR and DAG-JSON. If not, then it's a bug.

I'm not sure if there's actually a way to go directly from DAG-CBOR to DAG-JSON in the way you're describing. My understanding is that DAG-CBOR is a codec relating the Ipld and Cbor data models. I.e, we can think of DAG-CBOR as the pair of an Ipld -> Cbor and a Cbor -> Ipld function (ignoring erroring). Likewise DAG-JSON is the pair of an Ipld -> Json and a Json -> Ipld function.

The issue is that Cbor and Json are independently interconvertible to one another, without any reference to Ipld. So we also have Cbor -> Json and Json -> Cbor.

image

Ideally we would like this diagram to commute, but what my snippet shows is that depending on where a Rust type enters this diagram via serde, this might not be the case. Because JSON is order-preserving, and CBOR is strict-ordering, it is possible to cause the ordering of objects to change. I agree that this can't happen if you only convert to/from Ipld (which I think what you mean in your example).

My preference would be to have the default serde Serializer use tuples by default. If we have to use the Serialize_tuple and Deserialize_tuple macros from serde_tuple, that's workable, but given that unintentional re-ordering breaks canonicity, I think tuples are more in line with the content-addressing objectives of Ipld.

Furthermore, the field labels are not actually relevant to content-addressing, per se. And I think it's strange that e.g. changing the name of a field in a Rust type, would cause its hash to change. There ought to be a mechanism for providing this kind of schema or type information out-of-band, like through the CID codec field.

@vmx
Copy link
Member

vmx commented May 4, 2022

I'm not sure if there's actually a way to go directly from DAG-CBOR to DAG-JSON in the way you're describing.

Currently not. But hopefully we will also have a Serde based DAG-JSON coded, which then enables direct DAG-JSON <-> DAG-CBOR conversion.

Because JSON is order-preserving,

DAG-JSON also has a specific key order, it just happens to be the same as the Rust's BTree order. In IPLD the order is specified by the codec, to make sure that the same data, when encoded, leads to the same hash.

My preference would be to have the default serde Serializer use tuples by default. If we have to use the Serialize_tuple and Deserialize_tuple macros from serde_tuple, that's workable, but given that unintentional re-ordering breaks canonicity,

Rust structs align well with IPLD Schema structs. There the default representation is a map, but you can chose a different representation, which is exactly what you are looking for. In your use case you don't care about the field names of the struct. I would try to solve this on the Serde level.

I think tuples are more in line with the content-addressing objectives of Ipld.
Furthermore, the field labels are not actually relevant to content-addressing, per se.

It depends on your application. If you want self-describing data without the need of an external schema, then you need the field names, else you wouldn't know what the data refers to.

@johnchandlerburnham
Copy link
Author

DAG-JSON also has a specific key order, it just happens to be the same as the Rust's BTree order. In IPLD the order is specified by the codec, to make sure that the same data, when encoded, leads to the same hash.

From the spec:

Sort object keys by their (UTF-8) encoded representation, i.e. with byte comparisons

Okay, I think this results in the same ordering as DAG-CBOR (which is based on https://datatracker.ietf.org/doc/html/rfc7049#section-3.9), because of the restriction of DAG-CBOR maps to String keys (which iirc are restricted in CBOR to always be UTF-8). Doing DAG-JSON <-> DAG-CBOR then would mostly solve the ordering problem I raised, although I still think there's still a risk of end-users lowering their data to ordinary JSON.

At a high level, it seems reasonable to me that we would want to support multiple representation strategies in IPLD.

Rust structs align well with IPLD Schema structs. There the default representation is a map, but you can chose a different representation, which is exactly what you are looking for. In your use case you don't care about the field names of the struct. I would try to solve this on the Serde level.

It makes sense to me that the libipld serde serialization should support multiple representations. And doing that at the serde level makes sense, since then serializing via the serde_json or serde_cbor serializer won't create radically different data compared to the libipld serializer.

I'm fine with having something like https://github.com/kardeiz/serde_tuple, where you use a different derive macro to indicate the representation you want:

pub struct Foo<'a> {
    bar: &'a str,
    baz: i32
}

But I think maybe it's worth including those derive macros in libipld itself so that we can make sure they're aligned with the representation strategies in Ipld Schemas: https://ipld.io/docs/schemas/features/representation-strategies/

@vmx
Copy link
Member

vmx commented May 5, 2022

Okay, I think this results in the same ordering as DAG-CBOR (which is based on https://datatracker.ietf.org/doc/html/rfc7049#section-3.9), because of the restriction of DAG-CBOR maps to String keys (which iirc are restricted in CBOR to always be UTF-8).

Sadly not. Quote from RFC-7049:

 *  If two keys have different lengths, the shorter one sorts earlier;

So you need to compare the lengths first. This was fixed in RFC-8949, but sadly we are stuck with the old RFC for DAG-CBOR.

But I think maybe it's worth including those derive macros in libipld itself so that we can make sure they're aligned with the representation strategies in Ipld Schemas: https://ipld.io/docs/schemas/features/representation-strategies/

Yes, similar to the macros libipld currently has for the non Serde based version.

@vmx
Copy link
Member

vmx commented May 5, 2022

Furthermore, the field labels are not actually relevant to content-addressing, per se. And I think it's strange that e.g. changing the name of a field in a Rust type, would cause its hash to change.

One more thought on this one. The same argument applies if you see a struct as a key-value map. Then it would be strange that re-ordering the fields (without changing their names) would change the hash.

@johnchandlerburnham
Copy link
Author

The same argument applies if you see a struct as a key-value map. Then it would be strange that re-ordering the fields (without changing their names) would change the hash.

This is true, but I think this should be very explicitly a choice on the part of the user, since there are many cases where you might want to have named fields for developer ergonomics, but not to change the serde serialization.

I've been looking at serde_tuple a little more, and I now think that the approach there won't really work for this, since it doesn't deal with enums. Also, I think that their approach might struggle in the case of a Serialize_tuple datatype depending on an ordinary Serialize datatype.

It might be possible to address this using more complicated derive macros, but I think my approach in this PR, of just defining an entirely new Serializer trait implementation could be incorporated additively. We could gate it behind a serde-codec-tuple feature, or maybe we could have IpldTupleStruct and IpldMapStruct newtypes for Ipld.

@vmx
Copy link
Member

vmx commented May 10, 2022

I haven't had a chance yet to look into the Serialize_tuple implementation. Though I agree that having a way to serialize a struct as a list is useful. From a users perspective, I think nicest would be to have a define it via an attribute like #[ipld(repr = "tuple")]. But I don't know how feasible that it.

A problem I see with a feature flag is, that you then cannot selectively select one or the other, but all structs of your project will then be one way or the other.

Though I'm generally open to just try things out and then break APIs whenever needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants