Add SoftWare Heritage persistent IDentifiers #203
From having a quick look at the spec you linked, it seems that they are identifiers and not codecs (in the IPLD sense). The multicodec within a CID is about how the data is encoded and not what contextual meaning it has. If I understand it correctly and SoftWare Heritage always points to Git blobs, then the codec for the CID to use would indeed be |
The only way to embed SWHIDs losslessly in IPLD is to have 5 codecs, one for each of the 5 possible prefixes before the hash of a core SWHID. |
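To make the five-prefix point concrete, here is a minimal Python sketch, assuming the version 1 core grammar `swh:1:<type>:<40 hex digits>` from the linked spec. The codec names are hypothetical placeholders chosen for illustration, not entries from the multicodec table, and the helper names are mine.

```python
import re

# Object types allowed in a version 1 core SWHID.
SWHID_TYPES = ("snp", "rel", "rev", "dir", "cnt")

# Hypothetical codec names, one per prefix (illustration only).
CODEC_FOR_TYPE = {t: f"swhid-1-{t}" for t in SWHID_TYPES}
TYPE_FOR_CODEC = {codec: t for t, codec in CODEC_FOR_TYPE.items()}

CORE_SWHID = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})")


def swhid_to_codec_and_digest(swhid):
    """Split a core SWHID into (codec name, 20-byte digest)."""
    match = CORE_SWHID.fullmatch(swhid)
    if match is None:
        raise ValueError(f"not a version 1 core SWHID: {swhid!r}")
    object_type, hex_digest = match.groups()
    return CODEC_FOR_TYPE[object_type], bytes.fromhex(hex_digest)


def codec_and_digest_to_swhid(codec, digest):
    """Rebuild the SWHID; only possible because the codec keeps the type."""
    return f"swh:1:{TYPE_FOR_CODEC[codec]}:{digest.hex()}"
```

With one codec per prefix the round trip is the identity. With a single shared codec, `codec_and_digest_to_swhid` would have nothing to recover the object type from.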
This might be its own IPLD Codec.
Yes, this information would need to be transmitted out of band. Similar to |
Trying to transfer the information out of band is rather sub-par though. Remember that these objects are not just leaves, they contain SWHIDs themselves (after decoding). SWHID snapshots point to the other git objects, and when one dereferences a directory, one gets full SWHIDs rather than mere git hashes (the permission bits in the git directory object in this case provide enough info to recover a full SWHID for each entry). A goal here would be to try to provide access to Software Heritage's archive over IPFS. But if all this logic has to happen out of band, then we need to modify IPFS in an arbitrary way? It's much nicer to just make a codec that does this work, and fits everything within the normal IPLD data model without extra steps. It also means that a bitswap query doesn't need to be amplified into 4 SWHID queries if the mirroring is done on demand.
To be clear, this "how to decode" vs "semantics" of the data is purely human interpretation. If the git object grammar is something like:
|
Well, yes, this is unfortunately a limitation of CIDs as they are currently framed, but there has to be a bound somewhere in how much context you can jam into an identifier and the current incarnation of CID defines the codec portion as something like: what piece of decoding software do I need to reach for to turn the arbitrary bytes associated with this identifier into something not arbitrary. As you point out, the boundaries of this are a bit squishy because the same bytes could be interpreted in different forms at different levels. We try our best though to keep CIDs at the most basic level, which I think is the basis of @vmx's objection. Additional context that involve questions of where, why or how this object fits into a larger picture are (mostly) out of scope of CIDs. Let's take one of the SWH examples: https://archive.softwareheritage.org/browse/content/sha1_git:94a9ed024d3859793618152ea559a168bbcbb5e2/raw/ If I take that file locally and calculate its SHA-1 on it, I get this:
So for that file as a blob, I could make a CID that takes that digest, wraps it in a SHA-1 multihash, then wraps that in a I could then add it to a git repository and see how it bundles it:
And now we see the
So a CID that wraps What happens beyond that is out of scope for CIDs unfortunately. If you want to bring some special context about what to do with this file then that has to come from elsewhere, which is a very common activity in content addressing. Just because I have a file that has the text contents of the GPL doesn't tell you anything about why I have it or how it fits into a larger picture. I could be using it to say "this is my project's license", or as part of a wider project that compares different licenses, or maybe it's part of my test fixtures. This is context that's important to the collection of data, and it must sit outside of how you identify a single piece of a large content-addressed bundle of data. Typically you bring your context with you as you navigate down to individual pieces - if my GPL file is part of my test fixtures then the context would likely come via some directory structure that I've built to contain it, but I can't fit that directory structure into a CID, it has to come before the CID and the application that is consuming my collection of blobs is building the context as it goes. I hope that explanation helps get toward the heart of the objection here. The codec field for a CID should tell you how to extract the usable data from raw bytes. Regarding your comment:
The SWH recognises this at https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#git-compatibility, but does point out that snapshots are something different. Do you know anything about the binary format of the objects that are being hashed? There may be a case for adding a codec for this missing one if it's something that can't be covered by an existing codec (i.e. new code would have to be written to decode such an object, not just re-using existing code, like for Otherwise, wrapping these SWH identifiers in SHA-1 multihash and |
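For readers following along, the two steps described above (git's blob hashing, then wrapping the digest in a SHA-1 multihash and a CID) can be sketched in Python roughly as follows. This is an illustration rather than anyone's actual tooling: it assumes the `git-raw` codec (multicodec code `0x78`), the SHA-1 multihash code `0x11`, and a CIDv1 rendered in lowercase base32. The function names are made up for the example.

```python
import base64
import hashlib

GIT_RAW_CODEC = 0x78   # multicodec code for git-raw
SHA1_MULTIHASH = 0x11  # multihash code for sha1


def git_blob_digest(content: bytes) -> bytes:
    """Hash content the way git hashes a blob: sha1 over header + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).digest()


def git_raw_cid(digest: bytes) -> str:
    """Wrap a 20-byte SHA-1 digest in a multihash, then in a CIDv1."""
    if len(digest) != 20:
        raise ValueError("expected a 20-byte SHA-1 digest")
    multihash = bytes([SHA1_MULTIHASH, len(digest)]) + digest
    # Both values fit in one byte, so each varint is a single byte here.
    cid = bytes([0x01, GIT_RAW_CODEC]) + multihash
    # Multibase prefix "b" means lowercase base32 without padding.
    return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")
```

The header prefix is why a plain SHA-1 of the file differs from the `sha1_git` value in the SWH URL: git hashes the type and length along with the content.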
Quick nonfinal thoughts:
So what's the balance, in this case?
|
I had another closer look at the SWHIDs. At the section about how to compute the identifiers it reads:
This is exactly what CIDs are about. CIDs are computed from the object itself and contain just enough information to make it interpretable. Ideally objects are self-describing and if I understand things correctly all of your objects are, else you couldn't compute your identifier. If you use the It's similar to what IPFS is doing with the filesystem (UnixFSv1) implementation. UnixFSv1 contains information about files and directories (like your "directories", "contents", etc). This specific information is then wrapped in a more general purpose container called DAG-PB, where you use Git Objects instead. To the outside Even the |
But if git officially decides to call anything else a snapshot, then IPLD is in a very unfortunate situation. |
Indeed, and I am sympathetic: a multicodec like "prime number" would be a classic violator here, because validating at codec time would be quite expensive. Even worse would be "valid RSA public key" as a codec. These are types that clearly violate that maxim and don't belong as codecs. Now, I am very sympathetic that it feels a little icky to be adding these "subcodecs" when all the extra processing could be done out-of-band. And if SWH were in the process of designing SWHID v1 right now, I too might ask them "do we really need this information in the reference when we already have a nice tag in the referee?" But at least we can cheaply incorporate the distinction here into the codec: instead of dispatching on that tag, just require it to be an exact match and take that branch. This makes it a far less bad offender than "prime number" and "valid RSA public key": they are types, but types with a clear and efficient interpretation for a codec. Also, precisely because it overlaps with git-raw, it shouldn't be hard to implement as we can reuse the code that exists. And finally, let me step back a bit from "pure engineering" considerations and appeal to the real world context: As @warpfork says:
Those are exactly my thoughts, too. IPFS (and filecoin) has long done good work with major internet-accessible archives. SWH, however, is the only major archive I'm aware of which is already totally on board with the principles of content addressing, Merkle DAGs, etc. There is the possibility here not just to migrate data from the archives and store it on IPFS nodes, or even build an archive-accessing app that uses IPFS behind the scenes, but to turn the SWH archive into bona fide "super nodes" in the IPFS network that share the archive using just regular IPFS interfaces (bitswap etc.), are accessible to regular IPFS nodes, and can be easily embedded in other IPFS data. Accepting SWHIDs as a fait accompli is necessary for anything that wants to so deeply bridge with SWH, but that is also the only impediment. On every other consideration, SWH and IPFS are in perfect alignment. Maybe when it's time to make SWHIDs v2, the embedding in IPLD will be a paramount consideration from the get-go :). |
Good point. I also looked at the Git implementation again and it does indeed decode the whole thing (and not just the container) so it would make sense to have a codec for the snapshots. |
Sorry for showing up a bit late but I’m going to have to go back over a few things other people covered because I have some different recommendations. It looks like (I could be wrong since I haven’t seen the implementation) the various codec identifiers are not variations in block format (each one can be parsed with an identical parser without any out-of-band information). Each one is meant to contain some typing information so that the link itself is more meaningful. This is something we try not to do but have been forced into accepting in a few cases. To be clear, we recommend against this because it’s just not the best way to add type information to linking and we try to steer people to better alternatives, we aren’t trying to over-police the codec table. It’s hard to see when just looking at multicodecs and CIDs but it’s actually expected that context about linked data is held in a parent linking to that subtree. It’s not the case that we expect applications to be able to fully interpret that application’s meaning of a link by looking only at the link’s multicodec. There may be context about the link inferred from the multicodec and the multicodec must tell us how to parse the linked block data, but there’s often more information than this that an application will need in order to figure out how to use the link. Ideally, I’d like to see this pared down to one or two new multicodecs (not pared down to just the git multicodec as has been suggested) by putting the additional typing information (snapshot, release, revision, directory or content) somewhere else.
|
@mikeal thanks for your input. Seeing your link to this in ipld/specs#349, I left a comment there describing my understanding of the principles for multicodecs ipld/specs#349 (comment), and indeed it is ambiguous whether having this information in the multicodec passes that litmus test. To me, the real deciding factor here is not technical but social. As @warpfork mentions, SWH is "doing good stuff for good reasons" and "they have some significant and relevant adoption". With the bare minimum of multicodecs: |
To get really specific about what "first-class" interop might mean, what I'm envisioning is a modified IPFS node which knows how to relay Bitswap requests and responses to/from SWH's native interfaces. I call this node the bridge, since the requests and responses are 1-1 and the translation is stateless. Regular IPFS nodes can connect to this node (certainly by manual intervention, and hopefully eventually via the DHT if we also make the effort to populate it), and then do normal IPFS things, and accessing SWH-provided data will transparently work. I actually think the answer to @mikeal's two questions is in principle "yes": there is enough information in both the parent and child blocks alike. But Bitswap requesting is by CID: the bridge won't have access to either the parent or child blocks when it goes to translate the request. Any other way to try to get the information from the parent and child blocks to the bridge node or from the node making the ultimate request seems strictly worse to me. The codec is the one place where format-specific logic is supposed to be; everything else is supposed to be format agnostic (dag-pb backcompat notwithstanding). If off-the-shelf IPFS is to work effortlessly with the existing SWH data model and existing storage and retrieval infrastructure, enough new multicodecs to faithfully translate SWHIDs to CIDs seems the least invasive way to do it. |
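A rough Python sketch of the stateless translation step such a bridge would perform, given only the codec and digest taken from a requested CID. The typed codec names are hypothetical placeholders, and `swhids_to_query` is a made-up helper, not part of IPFS or SWH. With a typed codec the request maps to exactly one SWHID; with plain `git-raw` the bridge can only guess among the four git-compatible types.

```python
from typing import List

# Hypothetical typed codec names mapped to SWHID object types.
TYPED_CODECS = {
    "swhid-1-snp": "snp",
    "swhid-1-rel": "rel",
    "swhid-1-rev": "rev",
    "swhid-1-dir": "dir",
    "swhid-1-cnt": "cnt",
}

# SWHID types that coincide with git objects (snapshots do not).
GIT_COMPATIBLE_TYPES = ("cnt", "dir", "rev", "rel")


def swhids_to_query(codec: str, digest: bytes) -> List[str]:
    """Return the SWHIDs a bridge would have to look up for one CID."""
    hex_digest = digest.hex()
    if codec in TYPED_CODECS:
        # 1-1 translation: the codec alone pins down the object type.
        return [f"swh:1:{TYPED_CODECS[codec]}:{hex_digest}"]
    if codec == "git-raw":
        # Type unknown: one Bitswap request fans out into four lookups.
        return [f"swh:1:{t}:{hex_digest}" for t in GIT_COMPATIBLE_TYPES]
    raise ValueError(f"codec not handled by the bridge: {codec}")
```

This is the amplification mentioned earlier in the thread: without the type in the CID, the translation stops being 1-1.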
OK, so it seems to me that we're boiling down to the "content routing" problem here, would that be correct @Ericson2314? That it's not practical to just throw all of the objects in SWH into the IPFS DHT but instead the CIDs themselves should provide a hint for where to go to retrieve such objects. Or something like that. This came up recently for likecoin, discussion from here onward: #200 (comment) including @aschmahmann's excellent input from the go-ipfs side. Is content routing the primary need here? Your OP suggested a bi-directionality problem, with loss of information:
I can see this being true for snapshots, but it's not strictly true for the other types, is it because |
Routing yes, but forwarding in particular. Populating the DHT is great, and I hope to see other IPFS nodes mirror popular data, but the main thing is being able to translate bitswap to the SWH APIs with the bridge node. For that last bit, we can ignore whether someone is manually connecting to the bridge node or got there via a DHT entry.
Err it's my understanding that only small objects reside directly in the DHT, and otherwise the DHT just tells you what nodes to try dialing? In any event, I'm not against the DHT and storing things natively on regular IPFS nodes, I just want the basics to work first. It's very much analogous to using IP just for the internetwork, building momentum, and then trying to use it for the LAN too :).
Exactly! See https://docs.softwareheritage.org/devel/swh-graph/api.html and https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html for the interfaces in question.
Yes it indeed isn't strictly-needed information. I don't actually know the engineering backstory. I could forward the question if you like. |
I've taken a closer look at how the SWHIDs are generated as I see the need to be able to go from something stored in IPFS to a SWHID and back. I've read some of the docs and a bit of the source code. @Ericson2314 please let me know if I understand things correctly. Let's leave the snapshots aside for now. The other four types have a 1:1 mapping to Git objects. They are not only like Git objects, they are actual Git objects. So if you just look at the bytes of the objects, you couldn't tell whether they come from the SoftWare Heritage system or from some Git repository. Is that correct? The mapping is (I'm using the identifiers Git is using in their objects):
So given a CID, which contains the same hash that the SWHID uses, and by looking at the data (which is compressed, which makes things more complicated), one can construct a full SWHID. Is that correct? |
@vmx I think that is all correct, except for a small part about compression (I should have mentioned this earlier in the thread, but I forgot). It is my recollection from the IPFS and Nix work that git hashes the uncompressed data. Still, SWHIDs and Git are in agreement on the compression part, whatever the answer may be, so it doesn't subtract from your larger point. |
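As a sketch of the reconstruction discussed above, the following Python reads the header of an uncompressed git object, maps the git type to a SWHID type, and hashes the raw bytes. The type mapping reflects my reading of the SWHID git-compatibility notes and should be treated as an assumption, as should the choice to hash the uncompressed bytes (per the recollection above). `swhid_from_git_object` is a made-up name.

```python
import hashlib

# Assumed git object type -> SWHID type mapping (snapshots excluded).
GIT_TO_SWHID_TYPE = {"blob": "cnt", "tree": "dir", "commit": "rev", "tag": "rel"}


def swhid_from_git_object(raw: bytes) -> str:
    """Build a core SWHID from an uncompressed git object (header + body)."""
    header, nul, body = raw.partition(b"\x00")
    if not nul:
        raise ValueError("missing NUL after git object header")
    kind, _, size = header.partition(b" ")
    git_type = kind.decode("ascii", errors="replace")
    if git_type not in GIT_TO_SWHID_TYPE:
        raise ValueError(f"unknown git object type: {git_type!r}")
    if int(size.decode("ascii")) != len(body):
        raise ValueError("declared size does not match body length")
    # The digest covers the header too, so it matches git's object id.
    digest = hashlib.sha1(raw).hexdigest()
    return f"swh:1:{GIT_TO_SWHID_TYPE[git_type]}:{digest}"
```

For example, `swhid_from_git_object(b"blob 6\x00hello\n")` yields a `swh:1:cnt:` identifier whose hash is the git blob id of that content.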
(Also "git tag hashes" are very hard to find information about. I'm not sure what's up with that.) |
I don't understand what you are referring to :) |
Oh I just mean I had never heard of tag hashes before reading about them in the SWH docs, and I have a hard time finding hashes associated with tags in the wild that aren't just that of the commit being tagged. This is something I should ask SWH about, but wanted to mention it here in case someone else is as confused as I am. |
@Ericson2314 here is one well-known use + explanation |
Ironically GitHub lists it, but cannot render it |
I see https://git-scm.com/docs/git-tag. I guess I was confusing lightweight and annotated tags. Thanks @ribasushi. |
@Ericson2314 This PR has been stale for a while, so I want to make sure you're not blocked. As the discussion at #204 shows, there are still things to figure out: what exactly qualifies as a codec, and whether your use case should fall into that or be its own separate thing. The only thing everyone in this discussion currently agrees on is having a codec for SWH Snapshots. Can you move forward with the developments you've planned when we only add that codec and use |
Thanks @Ericson2314 for being so patient and working together on that. I really hope we find a good solution for maximum interoperability for all systems involved. |
I see that @Ericson2314 has tersely summarized the reason to include the additional four codes in the message of the most recently rebased commit pushed here. I see no reason to oppose the use of these code numbers. Does anyone want to object to merging these? |
For me, I do not see a reason to oppose merging all of these codes. While we have had some interesting discussion of the subtleties here, ultimately, this range does not seem overly precious to me, and if there is code and community that will use them, then I support enabling that.
I suppose now that there is the draft vs permanent distinction, there is the option of initially adding all 5 but down the road making snapshots permanent before the other 4. |
I'm still objecting to it, as mentioned in this discussion and also at #204 (comment). I think only one codec should be added (the one that doesn't have a corresponding one in git) and the other ones should use the git-raw codec for maximum interoperability. I still think the problem SoftWare Heritage has is a real one, but it shouldn't be solved by a new IPLD codec.
Sorry for the last comment. I wasn't fully concentrated and thought we were about to merge this PR, but the e-mail I got was about #207, which I already approved and is totally fine. Sorry for the noise :) |
And even more confusion, so I'd prefer merging #207, but not merging this one as mentioned above. |
Just doing #207 unblocks us so that's fine for now. |
I wrote down some additional thoughts that I've been having regarding this (and related) topics but decided to post them in a gist because they're not short and it's maybe best not to clog this thread up: https://gist.github.com/rvagg/1b34ca32e572896ad0e56707c9cfe289 I think in there I might be suggesting one or two ways ahead with this PR:
|
There seems to be some misunderstanding that this is strictly about which code ranges are valuable vs not; it's not. If it were, then people would just suggest a higher range for these values. I too think that this is not a good idea. As @rvagg mentioned earlier, we get these requests to basically embed content routing hints in the CIDs every so often and generally they're not a good idea (e.g. #200 (comment)). I also added some comments to the more general "what is an IPLD codec" thread #204 (comment). The TLDR there is that these location addressing/hinting codec proposals generally plan to integrate with go-ipfs by leveraging quirks of how Bitswap works today that may not continue to be there once we fix some outstanding issues. Saying yes here potentially sets up the proposer for failure by encouraging them to build a system that there is no compatibility guarantee for, due to them abusing the codec and CIDv1 in general. My major objection here is basically due to wanting to set up SWHIDs' interoperability with go-ipfs, and the rest of the IPLD ecosystem, for success, and to require a relatively low software maintenance burden from the contributors, without adding new requirements to base elements of the stack. It sounds like for the time being #207 should be enough to move things along. If we run into more problems we can revisit whether it's worth doing, especially given the unsupported nature of the proposal. |
See https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html for the technical details of that spec. These remaining four sorts of objects coincide with `git-raw`, which is already in this table. However, adding them here is not redundant, because of the deserialization direction. While a git-raw CID may legally point to any sort of git object, the relevant SWHIDs specify in the identifier what sort of git object is expected, and if it points to a different type of valid git object, deserialization fails. That means there is no way to losslessly convert a git-coinciding SWHID into a git-raw CID.
Software Heritage has done a superb job promoting content addressing in general, and their identifier scheme (SWHIDs, for short) in particular. By supporting them in CIDs / IPLD, I hope the IPFS ecosystem can align itself with that effort.
Per the linked documentation, SWHIDs have their own nested grammar and versioning scheme. I have taken the version 1 core identifier grammar, unrolled it, and replaced `:` with `-` per the guidelines on separators, with the result being these 5 rows.
Also note that some of those schemes coincide with certain forms of `git-raw`, already in this table. However, adding them here is not redundant, because of the deserialization direction. While a git-raw CID may legally point to any sort of git object, the relevant SWHIDs specify in the identifier what sort of git object is expected, and if it points to a different type of valid git object, deserialization fails. That means there is no way to losslessly convert a git-coinciding SWHID into a git-raw CID.
Contains #207.