diff --git a/docs/compare_collections.md b/docs/compare_collections.md index 778ed39..9322c45 100644 --- a/docs/compare_collections.md +++ b/docs/compare_collections.md @@ -2,8 +2,8 @@ ## Use case -- You have a local sequence collection, and an identifier for a collection in a server. You want to compare the two to see if they have the same coordinate system. -- You have two identifiers for collections you know are stored by a server. You want to compare them. +- You have a local sequence collection, and a digest for a collection in a server. You want to compare the two to see if they have the same coordinate system. +- You have two digests for collections you know are stored by a server. You want to compare them. ## How to do it @@ -19,7 +19,7 @@ Therefore, we must be able to identify that two sequence collections are identic This comparison can easily be done by simply comparing the seqcol digest, you don't need the `/comparison` endpoint. **Two collections will have the same digest if they are identical in content and order for all `inherent` attributes.** Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute. -If you have a local sequence collection, and an identifier, then you can compare them for strict identity by computing the identifier for the local collection and seeing if they match. +If you have a local sequence collection, and a digest, then you can compare them for strict identity by computing the digest for the local collection and seeing if they match. ### Order-relaxed identity diff --git a/docs/decision_record.md b/docs/decision_record.md index 8fddd1b..b49bf2e 100644 --- a/docs/decision_record.md +++ b/docs/decision_record.md @@ -4,11 +4,11 @@ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. 
-## Contents: +## Contents [TOC] -## 2024-02-21 Schema definition for the sequence collection attributes and process for adding new attributes +## 2024-02-21 We will specify core sequence collection attributes and a process for adding new ones ### Decision @@ -169,6 +169,19 @@ We distinguished between two types of metadata: - - +## 2023-07-12 - Required attributes are: lengths and names + +### Decision + +A sequence collection consists of a set of arrays. The only arrays that MUST be included for a valid sequence collection are *lengths* and *names*. All other possible arrays, including *sequences* and other controlled vocabulary arrays, are not required. + +### Rationale + +Debate around what should be mandatory centered on 3 specific attributes: sequences, names, and lengths: + +At first, it feels like sequences are a fundamental component of a sequence collection, and therefore, the *sequences* array should be mandatory, and names and lengths may be superfluous. For reference genomes, for example, it's clear that collections of sequences are the main function of sequence collections. However, analysis of reference genome data also includes many analyses for which the sequences themselves do not matter, and the critical component is simply the name and length of the sequence. An array of names and lengths can be thought of as a *coordinate system*, and we have realized that the sequence collection specification is *also* extremely useful for representing and uniquely identifying coordinate systems. From this perspective, we envision a coordinate system as a sequence collection in which the actual sequence content is irrelevant, but in which the lengths and names of the sequences are critical. Analysis of coordinate systems like this is very frequent. For example, any sort of annotation analysis looking at genomic regions will rely on the lengths of the sequences to enforce that coordinates refer to the same thing, but do not rely on the underlying sequences. 
This is why "chrom-sizes" files are used so frequently (*e.g.* across many UCSC tools). + +This leads us to the conclusion that *sequences* should be optional, and *names* and *lengths* should be the only mandatory component. *Lengths* makes sense because if you have a sequence, you can always compute its length, but if you don't have a sequence (all you have is a coordinate system), you may only have a length. We debated extensively whether *names* should be mandatory, and in the end, decided that it's unlikely to pose much of a difficulty to make it mandatory, and provides a lot of convenience. If sequences lack names altogether, it is trivial to name them by index of the order of the sequences. We reason that downstream use cases are very likely to require at least *some* type of identifier to refer to each of the sequences, even if it's just the index of the sequence in the list. While it may be possible to imagine a use case where an identifier for each sequence is not required, it's not difficult at all to just assign indexes. By making it required, we ensure that implementations will always have the same possible way to reference the sequences in the collection. ## 2023-07-12 Implementations SHOULD provide sorted_name_length_pairs and comparison endpoint @@ -201,11 +214,11 @@ We distinguished between two types of metadata: - -## 2023-06-14 - Internal identifiers SHOULD NOT be prefixed +## 2023-06-14 - Internal digests SHOULD NOT be prefixed ### Background -In some situations, identifiers are prefixed. For example, these may be CURIEs, which specify namespaces or provide other information about what the identifier represents. This raises questions about when and where we should expect or use prefixes. This has to be determined because including prefixes in the content that gets digested changes it, so we have to be consistent. +In some situations, digests are prefixed. 
For example, these may be CURIEs, which specify namespaces or provide other information about what the digest represents. This raises questions about when and where we should expect or use prefixes. This has to be determined because including prefixes in the content that gets digested changes it, so we have to be consistent. ### Decision @@ -302,7 +315,6 @@ Thus, we introduce the idea of *inherent* vs *non-inherent attributes*. Inherent We considered using `extrinsic` to define the opposite of `inherent`, which would change it so that attributes were inherent by default; but we decided we liked the explicitness of forcing the schema to specify which attributes are to be included in the digest, because this brings clarity over the alternative, which is to assume everything is included unless it's excluded. We also liked that this makes the `inherent` keyword behave similarly to the `required` keyword in JSON-schema; if left off, we assume nothing is required. This means that in order for a seqcol schema to be valid, it must have at least one inherent attribute specified. - ## 2023-02-08 - Array names SHOULD be ASCII ### Decision @@ -442,7 +454,9 @@ It also future-proofs the serialisation method if we ever allow complex object t The JSON canonical serialisation defined in RFC-8785 has a limited set of reference implementation. It is possible that its implementation makes sequence collection implementation more difficult in languages where the RFC is not implemented. In this cases it is valuable to note that the current specification of Sequence Collection do not require that all the features of RFC-8785 be implemented. +### Alternatives considered +We spent a huge amount of time discussing approaches for what essentially amounts to a custom standard for creating the string-to-digest. A lot of this revolved around what delimiters to use. We made a lot of progress there and came up with some really interesting encoding schemas, which had many desirable characteristics. 
However, ultimately we decided that the value derived from using a third-party standard would trump the elegance, efficiency, and other benefits we received from our custom encoding schema. In particular, adopting the standard would make developers more likely to be able to rely on third-party implementations, reducing the burden to implement our standard. Also, this standard accommodates other sources that we had struggled with a bit, such as UTF-encoding. ## 2022-10-05 - Terminology decisions @@ -677,8 +691,6 @@ We need a formal definition of a sequence collection. The schema provides a mach - - - ## 2021-12-01 - Endpoint names and structure ### Decision @@ -725,6 +737,41 @@ For the `POST comparison` endpoint, we made 2 limitations to simplify the implem - [https://github.com/ga4gh/seqcol-spec/issues/21](https://github.com/ga4gh/seqcol-spec/issues/21) - [https://github.com/ga4gh/seqcol-spec/issues/23](https://github.com/ga4gh/seqcol-spec/issues/23) +## 2021-09-21 - Order will be recognized by digesting arrays in the given order, and unordered digests will be handled as extensions through additional attributes + +### Decision + +The final sequence collection digests will reflect the order by digesting the arrays in the order provided. We will employ no additional 'order' array, and no additional unordered digests *in the string-to-digest*. Any additional attributes designed to handle questions with order, such as `sorted_name_length_pairs`, will not contribute to the digest. Thus, to determine whether two sequence collections differ only in order will require either 1. using the comparison API; or 2. implementing additional functionality via digests outside the inherent attributes. + +### Rationale + +Our earlier decision determined that order *must* be reflected in the sequence digests, but did not determine the way to ensure that. After months of debate we came up with 3 competing ideas that could do this: + +A. Digest arrays in given order. + +B. 
Reorder all given arrays according to a single canonical order, and encode order in a separate 'order' array that provides an index into the canonically ordered arrays. + +C. Reorder each given array individually, and then provide a separate 'order_ATTR' array as an index for each array. + +D. Store each array in both ordered and unordered form. + +After lots of initial enthusiasm for option B, we determined that it fails to deliver on the promise of staying invariant when order changes, because if there is a change in any array on which the canonical order is based, this changes the canonical ordering, which in turn changes all the array digests. So these 'unordered' (or canonically ordered) digests are in fact not fit for their main purpose. We therefore agreed to discard this option. + +While options C/D skirt this issue by having a separate order for each array, so that changes in one array do not affect the digest of another, they add significant complexity as everything needs to be stored twice. + +To conclude, option A seems simple and straightforward, and suffices for a basic implementation. We thus defer the question of determining whether two sequence collections differ only in order to the comparison API, or to some other future way to do it that will not affect the actual digests (*e.g.* the 'sorted_name_length_pairs' attribute). + +### Linked issues + +- https://github.com/ga4gh/seqcol-spec/issues/5 + +### Known limitations + +For use cases that require determination of whether two sequence collections differ only in element order, option A will not provide an answer based on digest comparison alone. Instead, the query will be required to use the comparison API, which means retrieving the contents of the array to compare them. 
+ +Therefore, to answer this 'order-equivalence' question will require a bit more work than if unordered digests were available; however, this functionality can be easily implemented on top of the basic functionality in a number of ways, which we are continuing to consider. + + ## 2021-08-25 - Sequence collection digests will reflect sequence order ### Decision diff --git a/docs/digest_from_collection.md b/docs/digest_from_collection.md index 87491b5..1c83ef1 100644 --- a/docs/digest_from_collection.md +++ b/docs/digest_from_collection.md @@ -3,8 +3,7 @@ ## Use case - -One of the most common uses of the seqcol specification is to compute a standard, universal identifier for a particular sequence collection. You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol identifier. There are two ways to approach this: 1. Using an existing implementation; 2. Implement the seqcol digest algorithm yourself (it's not that hard). +One of the most common uses of the seqcol specification is to compute a standard, universal digest for a particular sequence collection. You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol digest. There are two ways to approach this: 1. Using an existing implementation; 2. Implement the seqcol digest algorithm yourself (it's not that hard). ## 1. Using existing implementations diff --git a/docs/sequences_from_digest.md b/docs/sequences_from_digest.md index 182080b..35690fc 100644 --- a/docs/sequences_from_digest.md +++ b/docs/sequences_from_digest.md @@ -3,17 +3,17 @@ ## Use case -You have a seqcol digest, and you'd like to retrieve the underlying sequence identifiers, or sequences themselves. +You have a seqcol digest, and you'd like to retrieve the underlying sequence digests, or sequences themselves. ## How to do it To look up the contents of a digest will require a seqcol service that stores the collection in a database. 
-### 1. Retrieving the sequence identifiers +### 1. Retrieving the sequence digests -You can retrieve the canonical seqcol representation by hitting the `/collection/:digest` endpoint, where `:digest` should be changed to the digest in question. If all you need is sequence identifiers, then you're done. +You can retrieve the canonical seqcol representation by hitting the `/collection/:digest` endpoint, where `:digest` should be changed to the digest in question. If all you need is sequence digests, then you're done. ### 2. Retrieving underlying sequences -If you need sequences, then you'll also need a [refget](http://samtools.github.io/hts-specs/refget.html) server. Sequence collection services don't necessarily store sequences themselves; this task is typically outsource to a refget server. The seqcol server simply stores the group information, and metadata accompanying the sequences. Therefore, to retrieve the underlying sequences, you can first retrieve the sequence identifiers, and then use these identifiers to query a refget service. +If you need sequences, then you'll also need a [refget](http://samtools.github.io/hts-specs/refget.html) server. Sequence collection services don't necessarily store sequences themselves; this task is typically outsourced to a refget server. The seqcol server simply stores the group information, and metadata accompanying the sequences. Therefore, to retrieve the underlying sequences, you can first retrieve the sequence digests, and then use these digests to query a refget service. diff --git a/docs/simple_example.md b/docs/simple_example.md deleted file mode 100644 index dbf5deb..0000000 --- a/docs/simple_example.md +++ /dev/null @@ -1,4 +0,0 @@ - -# How do I use Seqcol? A simple example - -I put together a demo server. This is just a proof of concept draft: [seqcolapi.databio.org](http://seqcolapi.databio.org/). 
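The two-step retrieval described in `docs/sequences_from_digest.md` can be sketched roughly as follows. This is a minimal illustration, not a reference client: the base URLs are hypothetical, `fetch_text` is a hypothetical injected callable standing in for an HTTP GET, and the `/sequence/:digest` path on the refget side is an assumption about the refget sequences API rather than something stated in these docs.

```python
import json
from typing import Callable, List


def collection_url(seqcol_base: str, digest: str) -> str:
    # Path taken from the docs: /collection/:digest
    return f"{seqcol_base.rstrip('/')}/collection/{digest}"


def refget_sequence_url(refget_base: str, seq_digest: str) -> str:
    # Assumed refget sequences path; check your refget server's documentation.
    return f"{refget_base.rstrip('/')}/sequence/{seq_digest}"


def sequences_for_collection(
    seqcol_base: str,
    refget_base: str,
    digest: str,
    fetch_text: Callable[[str], str],
) -> List[str]:
    """Step 1: fetch the seqcol and read its sequence digests.
    Step 2: resolve each sequence digest against a refget service."""
    seqcol = json.loads(fetch_text(collection_url(seqcol_base, digest)))
    # `sequences` is optional in a seqcol, so fall back to an empty list.
    seq_digests = seqcol.get("sequences", [])
    return [fetch_text(refget_sequence_url(refget_base, d)) for d in seq_digests]
```

Passing the fetch function in keeps the sketch independent of any particular HTTP library; in practice it could wrap whatever client you already use.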
diff --git a/docs/specification.md b/docs/specification.md index 2f596b4..81e5f8b 100644 --- a/docs/specification.md +++ b/docs/specification.md @@ -26,21 +26,21 @@ A common example and primary use case of sequence collections is for reference g In brief, the project specifies several procedures: -1. **An algorithm for encoding sequence collection identifiers.** The GA4GH standard [refget](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol identifiers are defined by a hash algorithm, rather than an accession authority, and are thus decentralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance. -2. **An API describing lookup and comparison of sequence collections.** Seqcol specifies a RESTful API to retrieve the sequence collection given an identifier. A main use case is to reproduce the exact sequence collection (*e.g.* reference genome) used for analysis, instead of guessing based on a human-readable identifier. Seqcol also provides a standardized method of comparing the contents of two sequence collections. This comparison function can *e.g.* be used to determine if analysis results based on different references genomes are compatible. +1. **An algorithm for encoding sequence collection identifiers.** The GA4GH standard [refget sequences](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget sequence identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. 
Seqcol digests are defined by a hash algorithm, rather than an accession authority, and are thus decentralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance. +2. **An API describing lookup and comparison of sequence collections.** Seqcol specifies a RESTful API to retrieve the sequence collection given a digest. A main use case is to reproduce the exact sequence collection (*e.g.* reference genome) used for analysis, instead of guessing based on a human-readable identifier. Seqcol also provides a standardized method of comparing the contents of two sequence collections. This comparison function can *e.g.* be used to determine if analysis results based on different reference genomes are compatible. 3. **Recommended ancillary, non-inherent attributes.** Finally, the protocol defines several recommended procedures that will improve the compatibility across Seqcol servers, and beyond. ## Use cases Sequence collections represents fundamental concepts; therefore the specification can be used for many downstream use cases. -For example, we envision that seqcol identifiers could replace or live alongside the human-readable identifiers currently used to identify reference genomes (*e.g.* "hg38" or "GRCh38"). This would provide improved reproducibility. +For example, we envision that seqcol digests could replace or live alongside the human-readable identifiers currently used to identify reference genomes (*e.g.* "hg38" or "GRCh38"). This would provide improved reproducibility. Some other examples of common use cases where the use of seqcol is beneficial include: -1. Given a collection identifier, retrieve the list of refget sequence identifiers for the contained sequences. -2. Given a collection identifier, retrieve the contained sequences. -3. Given two collection identifiers, determine if downstream results are compatible. -4. 
Given a collection identifier, retrieve metadata about the collection. This may include human-readable aliases, author of the collection, links to other collections, or other metadata. -5. Given a sequence collection, compute its identifier. +1. Given a collection digest, retrieve the list of refget sequence identifiers for the contained sequences. +2. Given a collection digest, retrieve the contained sequences. +3. Given two collection digests, determine if downstream results are compatible. +4. Given a collection digest, retrieve metadata about the collection. This may include human-readable aliases, author of the collection, links to other collections, or other metadata. +5. Given a sequence collection, compute its digest. @@ -50,27 +50,27 @@ Some other examples of common use cases where the use of seqcol is beneficial in - **Array**: An ordered list of elements. - **Collated**: A qualifier applied to a seqcol attribute indicating that the values of the attribute matches 1-to-1 with the sequences in the collection and are represented in the same order. - **Coordinate system**: An ordered list of named sequence lengths, but without actual sequences. -- **Digest**: An identifier resulting from a cryptographic hash function, such as `MD5` or `SHA512`, on input data. +- **Digest**: A string resulting from a cryptographic hash function, such as `MD5` or `SHA512`, on input data. - **Inherent**: A qualifier applied to a seqcol attribute indicating that the attribute is part of the definition of the sequence collection and therefore contributes to its digest. - **Length**: The number of characters in a sequence. - **Seqcol algorithm**: The set of instructions used to compute a digest from a sequence collection. - **Seqcol API**: The set of endpoints defined in the *retrieval* and *comparison* components of the seqcol protocol. - **Seqcol digest**: A digest for a sequence collection, computed according to the seqcol algorithm. 
- **Seqcol protocol**: Collectively, the 3 operations outlined in this document, which include: 1. encoding of sequence collections; 2. API describing retrieval and comparison ; and 3. specifications for ancillary recommended attributes. -- **Sequence**: Seqcol uses refget to store actual sequences, so we generally use the term in the same way as refget. Refget was designed for nucleotide sequences; however, other sequences could be provided via the same mechanism, *e.g.*, cDNA, CDS, mRNA or proteins. Essentially any ordered list of refget-valid characters qualifies. Sequence collections also goes further, since sequence collections may contain sequences of non-specified characters, which therefore have a length but no actual sequence content. -- **Sequence digest** or **refget digest**: A digest for a sequence, computed according to the refget protocol. +- **Sequence**: Seqcol uses refget sequences to identify actual sequences, so we generally use the term "sequence" in the same way. Refget sequences was designed for nucleotide sequences; however, other sequences could be provided via the same mechanism, *e.g.*, cDNA, CDS, mRNA or proteins. Essentially any ordered list of refget-sequences-valid characters qualifies. Sequence collections also goes further, since sequence collections may contain sequences of non-specified characters, which therefore have a length but no actual sequence content. +- **Sequence digest** or **refget sequence digest**: A digest for a sequence, computed according to the refget sequence protocol. - **Sequence collection**: A representation of 1 or more sequences that is structured according to the sequence collection schema - **Sequence collection attribute**: A property or feature of a sequence collection (*e.g.* names, lengths, sequences, or topologies). ## Seqcol protocol functionality -The seqcol algorithm is based on the refget algorithm for individual sequences and should use refget servers to store the actual sequence data. 
-Seqcol servers therefore provide a lightweight organizational layer on top of refget servers. +The seqcol algorithm is based on the refget sequence algorithm for individual sequences and should use refget sequence servers to store the actual sequence data. +Seqcol servers therefore provide a lightweight organizational layer on top of refget sequence servers. To be fully compliant with the seqcol protocol an implementation must provide all `REQUIRED` capabilities as detailed below. The seqcol protocol defines the following: -1. *Encoding* - An algorithm for computing an identifier given a collection of sequences. +1. *Encoding* - An algorithm for computing a digest given a collection of sequences. 2. *API* - A server RESTful API specification for retrieving and comparing sequence collections. 3. *Ancillary attribute management* - An optional specification for organizing non-inherent metadata as part of a sequence collection. @@ -107,7 +107,7 @@ properties: names: type: array collated: true - description: "Human-readable identifiers of each sequence (chromosome names)." + description: "Human-readable labels of each sequence (chromosome names)." items: type: string sequences: @@ -115,7 +115,7 @@ properties: collated: true items: type: string - description: "Refget v2 identifiers of sequences." + description: "Refget sequences v2 identifiers for sequences." required: - names - lengths @@ -161,8 +161,8 @@ Finally, another detail that may be unintuitive at first is that the `sequences` ##### Filter non-inherent attributes The `inherent` section in the seqcol schema is an extension of the basic JSON Schema format that adds specific functionality. -Inherent attributes are those that contribute to the identifier; *non-inherent* attributes are not considered when computing the top-level digest. -Attributes of a seqcol that are *not* listed as `inherent` `MUST NOT` contribute to the identifier; they are therefore excluded from the digest calculation. 
+Inherent attributes are those that contribute to the digest; *non-inherent* attributes are not considered when computing the top-level digest. +Attributes of a seqcol that are *not* listed as `inherent` `MUST NOT` contribute to the digest; they are therefore excluded from the digest calculation. Therefore, if the canonical seqcol representation includes any non-inherent attributes, these must be removed before proceeding to step 2. In the simple example, there are no non-inherent attributes. @@ -189,10 +189,12 @@ b'["SQ.2648ae1bacce4ec4b6cf337dcae37816","SQ.907112d17fcb73bcab1ed1c72b97ce68"," _* The above Python function suffices if (1) attribute keys are restricted to ASCII, (2) there are no floating point values, and (3) for all integer values `i`: `-2**63 < i < 2**63`_ + Also, notice that in this process, RFC-8785 is applied only to objects; we assume the sequence digests are computed through an external process (the refget sequences protocol), and are not computed as part of the sequence collection. The refget sequences protocol digests sequence strings without JSON-canonicalization. For more details, see [*Footnote F5*](#f5-rfc-8785-does-not-apply-to-refget-sequences). + #### Step 3: Digest each canonicalized attribute value using the GA4GH digest algorithm. Apply the GA4GH digest algorithm to each attribute value. -The GA4GH digest algorithm is described in detail in [*Footnote F5*](#f5-the-ga4gh-digest-algorithm). +The GA4GH digest algorithm is described in detail in [*Footnote F6*](#f6-the-ga4gh-digest-algorithm). This converts the value of each attribute in the seqcol into a digest string. Applying this to each value will produce the following structure: @@ -216,12 +218,14 @@ b'{"lengths":"IOlarejnLTmdv3-CqehLpcxAR9yNeR1i","names":"g04lKdxiYtG3dOGeUC5AdKE #### Step 5: Digest the final canonical representation again using the GA4GH digest algorithm. 
Again using the same approach as in step 3, we now apply the GA4GH digest algorithm to the canonicalized bytestring. -The result is the final unique identifier for this sequence collection: +The result is the final unique digest for this sequence collection: ``` wqet7IWbw2j2lmGuoKCaFlYS_R7szczz ``` + + --- ### 2. API: A server RESTful API specification for retrieving and comparing sequence collections. @@ -362,7 +366,7 @@ In addition to the primary top-level endpoints, it is RECOMMENDED that the servi In *Section 1: Encoding*, we distinguished between *inherent* and *non-inherent* attributes. Non-inherent attributes provide a standardized way for implementations to store and serve additional, third-party attributes that do not contribute to the digest. -As long as separate implementations keep such information in non-inherent attributes, the identifiers will remain compatible. +As long as separate implementations keep such information in non-inherent attributes, the digests will remain compatible. Furthermore, the structure for how such non-inherent metadata is retrieved will be standardized. Here, we specify standardized, useful non-inherent attributes that we recommend. @@ -370,10 +374,10 @@ Here, we specify standardized, useful non-inherent attributes that we recommend. The `sorted_name_length_pairs` attribute is a *non-inherent* attribute of a sequence collection with a formal definition, provided here. It is `RECOMMENDED` that all seqcol implementations add this attribute to all sequence collections. -When digested, this attribute provides an identifier for an order-invariant coordinate system for a sequence collection. +When digested, this attribute provides a digest for an order-invariant coordinate system for a sequence collection. Because it is *non-inherent*, it does not affect the identity (digest) of the collection. 
It is created deterministically from the `names` and `lengths` attributes in the collection; it *does not* depend on the actual sequence content, so it is consistent across two collections with different sequence content if they have the same `names` and `lengths`, which are correctly collated, but with pairs not necessarily in the same order. -For rationale and use cases of `sorted_name_length_pairs`, see [*Footnote F6*](#f6-use-cases-for-the-sortednamelengthpairs-non-inherent-attribute). +For rationale and use cases of `sorted_name_length_pairs`, see [*Footnote F7*](#f7-use-cases-for-the-sorted_name_length_pairs-non-inherent-attribute). Algorithm: @@ -438,8 +442,8 @@ In contrast, the idea of `collated` describes a property independently: Whether The specification in section 1, *Encoding*, described how to structure a sequence collection and then apply an algorithm to compute a digest for it. What if you have ancillary information that goes with a collection, but shouldn't contribute to the digest? We have found a lot of useful use cases for information that should go along with a seqcol, but should not contribute to the *identity* of that seqcol. -This is a useful construct as it allows us to include information in a collection that does not affect the identifier that is computed for that collection. -One simple example is the "author" or "uploader" of a reference sequence; this is useful information to store alongside this collection, but we wouldn't want the same collection with two different authors to have a different identifier! Seqcol refers to these as *non-inherent attributes*, meaning they are not part of the core identity of the sequence collection. +This is a useful construct as it allows us to include information in a collection that does not affect the digest that is computed for that collection. 
+One simple example is the "author" or "uploader" of a reference sequence; this is useful information to store alongside this collection, but we wouldn't want the same collection with two different authors to have a different digest! Seqcol refers to these as *non-inherent attributes*, meaning they are not part of the core identity of the sequence collection. Non-inherent attributes are defined in the seqcol schema, but excluded from the `inherent` list. See: [ADR on 2023-03-22 regarding inherent attributes](/decision_record/#2023-03-22-seqcol-schemas-must-specify-inherent-attributes) @@ -459,8 +463,12 @@ Thus, actual sequence content is optional for sequence collections. We still think it's correct to refer to a sequence-content-less sequence collection as a "sequence collection" -- because it is still an abstract concept that *is* representing a collection of sequences: we know their names, and their lengths, we just don't care about the actual characters in the sequence in this case. Thus, we can think of these as a sequence collection without sequence characters. +### F5. RFC-8785 does not apply to refget sequences -### F5. The GA4GH digest algorithm +A note to clarify potential confusion with RFC-8785. While the sequence collection specification determines that RFC-8785 will be used to canonicalize the JSON before digesting, this is specific to sequence collections; it *does not apply to the original refget sequences protocol*. According to the sequences protocol, sequences are digested as un-quoted strings. If RFC-8785 were applied at the level of individual sequences, they would be quoted to become valid JSON, which would change the digest. Since the sequences protocol predated the sequence collections protocol, it did not use RFC-8785; and anyway, the sequences are just primitive types so a canonicalization scheme doesn't add anything. 
This leads to the slight confusion that RFC-8785 canonicalization is only applied to the objects in the sequence collections, and not to the primitives when the underlying sequences are digested. + + +### F6. The GA4GH digest algorithm The GA4GH digest algorithm, `sha512t24u`, was created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html). This procedure is described as ([Hart _et al_. 2020](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883)): @@ -485,7 +493,7 @@ def sha512t24u_digest(seq: bytes) -> str: See: [ADR from 2023-01-25 on digest algorithm](/decision_record/#2023-01-25-digest-algorithm) -### F6. Use cases for the `sorted_name_length_pairs` non-inherent attribute +### F7. Use cases for the `sorted_name_length_pairs` non-inherent attribute One motivation for this attribute comes from genome browsers, which may display genomic loci of interest (*e.g.* BED files). The genome browser should only show BED files if they annotate the same coordinate system as the reference genome. @@ -501,3 +509,4 @@ In practice, this list will be short. Thus, in a production setting, the full compatibility check can be reduced to a lookup into a short, pre-generated list of `sorted_name_length_pairs` digests. See: [ADR from 2023-07-12 on sorted name-length pairs](/decision_record/#2023-07-12-implementations-should-provide-sorted_name_length_pairs-and-comparison-endpoint) +
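The encoding steps touched by this change (filter non-inherent attributes, canonicalize, digest each attribute, canonicalize again, digest again) and the strict-identity check from `docs/compare_collections.md` can be sketched together as below. This is a minimal sketch under stated assumptions: `canonical_bytes` is a `json.dumps` approximation of RFC-8785 that only holds under the caveats given in the specification (ASCII keys, no floating point values, integers within `-2**63 < i < 2**63`); `seqcol_digest` and `strictly_identical` are hypothetical helper names; and the `SQ.` values in the usage example are placeholders, not real refget digests.

```python
import base64
import hashlib
import json
from typing import Any, Dict, List


def canonical_bytes(item: Any) -> bytes:
    # Approximation of RFC-8785: sorted keys, no whitespace, UTF-8 output.
    return json.dumps(
        item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True
    ).encode("utf-8")


def sha512t24u_digest(seq: bytes) -> str:
    # GA4GH digest: SHA-512, truncated to 24 bytes, base64url-encoded.
    return base64.urlsafe_b64encode(hashlib.sha512(seq).digest()[:24]).decode("ascii")


def seqcol_digest(seqcol: Dict[str, Any], inherent: List[str]) -> str:
    # Step 1: drop non-inherent attributes; they must not affect the digest.
    filtered = {k: v for k, v in seqcol.items() if k in inherent}
    # Steps 2-3: canonicalize and digest each attribute value.
    level1 = {k: sha512t24u_digest(canonical_bytes(v)) for k, v in filtered.items()}
    # Steps 4-5: canonicalize the object of attribute digests and digest it.
    return sha512t24u_digest(canonical_bytes(level1))


def strictly_identical(local: Dict[str, Any], remote_digest: str, inherent: List[str]) -> bool:
    # Strict identity: same content and order for all inherent attributes.
    return seqcol_digest(local, inherent) == remote_digest


# Usage with placeholder values (the SQ.* strings are not real digests).
INHERENT = ["lengths", "names", "sequences"]
example = {
    "lengths": [8, 4],
    "names": ["chrX", "chr1"],
    "sequences": ["SQ.placeholder1", "SQ.placeholder2"],
}
example_digest = seqcol_digest(example, INHERENT)
```

Note that the sequence digests are taken as given: as footnote F5 explains, they come from the refget sequences protocol and are not re-canonicalized here.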