Merge pull request #66 from ga4gh/dev

Release of all recent changes for product submission
ga4gh · Feb 22, 2024 · a311e5b
2 parents a13d581 + 6a8c450
commit a311e5b
Showing 6 changed files with 98 additions and 47 deletions.
diff --git a/docs/compare_collections.md b/docs/compare_collections.md
@@ -2,8 +2,8 @@
 
 ## Use case
 
-- You have a local sequence collection, and an identifier for a collection in a server. You want to compare the two to see if they have the same coordinate system.
-- You have two identifiers for collections you know are stored by a server. You want to compare them.
+- You have a local sequence collection and a digest for a collection on a server. You want to compare the two to see if they have the same coordinate system.
+- You have two digests for collections you know are stored by a server. You want to compare them.
 
 ## How to do it
 
@@ -19,7 +19,7 @@ Therefore, we must be able to identify that two sequence collections are identic
 This comparison can be done by simply comparing the seqcol digests; you don't need the `/comparison` endpoint.
 **Two collections will have the same digest if they are identical in content and order for all `inherent` attributes.**
 Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute.
-If you have a local sequence collection, and an identifier, then you can compare them for strict identity by computing the identifier for the local collection and seeing if they match.
+If you have a local sequence collection and a digest, then you can compare them for strict identity by computing the digest for the local collection and seeing if they match.
 
 ### Order-relaxed identity
 

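As an illustration of the strict-identity check described in the `compare_collections.md` change above, the following sketch computes a digest for a local collection and compares it with a digest string obtained elsewhere. The `toy_digest` helper, the sample collection, and the choice of compact `json.dumps` output hashed with truncated SHA-512 are assumptions made for illustration only; they stand in for the seqcol digest algorithm, which this diff does not spell out.

```python
import base64
import hashlib
import json


def toy_digest(obj):
    """Hypothetical stand-in for the seqcol digest step: compact, key-sorted
    JSON, hashed with SHA-512, truncated to 24 bytes, base64url-encoded."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


def collection_digest(collection, inherent):
    """Digest each inherent attribute array in the order given, then digest
    the mapping of attribute name -> array digest."""
    per_attribute = {attr: toy_digest(collection[attr]) for attr in inherent}
    return toy_digest(per_attribute)


def strictly_identical(local_collection, remote_digest, inherent):
    """Strict identity: the locally computed digest equals the remote one."""
    return collection_digest(local_collection, inherent) == remote_digest


# Hypothetical sample data; a real remote digest would come from a server.
INHERENT = ["names", "lengths"]
local = {"names": ["chr1", "chr2"], "lengths": [100, 200]}
remote_digest = collection_digest(
    {"names": ["chr1", "chr2"], "lengths": [100, 200]}, INHERENT
)
changed = {"names": ["chr1", "chr2"], "lengths": [100, 201]}
```

With this sample data, `strictly_identical(local, remote_digest, INHERENT)` holds, while the `changed` collection yields a different digest because one inherent attribute differs.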
diff --git a/docs/decision_record.md b/docs/decision_record.md
@@ -4,11 +4,11 @@
 
 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
 
-## Contents: 
+## Contents
 
 [TOC]
 
-## 2024-02-21 Schema definition for the sequence collection attributes and process for adding new attributes
+## 2024-02-21 We will specify core sequence collection attributes and a process for adding new ones
 
 ### Decision
 
@@ -169,6 +169,19 @@ We distinguished between two types of metadata:
 - <https://github.com/ga4gh/seqcol-spec/issues/39>
 - <https://github.com/ga4gh/seqcol-spec/issues/40>
 
+## 2023-07-12 - Required attributes are: lengths and names
+
+### Decision
+
+A sequence collection consists of a set of arrays. The only arrays that MUST be included for a valid sequence collection are *lengths* and *names*. All other possible arrays, including *sequences* and other controlled vocabulary arrays, are not required.
+
+### Rationale
+
+Debate around what should be mandatory centered on 3 specific attributes: sequences, names, and lengths.
+
+At first, it feels like sequences are a fundamental component of a sequence collection, and therefore, the *sequences* array should be mandatory, and names and lengths may be superfluous. For reference genomes, for example, it's clear that collections of sequences are the main function of sequence collections. However, analysis of reference genome data also includes many analyses for which the sequences themselves do not matter, and the critical component is simply the name and length of the sequence. An array of names and lengths can be thought of as a *coordinate system*, and we have realized that the sequence collection specification is *also* extremely useful for representing and uniquely identifying coordinate systems. From this perspective, we envision a coordinate system as a sequence collection in which the actual sequence content is irrelevant, but in which the lengths and names of the sequences are critical. Analysis of coordinate systems like this is very frequent. For example, any sort of annotation analysis looking at genomic regions will rely on the lengths of the sequences to enforce that coordinates refer to the same thing, but do not rely on the underlying sequences. This is why "chrom-sizes" files are used so frequently (*e.g.* across many UCSC tools).
+
+This leads us to the conclusion that *sequences* should be optional, and *names* and *lengths* should be the only mandatory components. *Lengths* makes sense because if you have a sequence, you can always compute its length, but if you don't have a sequence (all you have is a coordinate system), you may only have a length. We debated extensively whether *names* should be mandatory, and in the end, decided that it's unlikely to pose much of a difficulty to make it mandatory, and provides a lot of convenience. If sequences lack names altogether, it is trivial to name them by index of the order of the sequences. We reason that downstream use cases are very likely to require at least *some* type of identifier to refer to each of the sequences, even if it's just the index of the sequence in the list. While it may be possible to imagine a use case where an identifier for each sequence is not required, it's not difficult at all to just assign indexes. By making it required, we ensure that implementations will always have the same possible way to reference the sequences in the collection.
 
 ## 2023-07-12 Implementations SHOULD provide sorted_name_length_pairs and comparison endpoint
 
@@ -201,11 +214,11 @@ We distinguished between two types of metadata:
 - <https://github.com/ga4gh/seqcol-spec/issues/40>
 
 
-## 2023-06-14 - Internal identifiers SHOULD NOT be prefixed
+## 2023-06-14 - Internal digests SHOULD NOT be prefixed
 
 ### Background
 
-In some situations, identifiers are prefixed. For example, these may be CURIEs, which specify namespaces or provide other information about what the identifier represents. This raises questions about when and where we should expect or use prefixes. This has to be determined because including prefixes in the content that gets digested changes it, so we have to be consistent.
+In some situations, digests are prefixed. For example, these may be CURIEs, which specify namespaces or provide other information about what the digest represents. This raises questions about when and where we should expect or use prefixes. This has to be determined because including prefixes in the content that gets digested changes it, so we have to be consistent.
 
 ### Decision
 
@@ -302,7 +315,6 @@ Thus, we introduce the idea of *inherent* vs *non-inherent attributes*. Inherent
 
 We considered using `extrinsic` to define the opposite of `inherent`, which would change it so that attributes were inherent by default; but we decided we liked the explicitness of forcing the schema to specify which attributes are to be included in the digest, because this brings clarity over the alternative, which is to assume everything is included unless it's excluded. We also liked that this makes the `inherent` keyword behave similarly to the `required` keyword in JSON-schema; if left off, we assume nothing is required. This means that in order for a seqcol schema to be valid, it must have at least one inherent attribute specified.
 
-
 ## 2023-02-08 - Array names SHOULD be ASCII
 
 ### Decision
@@ -442,7 +454,9 @@ It also future-proofs the serialisation method if we ever allow complex object t
 
 The JSON canonical serialisation defined in RFC-8785 has a limited set of reference implementations. This may make sequence collections more difficult to implement in languages where the RFC is not implemented. In these cases it is valuable to note that the current Sequence Collection specification does not require that all the features of RFC-8785 be implemented.
 
+### Alternatives considered
 
+We spent a huge amount of time discussing approaches for what essentially amounts to a custom standard for creating the string-to-digest. A lot of this revolved around what delimiters to use. We made a lot of progress there and came up with some really interesting encoding schemas, which had many desirable characteristics. However, ultimately we decided that the value derived from using a third-party standard would trump the elegance, efficiency, and other benefits we received from our custom encoding schema. In particular, adopting the standard would make developers more likely to be able to rely on third-party implementations, reducing the burden to implement our standard. Also, this standard accommodates other sources that we had struggled with a bit, such as UTF-encoding.
 
 ## 2022-10-05 - Terminology decisions
 
@@ -677,8 +691,6 @@ We need a formal definition of a sequence collection. The schema provides a mach
 - <https://github.com/ga4gh/seqcol-spec/issues/6>
 
 
-
-
 ## 2021-12-01 - Endpoint names and structure
 
 ### Decision
@@ -725,6 +737,41 @@ For the `POST comparison` endpoint, we made 2 limitations to simplify the implem
 - [https://github.com/ga4gh/seqcol-spec/issues/21](https://github.com/ga4gh/seqcol-spec/issues/21)
 - [https://github.com/ga4gh/seqcol-spec/issues/23](https://github.com/ga4gh/seqcol-spec/issues/23)
 
+## 2021-09-21 - Order will be recognized by digesting arrays in the given order, and unordered digests will be handled as extensions through additional attributes
+
+### Decision
+
+The final sequence collection digests will reflect the order by digesting the arrays in the order provided. We will employ no additional 'order' array, and no additional unordered digests *in the string-to-digest*. Any additional attributes designed to handle questions with order, such as `sorted_name_length_pairs`, will not contribute to the digest. Thus, determining whether two sequence collections differ only in order will require either (1) using the comparison API; or (2) implementing additional functionality via digests outside the inherent attributes.
+
+### Rationale
+
+Our earlier decision determined that order *must* be reflected in the sequence digests, but did not determine the way to ensure that. After months of debate we came up with 4 competing ideas that could do this:
+
+A. Digest arrays in given order. 
+
+B. Reorder all given arrays according to a single canonical order, and encode order in a separate 'order' array that provides an index into the canonically ordered arrays.
+
+C. Reorder each given array individually, and then provide a separate 'order_ATTR' array as an index for each array.
+
+D. Store each array in both ordered and unordered form.
+
+After lots of initial enthusiasm for option B, we determined that it fails to deliver on the promise of staying invariant when order changes, because if there is a change in any array on which the canonical order is based, this changes the canonical ordering, which in turn changes all the array digests. So these 'unordered' (or canonically ordered) digests are in fact not fit for their main purpose. We therefore agreed to discard this option.
+
+While options C/D skirt this issue by having a separate order for each array, so that changes in one array do not affect the digest of another, they add significant complexity as everything needs to be stored twice.
+
+To conclude, option A seems simple and straightforward, and suffices for a basic implementation. We thus defer the question of determining whether two sequence collections differ only in order to the comparison API, or to some other future way to do it that will not affect the actual digests (*e.g.* the 'sorted_name_length_pairs' attribute).
+
+### Linked issues
+
+- <https://github.com/ga4gh/seqcol-spec/issues/5>
+
+### Known limitations
+
+For use cases that require determination of whether two sequence collections differ only in element order, option A will not provide an answer based on digest comparison alone. Instead, the query will be required to use the compatibility API, which means retrieving the contents of the array to compare them.
+
+Therefore, answering this 'order-equivalence' question will require a bit more work than if unordered digests were available; however, this functionality can be easily implemented on top of the basic functionality in a number of ways, which we are continuing to consider.
+
+
 ## 2021-08-25 - Sequence collection digests will reflect sequence order
 
 ### Decision

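To make the 2021-09-21 decision in the `decision_record.md` change above concrete, the sketch below contrasts option A (digesting arrays in the given order) with an order-relaxed digest built from sorted name-length pairs. Everything here is an assumption for illustration: `toy_digest` is a hypothetical stand-in for the real canonicalisation and digest steps, and `sorted_pairs_digest` only conveys the idea behind an attribute such as `sorted_name_length_pairs`, not its specified construction.

```python
import hashlib
import json


def toy_digest(obj):
    # Hypothetical stand-in: compact, key-sorted JSON hashed with SHA-512.
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha512(text.encode("utf-8")).hexdigest()[:48]


def ordered_digest(names, lengths):
    # Option A: each array is digested exactly in the order provided,
    # so reordering the sequences changes the result.
    return toy_digest({"names": toy_digest(names), "lengths": toy_digest(lengths)})


def sorted_pairs_digest(names, lengths):
    # Order-relaxed: digest each (name, length) pair, sort the pair digests,
    # then digest the sorted list. The result does not depend on input order.
    pair_digests = [
        toy_digest({"name": name, "length": length})
        for name, length in zip(names, lengths)
    ]
    return toy_digest(sorted(pair_digests))


# Two hypothetical collections that differ only in element order.
names_a, lengths_a = ["chr1", "chr2"], [100, 200]
names_b, lengths_b = ["chr2", "chr1"], [200, 100]

order_sensitive_differs = ordered_digest(names_a, lengths_a) != ordered_digest(
    names_b, lengths_b
)
order_relaxed_matches = sorted_pairs_digest(names_a, lengths_a) == sorted_pairs_digest(
    names_b, lengths_b
)
```

Because the order-relaxed value is computed separately, it can sit alongside the main digest without contributing to it, which is the property the decision asks for.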
diff --git a/docs/digest_from_collection.md b/docs/digest_from_collection.md
@@ -3,8 +3,7 @@
 
 ## Use case
 
-
-One of the most common uses of the seqcol specification is to compute a standard, universal identifier for a particular sequence collection. You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol identifier. There are two ways to approach this: 1. Using an existing implementation; 2. Implement the seqcol digest algorithm yourself (it's not that hard).
+One of the most common uses of the seqcol specification is to compute a standard, universal digest for a particular sequence collection. You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol digest. There are two ways to approach this: 1. Using an existing implementation; 2. Implementing the seqcol digest algorithm yourself (it's not that hard).
 
 
 ## 1. Using existing implementations

diff --git a/docs/sequences_from_digest.md b/docs/sequences_from_digest.md
@@ -3,17 +3,17 @@
 
 ## Use case
 
-You have a seqcol digest, and you'd like to retrieve the underlying sequence identifiers, or sequences themselves.
+You have a seqcol digest, and you'd like to retrieve the underlying sequence digests, or sequences themselves.
 
 ## How to do it
 
 To look up the contents of a digest will require a seqcol service that stores the collection in a database.
 
-### 1. Retrieving the sequence identifiers
+### 1. Retrieving the sequence digests
 
-You can retrieve the canonical seqcol representation by hitting the `/collection/:digest` endpoint, where `:digest` should be changed to the digest in question. If all you need is sequence identifiers, then you're done.
+You can retrieve the canonical seqcol representation by hitting the `/collection/:digest` endpoint, where `:digest` should be changed to the digest in question. If all you need is sequence digests, then you're done.
 
 
 ### 2. Retrieving underlying sequences
 
-If you need sequences, then you'll also need a [refget](http://samtools.github.io/hts-specs/refget.html) server. Sequence collection services don't necessarily store sequences themselves; this task is typically outsource to a refget server. The seqcol server simply stores the group information, and metadata accompanying the sequences. Therefore, to retrieve the underlying sequences, you can first retrieve the sequence identifiers, and then use these identifiers to query a refget service.
+If you need sequences, then you'll also need a [refget](http://samtools.github.io/hts-specs/refget.html) server. Sequence collection services don't necessarily store sequences themselves; this task is typically outsourced to a refget server. The seqcol server simply stores the group information, and metadata accompanying the sequences. Therefore, to retrieve the underlying sequences, you can first retrieve the sequence digests, and then use these digests to query a refget service.
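The two-step lookup described in the `sequences_from_digest.md` change above can be sketched as URL construction. This is a minimal illustration under stated assumptions: the base URLs, the placeholder digests, and the shape of the sample response are hypothetical, and the `/sequence/` path on the refget side is assumed rather than taken from this diff. Only the `/collection/:digest` path comes from the text.

```python
from urllib.parse import quote


def collection_url(seqcol_base, digest):
    # Step 1: the seqcol endpoint named in the text, /collection/:digest.
    return f"{seqcol_base.rstrip('/')}/collection/{quote(digest, safe='')}"


def sequence_urls(refget_base, collection):
    # Step 2: one refget lookup per sequence digest in the collection.
    # The "/sequence/" path is an assumption about the refget service.
    base = refget_base.rstrip("/")
    return [
        f"{base}/sequence/{quote(seq_digest, safe='')}"
        for seq_digest in collection["sequences"]
    ]


# Hypothetical servers and a hypothetical response body for illustration.
SEQCOL_BASE = "https://seqcol.example.org/"
REFGET_BASE = "https://refget.example.org"
sample_collection = {
    "names": ["chr1", "chr2"],
    "lengths": [100, 200],
    "sequences": ["placeholder-digest-1", "placeholder-digest-2"],
}

lookup = collection_url(SEQCOL_BASE, "placeholder-collection-digest")
downloads = sequence_urls(REFGET_BASE, sample_collection)
```

Keeping URL construction separate from the HTTP calls lets the same helpers be used with any HTTP client.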
diff --git a/docs/simple_example.md b/docs/simple_example.md