From 0199c8fa15cbae6f6e021e3463aa94bf20e527de Mon Sep 17 00:00:00 2001
From: Sveinung Gundersen
Date: Thu, 22 Feb 2024 19:32:49 +0100
Subject: [PATCH 1/5] Misc grammar fixes

---
 docs/compare_collections.md |  4 ++--
 docs/decision_record.md     | 10 +++++-----
 docs/specification.md       |  2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/compare_collections.md b/docs/compare_collections.md
index 9322c45..b4de501 100644
--- a/docs/compare_collections.md
+++ b/docs/compare_collections.md
@@ -2,7 +2,7 @@
 ## Use case
-- You have a local sequence collection, and an digest for a collection in a server. You want to compare the two to see if they have the same coordinate system.
+- You have a local sequence collection, and a digest for a collection in a server. You want to compare the two to see if they have the same coordinate system.
 - You have two digests for collections you know are stored by a server. You want to compare them.
 ## How to do it
@@ -19,7 +19,7 @@ Therefore, we must be able to identify that two sequence collections are identic
 This comparison can easily be done by simply comparing the seqcol digest, you don't need the `/comparison` endpoint. **Two collections will have the same digest if they are identical in content and order for all `inherent` attributes.** Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute.
-If you have a local sequence collection, and an digest, then you can compare them for strict identity by computing the digest for the local collection and seeing if they match.
+If you have a local sequence collection and a digest, then you can compare them for strict identity by computing the digest for the local collection and seeing if they match.
 ### Order-relaxed identity
diff --git a/docs/decision_record.md b/docs/decision_record.md
index b49bf2e..e40d95c 100644
--- a/docs/decision_record.md
+++ b/docs/decision_record.md
@@ -179,9 +179,9 @@ A sequence collection consists of a set of arrays. The only arrays that MUST be
 Debate around what should be mandatory as centered on 3 specific attributes: sequences, names, and lengths:
-At first, it feels like sequences are a fundamental component of a sequence collections, and therefore, the *sequences* array should be mandatory, and names and lengths may be superfluous. For reference genomes, for example, it's clear that collections of sequences are the main function of sequence collections. However, analysis of reference genome data also includes many analyses for which the sequences themselves do not matter, and the critical component is simply the name and length of the sequence. An array of names and lengths can be thought of as a *coordinate system*, and we have realized that the sequence collection specification is *also* extremely useful for representing and uniquely identifying coordinate systems. From this perspective, we envision a coordinate system as a sequence collection in which the actual sequence content is irrelevant, but in which the lengths and names of the sequences are critical. Analysis of coordinate systems like this is very frequent. For example, any sort of annotation analysis looking at genomic regions will rely on the lengths of the sequences to enforce that coordinates refer to the same thing, but do not rely on the underlying sequences. This is why "chrom-sizes" files are used so frequently (*e.g.* across many UCSC tools).
+At first, it feels like sequences are fundamental components of sequence collections, and therefore, the *sequences* array should be mandatory, and names and lengths may be superfluous. For reference genomes, for example, it's clear that collections of sequences are the main function of sequence collections. However, analysis of reference genome data also includes many analyses for which the sequences themselves do not matter, and the critical component is simply the name and length of the sequence. An array of names and lengths can be thought of as a *coordinate system*, and we have realized that the sequence collection specification is *also* extremely useful for representing and uniquely identifying coordinate systems. From this perspective, we envision a coordinate system as a sequence collection in which the actual sequence content is irrelevant, but in which the lengths and names of the sequences are critical. Analysis of coordinate systems like this is very frequent. For example, any sort of annotation analysis looking at genomic regions will rely on the lengths of the sequences to enforce that coordinates refer to the same thing, but do not rely on the underlying sequences. This is why "chrom-sizes" files are used so frequently (*e.g.* across many UCSC tools).
-This leads us to the conclusion that *sequences* should be optional, and *names* and *lengths* should be the only mandatory component. *Lengths* makes sense because if you have a sequence, you can always compute it's length, but if you don't have a sequence (all you have is a coordinate system), you may only have a length. We debated extensively whether *names* should be mandatory, and in the end, decided that it's unlikely to pose much of a difficulty to make it mandatory, and provides a lot of convenience. If sequences lack names altogether, it is trivial to name them by index of the order of the sequences. We reason that downstream use cases are very likely to require at least *some* type of identifier to refer to each of the sequences, even if it's just the index of the sequence in the list. While it may be possible to imagine a use case where an identifier for each sequence is not required, it's not difficult at all to just assign indexes. By making it required, we ensure that implementations will always have the same possible way to reference the sequences in the collection.
+This leads us to the conclusion that *sequences* should be optional, and *names* and *lengths* should be the only mandatory components. *Lengths* makes sense because if you have a sequence, you can always compute it's length, but if you don't have a sequence (all you have is a coordinate system), you may only have a length. We debated extensively whether *names* should be mandatory, and in the end, decided that it's unlikely to pose much of a difficulty to make it mandatory, and provides a lot of convenience. If sequences lack names altogether, it is trivial to name them by index of the order of the sequences. We reason that downstream use cases are very likely to require at least *some* type of identifier to refer to each of the sequences, even if it's just the index of the sequence in the list. While it may be possible to imagine a use case where an identifier for each sequence is not required, it's not difficult at all to just assign indexes. By making it required, we ensure that implementations will always have the same possible way to reference the sequences in the collection.
 ## 2023-07-12 Implementations SHOULD provide sorted_name_length_pairs and comparison endpoint
@@ -214,11 +214,11 @@ This leads us to the conclusion that *sequences* should be optional, and *names*
 -
-## 2023-06-14 - Internal digest SHOULD NOT be prefixed
+## 2023-06-14 - Internal digests SHOULD NOT be prefixed
 ### Background
-In some situations, digest are prefixed. For example, these may be CURIEs, which specify namespaces or provide other information about what the digest represents. This raises questions about when and where we should expect or use prefixes. This has to be determined because including prefixes in the content that gets digested changes it, so we have to be consistent.
+In some situations, digests are prefixed. For example, these may be CURIEs, which specify namespaces or provide other information about what the digest represents. This raises questions about when and where we should expect or use prefixes. This has to be determined because including prefixes in the content that gets digested changes it, so we have to be consistent.
 ### Decision
@@ -745,7 +745,7 @@ The final sequence collection digests will reflect the order by digesting the ar
 ### Rationale
-Our earlier decision determined that order *must* be reflected in the sequence digests, but did not determine the way to ensure that. After months of debate we came up with 3 competing ideas that could do this:
+Our earlier decision determined that order *must* be reflected in the sequence digests, but did not determine the way to ensure that. After months of debate we came up with 4 competing ideas that could do this:
 A. Digest arrays in given order.
diff --git a/docs/specification.md b/docs/specification.md
index 81e5f8b..632b4b3 100644
--- a/docs/specification.md
+++ b/docs/specification.md
@@ -26,7 +26,7 @@ A common example and primary use case of sequence collections is for reference g
 In brief, the project specifies several procedures:
-1. **An algorithm for encoding sequence collection identifiers.** The GA4GH standard [refget sequences](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget sequence identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol digest are defined by a hash algorithm, rather than an accession authority, and are thus decentralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance.
+1. **An algorithm for encoding sequence collection identifiers.** The GA4GH standard [refget sequences](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget sequence identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol digests are defined by a hash algorithm, rather than an accession authority, and are thus decentralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance.
 2. **An API describing lookup and comparison of sequence collections.** Seqcol specifies a RESTful API to retrieve the sequence collection given a digest. A main use case is to reproduce the exact sequence collection (*e.g.* reference genome) used for analysis, instead of guessing based on a human-readable identifier. Seqcol also provides a standardized method of comparing the contents of two sequence collections. This comparison function can *e.g.* be used to determine if analysis results based on different references genomes are compatible.
 3. **Recommended ancillary, non-inherent attributes.** Finally, the protocol defines several recommended procedures that will improve the compatibility across Seqcol servers, and beyond.

From 2e02d03d0647adb219b28bf3f0de01e8aa339f31 Mon Sep 17 00:00:00 2001
From: Sveinung Gundersen
Date: Thu, 22 Feb 2024 19:33:19 +0100
Subject: [PATCH 2/5] RFC-8785 is not a standard. Rewrite accordingly

---
 docs/decision_record.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/decision_record.md b/docs/decision_record.md
index e40d95c..0d40c38 100644
--- a/docs/decision_record.md
+++ b/docs/decision_record.md
@@ -456,7 +456,7 @@ The JSON canonical serialisation defined in RFC-8785 has a limited set of refere
 ### Alternatives considered
-We spent a huge amount of time discussing approaches for what essentially amounts to a custom standard for creating the string-to-digest. A lot of this revolved around what delimiters to use. We made a lot of progress there and came up with some really interesting encoding schemas, which had many desirable characteristics. However, ultimately we decided that the value derived from using a third-party standard would trump the elegance, efficiency, and other benefits we recieved from our custom encoding schema. In particular, adopting the standard would make developers more likely to be able to rely on third-party implementations, reducing the burden to implement our standard. Also, this standard accommodates other sources that we had struggled with a bit, such as UTF-encoding.
+We spent a significant amount of time discussing approaches for what essentially amounts to a custom standard for creating the string-to-digest. A lot of this revolved around what delimiters to use. We made a lot of progress there and came up with some really interesting encoding schemas, which had many desirable characteristics. However, ultimately we decided that the value derived from using a comprehensive and well-developed third-party solution would trump the elegance, efficiency, and other benefits we received from our custom encoding schema. In particular, adopting the RFC-8785 would make developers more likely to be able to rely on third-party implementations, reducing the burden to implement our standard. Also, this solution accommodates other sources that we had struggled with a bit, such as UTF-encoding.
 ## 2022-10-05 - Terminology decisions

From 045d8f458880917ed2eafba2b3e9620bc293c85e Mon Sep 17 00:00:00 2001
From: Sveinung Gundersen
Date: Thu, 22 Feb 2024 19:35:17 +0100
Subject: [PATCH 3/5] Added additional argument against making names optional

---
 docs/decision_record.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/decision_record.md b/docs/decision_record.md
index 0d40c38..3da6434 100644
--- a/docs/decision_record.md
+++ b/docs/decision_record.md
@@ -181,7 +181,7 @@ Debate around what should be mandatory as centered on 3 specific attributes: seq
 At first, it feels like sequences are fundamental components of sequence collections, and therefore, the *sequences* array should be mandatory, and names and lengths may be superfluous. For reference genomes, for example, it's clear that collections of sequences are the main function of sequence collections. However, analysis of reference genome data also includes many analyses for which the sequences themselves do not matter, and the critical component is simply the name and length of the sequence. An array of names and lengths can be thought of as a *coordinate system*, and we have realized that the sequence collection specification is *also* extremely useful for representing and uniquely identifying coordinate systems. From this perspective, we envision a coordinate system as a sequence collection in which the actual sequence content is irrelevant, but in which the lengths and names of the sequences are critical. Analysis of coordinate systems like this is very frequent. For example, any sort of annotation analysis looking at genomic regions will rely on the lengths of the sequences to enforce that coordinates refer to the same thing, but do not rely on the underlying sequences. This is why "chrom-sizes" files are used so frequently (*e.g.* across many UCSC tools).
-This leads us to the conclusion that *sequences* should be optional, and *names* and *lengths* should be the only mandatory components. *Lengths* makes sense because if you have a sequence, you can always compute it's length, but if you don't have a sequence (all you have is a coordinate system), you may only have a length. We debated extensively whether *names* should be mandatory, and in the end, decided that it's unlikely to pose much of a difficulty to make it mandatory, and provides a lot of convenience. If sequences lack names altogether, it is trivial to name them by index of the order of the sequences. We reason that downstream use cases are very likely to require at least *some* type of identifier to refer to each of the sequences, even if it's just the index of the sequence in the list. While it may be possible to imagine a use case where an identifier for each sequence is not required, it's not difficult at all to just assign indexes. By making it required, we ensure that implementations will always have the same possible way to reference the sequences in the collection.
+This leads us to the conclusion that *sequences* should be optional, and *names* and *lengths* should be the only mandatory components. *Lengths* makes sense because if you have a sequence, you can always compute it's length, but if you don't have a sequence (all you have is a coordinate system), you may only have a length. We debated extensively whether *names* should be mandatory, and in the end, decided that it's unlikely to pose much of a difficulty to make it mandatory, and provides a lot of convenience. If sequences lack names altogether, it is trivial to name them by index of the order of the sequences. We reason that downstream use cases are very likely to require at least *some* type of identifier to refer to each of the sequences, even if it's just the index of the sequence in the list. While it may be possible to imagine a use case where an identifier for each sequence is not required, it's not difficult at all to just assign indexes. By making it required, we ensure that implementations will always have the same possible way to reference the sequences in the collection. Also, one potential use case for dropping the *names* array, namely to provide name-invariant sequence records for mapping purposes, will instead be possible to solve through defining an extra *non-inherent* and name-invariant attribute.
 ## 2023-07-12 Implementations SHOULD provide sorted_name_length_pairs and comparison endpoint

From e43b97202a07a1029df83ef401b179a5973fe942 Mon Sep 17 00:00:00 2001
From: Sveinung Gundersen
Date: Thu, 22 Feb 2024 19:44:52 +0100
Subject: [PATCH 4/5] Remove extra space

---
 docs/decision_record.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/decision_record.md b/docs/decision_record.md
index 3da6434..1ca880d 100644
--- a/docs/decision_record.md
+++ b/docs/decision_record.md
@@ -181,7 +181,7 @@ Debate around what should be mandatory as centered on 3 specific attributes: seq
 At first, it feels like sequences are fundamental components of sequence collections, and therefore, the *sequences* array should be mandatory, and names and lengths may be superfluous. For reference genomes, for example, it's clear that collections of sequences are the main function of sequence collections. However, analysis of reference genome data also includes many analyses for which the sequences themselves do not matter, and the critical component is simply the name and length of the sequence. An array of names and lengths can be thought of as a *coordinate system*, and we have realized that the sequence collection specification is *also* extremely useful for representing and uniquely identifying coordinate systems. From this perspective, we envision a coordinate system as a sequence collection in which the actual sequence content is irrelevant, but in which the lengths and names of the sequences are critical. Analysis of coordinate systems like this is very frequent. For example, any sort of annotation analysis looking at genomic regions will rely on the lengths of the sequences to enforce that coordinates refer to the same thing, but do not rely on the underlying sequences. This is why "chrom-sizes" files are used so frequently (*e.g.* across many UCSC tools).
-This leads us to the conclusion that *sequences* should be optional, and *names* and *lengths* should be the only mandatory components. *Lengths* makes sense because if you have a sequence, you can always compute it's length, but if you don't have a sequence (all you have is a coordinate system), you may only have a length. We debated extensively whether *names* should be mandatory, and in the end, decided that it's unlikely to pose much of a difficulty to make it mandatory, and provides a lot of convenience. If sequences lack names altogether, it is trivial to name them by index of the order of the sequences. We reason that downstream use cases are very likely to require at least *some* type of identifier to refer to each of the sequences, even if it's just the index of the sequence in the list. While it may be possible to imagine a use case where an identifier for each sequence is not required, it's not difficult at all to just assign indexes. By making it required, we ensure that implementations will always have the same possible way to reference the sequences in the collection. Also, one potential use case for dropping the *names* array, namely to provide name-invariant sequence records for mapping purposes, will instead be possible to solve through defining an extra *non-inherent* and name-invariant attribute.
+This leads us to the conclusion that *sequences* should be optional, and *names* and *lengths* should be the only mandatory components. *Lengths* makes sense because if you have a sequence, you can always compute it's length, but if you don't have a sequence (all you have is a coordinate system), you may only have a length. We debated extensively whether *names* should be mandatory, and in the end, decided that it's unlikely to pose much of a difficulty to make it mandatory, and provides a lot of convenience. If sequences lack names altogether, it is trivial to name them by index of the order of the sequences. We reason that downstream use cases are very likely to require at least *some* type of identifier to refer to each of the sequences, even if it's just the index of the sequence in the list. While it may be possible to imagine a use case where an identifier for each sequence is not required, it's not difficult at all to just assign indexes. By making it required, we ensure that implementations will always have the same possible way to reference the sequences in the collection. Also, one potential use case for dropping the *names* array, namely to provide name-invariant sequence records for mapping purposes, will instead be possible to solve through defining an extra *non-inherent* and name-invariant attribute.
 ## 2023-07-12 Implementations SHOULD provide sorted_name_length_pairs and comparison endpoint

From 5b5ce1aba58bacd3320dc1ffa9dff54be999c8be Mon Sep 17 00:00:00 2001
From: Sveinung Gundersen
Date: Thu, 22 Feb 2024 20:01:56 +0100
Subject: [PATCH 5/5] Fixed links to specification

---
 docs/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/README.md b/docs/README.md
index dba02b2..d8639cb 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -6,7 +6,7 @@ extra_css: [extra.css]

 Seqcol: Sequence Collections
 Unique identifiers and lookup service for sequence collections.
-Learn more
+Learn more
@@ -21,7 +21,7 @@ extra_css: [extra.css]
  • programmatic approach to assessing compatibility among sequence collections.
-  • Read the complete specification
+  • Read the complete specification