Commit

Merge pull request #68 from sveinugu/misc_small_fixes
Miscellaneous small fixes from review of #66
nsheff authored Feb 22, 2024
2 parents a311e5b + 5b5ce1a commit 0fe277f
Showing 4 changed files with 11 additions and 11 deletions.
4 changes: 2 additions & 2 deletions docs/README.md
@@ -6,7 +6,7 @@ extra_css: [extra.css]
<h1>Seqcol: Sequence Collections</h1>
<p>Unique identifiers and lookup service for sequence collections.
</p>
-<p><a class="btn btn-primary btn-lg" href="specification" role="button">Learn more</a></p>
+<p><a class="btn btn-primary btn-lg" href="specification.md" role="button">Learn more</a></p>
</div>
</div>
<div class="container">
@@ -21,7 +21,7 @@ extra_css: [extra.css]
<li>programmatic approach to assessing compatibility among sequence collections.</li>
</ol>
</p>
-<a href="specification">Read the complete specification</a>
+<a href="specification.md">Read the complete specification</a>
</blockquote>
</div>
<div class="col-md-4 text-center">
4 changes: 2 additions & 2 deletions docs/compare_collections.md
@@ -2,7 +2,7 @@

## Use case

-- You have a local sequence collection, and an digest for a collection in a server. You want to compare the two to see if they have the same coordinate system.
+- You have a local sequence collection, and a digest for a collection in a server. You want to compare the two to see if they have the same coordinate system.
- You have two digests for collections you know are stored by a server. You want to compare them.

## How to do it
@@ -19,7 +19,7 @@ Therefore, we must be able to identify that two sequence collections are identic
This comparison can easily be done by simply comparing the seqcol digest, you don't need the `/comparison` endpoint.
**Two collections will have the same digest if they are identical in content and order for all `inherent` attributes.**
Therefore, if the digests differ, then you know the collections differ in at least one inherent attribute.
-If you have a local sequence collection, and an digest, then you can compare them for strict identity by computing the digest for the local collection and seeing if they match.
+If you have a local sequence collection and a digest, then you can compare them for strict identity by computing the digest for the local collection and seeing if they match.

### Order-relaxed identity

12 changes: 6 additions & 6 deletions docs/decision_record.md
@@ -179,9 +179,9 @@ A sequence collection consists of a set of arrays. The only arrays that MUST be

Debate around what should be mandatory as centered on 3 specific attributes: sequences, names, and lengths:

-At first, it feels like sequences are a fundamental component of a sequence collections, and therefore, the *sequences* array should be mandatory, and names and lengths may be superfluous. For reference genomes, for example, it's clear that collections of sequences are the main function of sequence collections. However, analysis of reference genome data also includes many analyses for which the sequences themselves do not matter, and the critical component is simply the name and length of the sequence. An array of names and lengths can be thought of as a *coordinate system*, and we have realized that the sequence collection specification is *also* extremely useful for representing and uniquely identifying coordinate systems. From this perspective, we envision a coordinate system as a sequence collection in which the actual sequence content is irrelevant, but in which the lengths and names of the sequences are critical. Analysis of coordinate systems like this is very frequent. For example, any sort of annotation analysis looking at genomic regions will rely on the lengths of the sequences to enforce that coordinates refer to the same thing, but do not rely on the underlying sequences. This is why "chrom-sizes" files are used so frequently (*e.g.* across many UCSC tools).
+At first, it feels like sequences are fundamental components of sequence collections, and therefore, the *sequences* array should be mandatory, and names and lengths may be superfluous. For reference genomes, for example, it's clear that collections of sequences are the main function of sequence collections. However, analysis of reference genome data also includes many analyses for which the sequences themselves do not matter, and the critical component is simply the name and length of the sequence. An array of names and lengths can be thought of as a *coordinate system*, and we have realized that the sequence collection specification is *also* extremely useful for representing and uniquely identifying coordinate systems. From this perspective, we envision a coordinate system as a sequence collection in which the actual sequence content is irrelevant, but in which the lengths and names of the sequences are critical. Analysis of coordinate systems like this is very frequent. For example, any sort of annotation analysis looking at genomic regions will rely on the lengths of the sequences to enforce that coordinates refer to the same thing, but do not rely on the underlying sequences. This is why "chrom-sizes" files are used so frequently (*e.g.* across many UCSC tools).

-This leads us to the conclusion that *sequences* should be optional, and *names* and *lengths* should be the only mandatory component. *Lengths* makes sense because if you have a sequence, you can always compute it's length, but if you don't have a sequence (all you have is a coordinate system), you may only have a length. We debated extensively whether *names* should be mandatory, and in the end, decided that it's unlikely to pose much of a difficulty to make it mandatory, and provides a lot of convenience. If sequences lack names altogether, it is trivial to name them by index of the order of the sequences. We reason that downstream use cases are very likely to require at least *some* type of identifier to refer to each of the sequences, even if it's just the index of the sequence in the list. While it may be possible to imagine a use case where an identifier for each sequence is not required, it's not difficult at all to just assign indexes. By making it required, we ensure that implementations will always have the same possible way to reference the sequences in the collection.
+This leads us to the conclusion that *sequences* should be optional, and *names* and *lengths* should be the only mandatory components. *Lengths* makes sense because if you have a sequence, you can always compute it's length, but if you don't have a sequence (all you have is a coordinate system), you may only have a length. We debated extensively whether *names* should be mandatory, and in the end, decided that it's unlikely to pose much of a difficulty to make it mandatory, and provides a lot of convenience. If sequences lack names altogether, it is trivial to name them by index of the order of the sequences. We reason that downstream use cases are very likely to require at least *some* type of identifier to refer to each of the sequences, even if it's just the index of the sequence in the list. While it may be possible to imagine a use case where an identifier for each sequence is not required, it's not difficult at all to just assign indexes. By making it required, we ensure that implementations will always have the same possible way to reference the sequences in the collection. Also, one potential use case for dropping the *names* array, namely to provide name-invariant sequence records for mapping purposes, will instead be possible to solve through defining an extra *non-inherent* and name-invariant attribute.

## 2023-07-12 Implementations SHOULD provide sorted_name_length_pairs and comparison endpoint

@@ -214,11 +214,11 @@ This leads us to the conclusion that *sequences* should be optional, and *names*
- <https://github.com/ga4gh/seqcol-spec/issues/40>


-## 2023-06-14 - Internal digest SHOULD NOT be prefixed
+## 2023-06-14 - Internal digests SHOULD NOT be prefixed

### Background

-In some situations, digest are prefixed. For example, these may be CURIEs, which specify namespaces or provide other information about what the digest represents. This raises questions about when and where we should expect or use prefixes. This has to be determined because including prefixes in the content that gets digested changes it, so we have to be consistent.
+In some situations, digests are prefixed. For example, these may be CURIEs, which specify namespaces or provide other information about what the digest represents. This raises questions about when and where we should expect or use prefixes. This has to be determined because including prefixes in the content that gets digested changes it, so we have to be consistent.

### Decision

@@ -456,7 +456,7 @@ The JSON canonical serialisation defined in RFC-8785 has a limited set of refere

### Alternatives considered

-We spent a huge amount of time discussing approaches for what essentially amounts to a custom standard for creating the string-to-digest. A lot of this revolved around what delimiters to use. We made a lot of progress there and came up with some really interesting encoding schemas, which had many desirable characteristics. However, ultimately we decided that the value derived from using a third-party standard would trump the elegance, efficiency, and other benefits we recieved from our custom encoding schema. In particular, adopting the standard would make developers more likely to be able to rely on third-party implementations, reducing the burden to implement our standard. Also, this standard accommodates other sources that we had struggled with a bit, such as UTF-encoding.
+We spent a significant amount of time discussing approaches for what essentially amounts to a custom standard for creating the string-to-digest. A lot of this revolved around what delimiters to use. We made a lot of progress there and came up with some really interesting encoding schemas, which had many desirable characteristics. However, ultimately we decided that the value derived from using a comprehensive and well-developed third-party solution would trump the elegance, efficiency, and other benefits we received from our custom encoding schema. In particular, adopting the RFC-8785 would make developers more likely to be able to rely on third-party implementations, reducing the burden to implement our standard. Also, this solution accommodates other sources that we had struggled with a bit, such as UTF-encoding.

## 2022-10-05 - Terminology decisions

@@ -745,7 +745,7 @@ The final sequence collection digests will reflect the order by digesting the ar

### Rationale

-Our earlier decision determined that order *must* be reflected in the sequence digests, but did not determine the way to ensure that. After months of debate we came up with 3 competing ideas that could do this:
+Our earlier decision determined that order *must* be reflected in the sequence digests, but did not determine the way to ensure that. After months of debate we came up with 4 competing ideas that could do this:

A. Digest arrays in given order.

2 changes: 1 addition & 1 deletion docs/specification.md
@@ -26,7 +26,7 @@ A common example and primary use case of sequence collections is for reference g

In brief, the project specifies several procedures:

-1. **An algorithm for encoding sequence collection identifiers.** The GA4GH standard [refget sequences](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget sequence identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol digest are defined by a hash algorithm, rather than an accession authority, and are thus decentralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance.
+1. **An algorithm for encoding sequence collection identifiers.** The GA4GH standard [refget sequences](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget sequence identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol digests are defined by a hash algorithm, rather than an accession authority, and are thus decentralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance.
2. **An API describing lookup and comparison of sequence collections.** Seqcol specifies a RESTful API to retrieve the sequence collection given a digest. A main use case is to reproduce the exact sequence collection (*e.g.* reference genome) used for analysis, instead of guessing based on a human-readable identifier. Seqcol also provides a standardized method of comparing the contents of two sequence collections. This comparison function can *e.g.* be used to determine if analysis results based on different references genomes are compatible.
3. **Recommended ancillary, non-inherent attributes.** Finally, the protocol defines several recommended procedures that will improve the compatibility across Seqcol servers, and beyond.
