Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Representing caseLevelData/zygosity with VRS alleles #57

Open
daisieh opened this issue Feb 15, 2023 · 20 comments
Open

Representing caseLevelData/zygosity with VRS alleles #57

daisieh opened this issue Feb 15, 2023 · 20 comments

Comments

@daisieh
Copy link

daisieh commented Feb 15, 2023

If I'm creating a genomicVariant specification from a VCF variant record, I can't see how I'd specify multiple alleles in a single genomicVariant: the variation property seems to be singular? For example, a variant record might have a ref A and an alt C,T. Samples in that record might have genotypes that correspond to A/C, A/A, A/T, C/T.

LegacyVariation seems to be able to capture basic VCF-format ref/alt, at least in the case where there is only one alternate allele. It does not seem like there's an option for multiple alt alleles. So I could capture zygosity/genotype for A/A and A/T as caseLevelData corresponding to one Variation, and A/A and A/C as a different one (even there, how would I know which variation to put the A/A cases in?). But how would I represent C/T samples?

VRS's MolecularVariation seems to be the preferred schema moving forward, I assume. It seems like in this schema, there is no idea of a reference allele at all: each allele is represented by a single variation. But without an ability to specify multiple variations for a genomicVariant, how would I represent zygosity for caseLevelData?

@mbaudis
Copy link
Member

mbaudis commented Feb 16, 2023

That's actually a VRS question in the first place - how do I express genotypes in VRS, which is not yet part of "stable": https://vrs.ga4gh.org/en/latest/terms_and_model.html?#genotype

The current Beacon v2.0 explicitly references VRS 1.21 which does not yet contain a genotype definition. However:

  • the model just describes how you should represent record-level data in responses but has no strong binding on how you store your data
  • You can use your own schema and reference this in the response since you are not bound to use the Beacon default model in the response (well, it is good practice to do so, but...). So here having the VRS version changed to latest while leaving everything else should be an easy way of a "forward looking implementation" if this is needed for parsing/verification.
  • IMO we will update to keep in line with VRS as soon as there is a stable update (there are also the upcoming structural variant improvements)

So for anything I'd build right now I'd just go w/ the upcoming VRS. But all this doesn't impact the query side where anyway no combined alternateBases are allowed - so strictly only alleles can be queried2.


Now, this is about data representation ... Personally I'm not a fan of these direct "a variant is a genotype, at least sometimes". IMO for storage variants should always be alleles (or haplotypes, for phasing; + systemic like CNV) and then be post-composed (i.e. same analysisId). Which leads us to have a collection step at the Progenetix export stage where we create a variant from all case level instances of the same change instead having one variant w/ all its instances ¯\_(ツ)_/¯.

This criticism does not apply to the VRS model which will provide a transparent genotype composition if/when needed and anyway isn't really for data storage (in contrast to VCF files...).

Footnotes

  1. https://raw.githubusercontent.com/ga4gh/vrs/1.2/schema/vrs.json#/definitions/MolecularVariation

  2. You can file one or more issues here :-)

@daisieh
Copy link
Author

daisieh commented Feb 16, 2023

I'm still a bit confused...if each variant represents a single variation, that is, one allele, how can that same variant contain zygosity in caseLevelData? The definition for CaseLevelVariant contains an entry for zygosity, which is only apparently a word like heterozygous/homozygous, but how do you represent what the CaseLevelVariant is heterozygous or homozygous for, if there's only one variation listed?

@jrambla
Copy link
Contributor

jrambla commented Feb 16, 2023 via email

@daisieh
Copy link
Author

daisieh commented Feb 16, 2023

Okay. So this is assuming that we wouldn't want to record a T/T homozygote at all, because they're both reference?

@jrambla
Copy link
Contributor

jrambla commented Feb 16, 2023 via email

@daisieh
Copy link
Author

daisieh commented Feb 16, 2023

How would that reference homozygous caseLevelVariant be represented in the schema?

@jrambla
Copy link
Contributor

jrambla commented Feb 16, 2023 via email

@mbaudis
Copy link
Member

mbaudis commented Feb 16, 2023

It usually would be implicit, by not being reported/found, since being the default (if it is a homozygous case of the predominant allele). Which isn’t a very robust assumption.

The basic principle of genomic data exchange is that we talk about variations on some reference. However, one can only be sure about the state of any specific locus if having a confirmation that it has been assessed.

So if you have a variable locus which has been reported in your population analysis (i.e. it got its line in a VCF) you can read out that your sample didn’t have a variation.

However, such assertions are only for assessed loci; there is no guarantee that your locus has been assessed in a study (think panel or WES and intergenic region_.

And Beacon instances wouldn’t usually report on the T since it isn’t a variant. So a query for it with a “no” response would not mean that it isn’t there since it is not an alternativeBase. It could be reported, though, if interpreted as “if query base is reference than report hits on reference, i.e. reference allele in variant locus”.

A good point overall: How should Beacon instances match reference allele matches?

@daisieh
Copy link
Author

daisieh commented Feb 17, 2023

The general question is that to me, it seems like Beacon/VRS should want to capture what is possible in a VCF file. Since VCF specifically mentions "ALT — alternate base(s): Comma-separated list of alternate non-reference alleles" and therefore the Genotype field as being written as "GT (String): Genotype, encoded as allele values separated by either of / or |. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1, 1 | 0, or 1/2, etc," that VRS would want to be able to capture this. If a variant is only for a single alternate allele, I'm not clear on how one would capture genotypes for multiple alternate alleles, and therefore VRS would be losing data representation, relative to VCF?

We have vcf files from cancer patients. Each file contains two samples, TUMOUR and NORMAL. I think that our users will want to be able to query about whether or not we have patients that have variants present at a site. I think that it's most likely that the general-question edge case I mentioned above won't happen in our data, but I think it's possible that it could happen: let's say that a patient has a germline alternative genotype at a site, but the cancer has mutated one copy to yet another allele. I'd imagine that the VCF record could look something like:

REF: A
ALT: C,T
GT-NORMAL: 1/1
GT-TUMOUR: 1/2

@mrueda
Copy link
Collaborator

mrueda commented Feb 17, 2023

My five cents.

In the reference implementation we start with VCFs but at the database level we store each variant as biallelic. Thus, multiallelic variants are split into as many fields as ALT alleles (ALT> 0). Depending in how you formulate the query, you may get one hit or multiple.

You my wanna take a look to this file.

Hope this helps.

Thanks,

Manu

@daisieh
Copy link
Author

daisieh commented Feb 17, 2023

Thanks, Manu. That file is one of the ones I had been looking at, so it's good to hear from you. I think you were using the LegacyVariation schema, though, which does at least allow ref/alt in a single GenomicVariant...if you were to switch to the VRS MolecularVariation schema, how would you do it?

@jrambla
Copy link
Contributor

jrambla commented Feb 25, 2023

The description in the PR above is making "your" case clear to me.
Most of the things other contributors had said makes sense: we are following VRS, we model for single variants in the genomicVariation, etc.

The solution you suggest seems reasonable to me, but also makes the spec more complex (and it is enough already). Our rough suggestion, as per today, would be for the Beacon client (note that I don't say user) to query for heterozygous for allele A, save the list of sample donors, then query for the heterozygous for allele B, and intersect the list with the saved one. The intersection must give you the 1/2 expressed by VCF. Of course, this is a two step solution, but we envision some not simple queries to be addressed that way.

Also, as some of the contributors had said, we are evolving the spec according to the feedback we are having and a community process... for the compound heterozygosity, a simpler solution could be to add something like:

            "caseLevelData": [
              {
                "zygosity": {
                  "label": "1/2", // or 1/* or 1/?
                  "id": "GENO:GENO_0000402",
                 "secondaryAlleleIds": {
                        "genomicHGVSId": "NC_000001.11:g.55039979G>A"
                      }
                },
                "biosampleId": "HG03770"
              }
            ],

This is a very rough idea, but I hope the principle of it is clear enough.
Does it make sense?

@daisieh
Copy link
Author

daisieh commented Feb 26, 2023

This would work for my system, I think. I'll try it out and see how it goes. Thank you for the idea!

@jrambla
Copy link
Contributor

jrambla commented Feb 27, 2023 via email

@mbaudis
Copy link
Member

mbaudis commented Feb 27, 2023

@daisieh I'll try to summarize this in the FAQ soonish...

Edit: Note here http://docs.genomebeacons.org/FAQ/#haplotypes; please extend/fix at https://github.com/ga4gh-beacon/beacon-v2/blob/main/docs/FAQ.md ...

@daisieh
Copy link
Author

daisieh commented Mar 2, 2023

Would there be any harm in suggesting secondaryAlleleIds as a property in zygosity for every variation? That is, a sample with zygosity 0/1 would be listed in caseLevelData for the ref variation, with a secondaryAlleleId for the alt variation, and also listed in caseLevelData for the alt variation, with a secondaryAlleleId for the ref variation?

@jrambla
Copy link
Contributor

jrambla commented Mar 2, 2023

I'm sure that I understand your suggestion correctly, but I will risk commenting on it ;-)
If a user wants to know all variants in a given position, this is equivalent to query for a region that only includes that base (e.g. chr22:1000-1001). This query will return all variants seen by that Beacon in that position.

@daisieh
Copy link
Author

daisieh commented Mar 2, 2023

If there were three alleles seen at a specific location, like 22:1000-1001, and we had three samples with genotypes 0/0, 0/1 and 1/2, I am suggesting that you'd have:

{
    "variation": { allele_0 },
    "caseLevelData": [
        {
            "biosampleId": sample_1,
            "zygosity": {
                "label": "0/0"
            }
        },
        {
            "biosampleId": sample_2,
            "zygosity": {
                "label": "0/1",
                "secondaryAlleleId": allele_0
            }
        }
    ]
},
{
    "variation": { allele_1 },
    "caseLevelData": [
        {
            "biosampleId": sample_2,
            "zygosity": {
                "label": "0/1",
                "secondaryAlleleId": allele_0
            }
        },
        {
            "biosampleId": sample_3,
            "zygosity": {
                "label": "1/2",
                "secondaryAlleleId": allele_2
            }
        }
    ]
},
{
    "variation": { allele_2 },
    "caseLevelData": [
        {
            "biosampleId": sample_3,
            "zygosity": {
                "label": "1/2",
                "secondaryAlleleId": allele_1
            }
        }
    ]
}

So all alleles present in the samples are accounted for in the caseLevelData, and are associated with their biosample and the other allele in the genotype if there is one present.

@daisieh
Copy link
Author

daisieh commented Mar 2, 2023

I've summarized what I'm suggesting above in this yaml snippet from my openapi schema for a caseLevelVariant. Instead of just zygosity, replace with this object:

              genotype:
                type: object
                properties:
                  zygosity:
                    $ref: '#/components/schemas/OntologyTerm'
                    description: Ontology term for zygosity in which variant is present in the sample from the Zygosity Ontology (GENO:0000391) , e.g `heterozygous` (GENO:0000135)
                    examples:
                      - id: GENO:0000458
                        label: simple heterozygous
                      - id: GENO:0000402
                        label: compound heterozygous
                      - id: GENO:0000136
                        label: homozygous
                  value:
                    type: string
                    description: VCF GT-style value, e.g. 0/0, 1|2
                  secondaryAlleleIds:
                    type: array
                    description: variantInternalIds of the other allele(s) present in this genotype
                    items:
                      type: string
                required:
                  - zygosity

@mbaudis
Copy link
Member

mbaudis commented Nov 5, 2024

A development relating to this is the proposal to create a multiVariantParameters (name TBD...) query object which is just a list of basic current g_variant query objects and where the procedure would be to return all biosample objects containing hits for all of the variants.

While a main use case would be "double hit" events (mutation of one allele & deletion of the other of a tumor suppressor gene) this would also serve for "2 alleles" cases. However, it would not delineate homozygous ALT + ALT from REF + ALT genotypes since there is no way ATM to query for reference matched alleles.

In principle moving to VRS style and querying for sequence together with multiVariantParameters query would work. Given a reference allele of T and an alternate of A:

multiVariantParameters:
  - referenceName: "refseq:NC_000003.12"
    start: [1234567]
    sequence: "T"
  - referenceName: "refseq:NC_000003.12"
    start: [1234567]
    sequence: "A"

Now, the backend would have to get the match to the reference T allele as an interpretation of a 0/1 call in a VCF, or through whatever logic one uses to retrieve this.

IMO at least a partial but also more general solution (since not focussed on genotypes only) and more in the "Beacon spirit".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants