Representing caseLevelData/zygosity with VRS alleles #57
Comments
That's actually a VRS question in the first place - how do I express genotypes in VRS, which is not yet part of "stable": https://vrs.ga4gh.org/en/latest/terms_and_model.html?#genotype The current Beacon v2.0 explicitly references VRS 1.21 which does not yet contain a genotype definition. However:
So for anything I'd build right now I'd just go w/ the upcoming VRS. But all this doesn't impact the query side where anyway no combined
Now, this is about data representation ... Personally I'm not a fan of these direct "a variant is a genotype, at least sometimes". IMO for storage variants should always be alleles (or haplotypes, for phasing; + systemic like CNV) and then be post-composed (i.e. same
This criticism does not apply to the VRS model which will provide a transparent genotype composition if/when needed and anyway isn't really for data storage (in contrast to VCF files...).
|
I'm still a bit confused...if each variant represents a single variation, that is, one allele, how can that same variant contain zygosity in caseLevelData? The definition for CaseLevelVariant contains an entry for zygosity, which is only apparently a word like heterozygous/homozygous, but how do you represent what the CaseLevelVariant is heterozygous or homozygous for, if there's only one variation listed? |
Hi,
Assuming you are looking for T > A
Heterozygous would be T + A (one for each chromosome copy)
Homozygous would be A + A (one for each chromosome copy)
"T" is the allele in the reference genome, and the second case level has both copies "mutated".
Hope this clarifies.
Jordi
|
Okay. So this is assuming that we wouldn't want to record a T/T homozygote at all, because they're both reference? |
Hi
We probably would need more details on the use case, but I believe that the spec is allowing such reference homozygous w/o problem.
Jordi
|
How would that reference homozygous caseLevelVariant be represented in the schema? |
This is a good question.
Given that this seems a corner case, my colleagues could have a different view, but I would suggest using 0/0 for it.
Jordi
|
It would usually be implicit: it is not reported/found because it is the default (if it is a homozygous case of the predominant allele). Which isn't a very robust assumption. The basic principle of genomic data exchange is that we talk about variations on some reference. However, one can only be sure about the state of any specific locus if there is confirmation that it has been assessed. So if you have a variable locus which has been reported in your population analysis (i.e. it got its line in a VCF) you can read out that your sample didn't have a variation. However, such assertions hold only for assessed loci; there is no guarantee that your locus has been assessed in a study (think of a panel or WES and an intergenic region). And Beacon instances wouldn't usually report on the
A good point overall: How should Beacon instances match reference allele matches? |
The general question is that, to me, it seems like Beacon/VRS should want to capture what is possible in a VCF file. VCF specifically mentions "ALT - alternate base(s): Comma-separated list of alternate non-reference alleles" and therefore defines the Genotype field as "GT (String): Genotype, encoded as allele values separated by either of / or |. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1, 1 | 0, or 1/2, etc," so I would expect VRS to want to be able to capture this. If a variant is only for a single alternate allele, I'm not clear on how one would capture genotypes for multiple alternate alleles, and therefore whether VRS would be losing data representation relative to VCF.
We have VCF files from cancer patients. Each file contains two samples, TUMOUR and NORMAL. I think that our users will want to be able to query whether or not we have patients that have variants present at a site. The general-question edge case I mentioned above most likely won't happen in our data, but it could: let's say that a patient has a germline alternative genotype at a site, but the cancer has mutated one copy to yet another allele. I'd imagine that the VCF record could look something like:
|
My five cents. In the reference implementation we start with VCFs, but at the database level we store each variant as biallelic. Thus, multiallelic variants are split into as many fields as there are ALT alleles (ALT > 0). Depending on how you formulate the query, you may get one hit or multiple. You may want to take a look at this file. Hope this helps. Thanks, Manu |
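A minimal sketch of the splitting described above, assuming a VCF-style record with a comma-separated ALT field. This is not the reference implementation's code; the `BiallelicRecord` class and `split_multiallelic` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiallelicRecord:
    """One REF/ALT pair derived from a (possibly multiallelic) VCF record."""
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_index: int  # 1-based position of this ALT in the original ALT field


def split_multiallelic(chrom, pos, ref, alt_field):
    """Return one biallelic record per ALT allele listed in alt_field."""
    # "." means no ALT allele was reported, so nothing is produced for it.
    alts = [a for a in alt_field.split(",") if a and a != "."]
    return [
        BiallelicRecord(chrom, pos, ref, alt, index)
        for index, alt in enumerate(alts, start=1)
    ]


# Hypothetical record: REF A with two ALT alleles, C and T.
records = split_multiallelic("22", 1000, "A", "C,T")
for record in records:
    print(record)
```

Keeping `alt_index` preserves the link back to the GT allele numbers of the original record, which would otherwise be lost once the alleles are stored separately.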
Thanks, Manu. That file is one of the ones I had been looking at, so it's good to hear from you. I think you were using the LegacyVariation schema, though, which does at least allow ref/alt in a single GenomicVariant...if you were to switch to the VRS MolecularVariation schema, how would you do it? |
The description in the PR above makes "your" case clear to me. The solution you suggest seems reasonable to me, but it also makes the spec more complex (and it is complex enough already). Our rough suggestion, as of today, would be for the Beacon client (note that I don't say user) to query for heterozygous for allele A, save the list of sample donors, then query for heterozygous for allele B, and intersect that list with the saved one. The intersection must give you the 1/2 expressed by VCF. Of course, this is a two-step solution, but we envision some non-simple queries being addressed that way. Also, as some of the contributors have said, we are evolving the spec according to the feedback we are getting and a community process... for the compound heterozygosity, a simpler solution could be to add something like:
This is a very rough idea, but I hope the principle of it is clear enough. |
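The two-step client-side approach (query heterozygous for allele A, query heterozygous for allele B, intersect) can be sketched as below. The sample identifiers and the `compound_heterozygous` helper are hypothetical; in practice the two lists would come from two separate Beacon queries for alleles at the same locus, which are not shown here.

```python
def compound_heterozygous(samples_het_a, samples_het_b):
    """Return the samples reported as heterozygous for both alleles."""
    return set(samples_het_a) & set(samples_het_b)


# Hypothetical results of two separate queries (identifiers are made up).
het_for_allele_a = ["sample1", "sample2", "sample5"]
het_for_allele_b = ["sample2", "sample3", "sample5"]

candidates = compound_heterozygous(het_for_allele_a, het_for_allele_b)
print(sorted(candidates))
```

The intersection only corresponds to a VCF 1/2 genotype if both queries target the same position; for alleles at different positions it would simply list samples carrying both variants.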
This would work for my system, I think. I'll try it out and see how it goes. Thank you for the idea! |
Any feedback would be appreciated, yes!
Jordi
|
@daisieh I'll try to summarize this in the FAQ soonish... Edit: Note here http://docs.genomebeacons.org/FAQ/#haplotypes; please extend/fix at https://github.com/ga4gh-beacon/beacon-v2/blob/main/docs/FAQ.md ... |
Would there be any harm in suggesting |
I'm not sure that I understand your suggestion correctly, but I will risk commenting on it ;-) |
If there were three alleles seen at a specific location, like 22:1000-1001, and we had three samples with genotypes 0/0, 0/1 and 1/2, I am suggesting that you'd have:
So all alleles present in the samples are accounted for in the caseLevelData, and are associated with their biosample and the other allele in the genotype if there is one present. |
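One way to picture this grouping, as a rough sketch only: every allele seen in a genotype gets a caseLevelData-like entry carrying the biosample and the other allele of that genotype. The field name `otherAllele`, the sample identifiers, and the `case_level_entries` helper are assumptions made for illustration and are not part of the Beacon schema; alleles are kept as VCF-style indices, and only fully called diploid genotypes are handled.

```python
import re


def case_level_entries(genotypes):
    """Map each allele index to entries of (biosample, other allele).

    genotypes: dict of biosample id -> diploid GT string such as "0/1".
    """
    entries = {}
    for biosample_id, gt in genotypes.items():
        first, second = (int(index) for index in re.split(r"[/|]", gt))
        for allele, other in ((first, second), (second, first)):
            entry = {"biosampleId": biosample_id, "otherAllele": other}
            bucket = entries.setdefault(allele, [])
            if entry not in bucket:  # homozygous calls would otherwise repeat
                bucket.append(entry)
    return entries


# The three genotypes from the example above, with made-up sample ids.
entries = case_level_entries({"s1": "0/0", "s2": "0/1", "s3": "1/2"})
for allele, cases in sorted(entries.items()):
    print(allele, cases)
```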
I've summarized what I'm suggesting above in this yaml snippet from my openapi schema for a caseLevelVariant. Instead of just zygosity, replace with this object:
|
A development relating to this is the proposal to create a While a main use case would be "double hit" events (mutation of one allele & deletion of the other of a tumor suppressor gene) this would also serve for "2 alleles" cases. However, it would not delineate homozygous ALT + ALT from REF + ALT genotypes since there is no way ATM to query for reference matched alleles. In principle moving to VRS style and querying for
Now, the backend would have to get the match to the reference IMO at least a partial but also more general solution (since not focussed on genotypes only) and more in the "Beacon spirit". |
If I'm creating a genomicVariant specification from a VCF variant record, I can't see how I'd specify multiple alleles in a single genomicVariant: the variation property seems to be singular? For example, a variant record might have a ref A and an alt C,T. Samples in that record might have genotypes that correspond to A/C, A/A, A/T, C/T.
LegacyVariation seems to be able to capture basic VCF-format ref/alt, at least in the case where there is only one alternate allele. It does not seem like there's an option for multiple alt alleles. So I could capture zygosity/genotype for A/A and A/T as caseLevelData corresponding to one Variation, and A/A and A/C as a different one (even there, how would I know which variation to put the A/A cases in?). But how would I represent C/T samples?
VRS's MolecularVariation seems to be the preferred schema moving forward, I assume. It seems like in this schema, there is no idea of a reference allele at all: each allele is represented by a single variation. But without an ability to specify multiple variations for a genomicVariant, how would I represent zygosity for caseLevelData?
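For reference, the genotypes in the example above map to VCF GT values as in the following sketch, which follows the VCF convention that 0 is the REF allele and 1, 2, ... index the ALT list. The `decode_gt` helper is a hypothetical name, not part of any library.

```python
import re


def decode_gt(gt, ref, alts):
    """Translate a VCF GT string into allele bases; "." stays None."""
    alleles = [ref] + alts  # index 0 is REF, 1.. are the ALT alleles
    return [
        None if index == "." else alleles[int(index)]
        for index in re.split(r"[/|]", gt)
    ]


# REF A with ALT C,T, as in the example record above.
ref, alts = "A", ["C", "T"]
for gt in ("0/1", "0/0", "0/2", "1/2"):
    print(gt, "/".join(decode_gt(gt, ref, alts)))
```

The 1/2 call (C/T) is the case with no reference allele in the genotype, which is what a single ref/alt pair cannot express.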