If a mutation in the bundle is unmeasured, throw away data for all variants with mutations at this site #84
Comments
Just to clarify, you're saying that if either the forward mutation or the reversion exists, then the double-mutation counting is not a problem because:
|
It seems a little hand-wavy and maybe too confident about what Not that I disagree with the logic, but it seems like currently, this is purely a theoretical problem and thus maybe not as high priority ... |
I agree if just the forward mutation is there it's a bit strange. And same thing for just the reverse mutation. So there may be an argument for excluding the site in those cases as well. I think you're right that it isn't currently a problem with the spike data since I think all forward and reverse mutations in the bundle are sampled in the libraries (excluding indels). I think if we made the change I suggested, there wouldn't be much or any additional data that gets thrown out and the overall results wouldn't change. But, I think it is a theoretical problem for future users. It seems like an easy fix, and it makes it easier to describe a general strategy for dropping mutations in the bundle from the summation term when they are missing from the data. Whether For priority, I think it's probably something we'd want to update before posting the paper. But let's discuss on Monday. I know you have a lot on your plate! |
I think I understand the logic above, but to me it seems like another symptom of our approach to modeling unobserved mutations, which is to not model them. While this approach avoids assuming some additional prior information, it results in an overly brittle model. The other main symptom of this is that we've convinced ourselves—erroneously, I've argued—that we can't do out-of-sample prediction. I'll record below how I would deal with unobserved mutations, since this is tangentially relevant to what you decide for this issue: It is not unusual for a statistical model on categorical data to have to cope with features that end up being constant over a training set, but variable in a test set.
|
Thanks, Will. I agree that these are neat ideas. For the purposes of getting something out the door this week, my suggestion would be to have our base model be something that throws out the problematic variants described above, so that users could opt to avoid making assumptions if they wish. But, then to include your suggested approaches for estimating the effects of unobserved mutations as optional features that we could add onto the base model in future versions of multidms. |
@jgallowa07: just pinging this thread since in my opinion it would be good to resolve this issue in version 1.0. To sum up our above convo, I'd suggest using the draft code we already wrote to:
This is likely to have zero or very minimal effect on the spike data, as well as other datasets from the Bloom lab. But, it helps to avoid problems in the future and provides a concrete strategy for dealing with the basic issue described above. But let me know if you disagree. |
Here is my reasoning for doing this, which I've added to the SI (for context see "Reckoning mutations with respect to the reference experiment"):
@jgallowa07: if you agree with this logic, I think the thing to do would be to include these sites in the list of disallowed sites that gets used to identify variants to discard. This would only apply to sites where the forward mutation is missing from the reference experiment and the reverse mutation is missing from the non-reference experiment.
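For concreteness, here is a minimal sketch of the filtering step described above. It is not the draft code mentioned earlier and not the multidms API: the function names, the `bundle` dictionary, and the space-separated mutation strings are all hypothetical and only serve to illustrate the logic. It reads the rule literally, so a site is disallowed only when the forward mutation is missing from the reference experiment and the reverse mutation is also missing from the non-reference experiment.

```python
import re

# Hypothetical helper names and data layout; this is not the multidms API.
MUTATION_RE = re.compile(r"[A-Z*-](\d+)[A-Z*-]")


def mutation_site(mutation):
    """Return the integer site of a mutation string such as 'A10G'."""
    match = MUTATION_RE.fullmatch(mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation: {mutation!r}")
    return int(match.group(1))


def find_disallowed_sites(bundle, ref_observed, nonref_observed):
    """Sites in the bundle where neither direction was measured.

    bundle: dict mapping site -> (reference aa, non-reference aa)
    ref_observed: set of mutations seen in the reference experiment
    nonref_observed: set of mutations seen in the non-reference experiment
    """
    disallowed = set()
    for site, (ref_aa, nonref_aa) in bundle.items():
        forward = f"{ref_aa}{site}{nonref_aa}"  # expected in the reference experiment
        reverse = f"{nonref_aa}{site}{ref_aa}"  # expected in the non-reference experiment
        # Literal reading of the rule: both directions must be missing.
        if forward not in ref_observed and reverse not in nonref_observed:
            disallowed.add(site)
    return disallowed


def discard_variants(variants, disallowed):
    """Keep only variants that carry no mutation at a disallowed site."""
    return [
        v for v in variants
        if not any(mutation_site(m) in disallowed for m in v.split())
    ]


# Toy data (made up): site 10 has its forward mutation measured, site 25 has neither.
bundle = {10: ("A", "G"), 25: ("T", "K")}
ref_observed = {"A10G", "T25S"}
nonref_observed = {"G10A", "K25S"}

disallowed = find_disallowed_sites(bundle, ref_observed, nonref_observed)
kept = discard_variants(["A10G", "T25S", "A10G T25S", ""], disallowed)
```

In this toy example only site 25 is disallowed, so every variant with a mutation at site 25 is dropped, including ones whose site-25 mutation is not itself part of the bundle. If we instead decide to exclude a site when either direction is missing, the only change would be replacing `and` with `or` in `find_disallowed_sites`.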