Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable longer reads by default for longer insert protocols #54

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

bwlang
Copy link
Contributor

@bwlang bwlang commented Aug 1, 2021

No description provided.

@bwlang bwlang requested review from wm75 and mvdbeek August 1, 2021 19:18
@bwlang
Copy link
Contributor Author

bwlang commented Aug 1, 2021

@wm75
Copy link
Contributor

wm75 commented Aug 1, 2021

Interesting! ... and good to know about the protocol.

The problem is, however, that this will worsen accuracy of the WF in the common ARTIC primer case. A length of 2000 nts would definitely be indicative of a chimeric read.

The current defaults are taken from https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html - the Read Filtering section in there (sorry no permalink to the chapter available). What would be cool to do in the WF would be to analyze the primer scheme and calculate the length limits from it (like suggested in that Read Filtering section). If you have time to think about an implementation for this, it would be a most welcome improvement.

Otherwise the only thing I think should/could be done here is explain in the WF param help what reasonable defaults are (you could just copy/paste that from the ARTIC page I guess, or come up with your own rule).

@bwlang
Copy link
Contributor Author

bwlang commented Aug 1, 2021

chimeras are not very toxic since we have a reference genome to help here.
however - filtering out too may reads is very bad for contiguity.

I compared 700 vs 2000 max length on some real ARTIC ONT datasets.
2000 produces superior consensus sequences:

2000 max length filters out 148 reads with the 2000 max length vs. ~11k reads with a the 700 default.

coverage with upper limit 2000:
image
vs with upper limit 700:
image
very similar.
however comparing the consensus sequences we see material improvements at the ends (less ambiguity):

extreme 5' :
image

extreme 3':
image

otherwise the consensus sequences are identical.

I think think it's wise to increase this default for all protocols.

max 2000 data:
https://usegalaxy.eu/u/brad_langhorst/h/varskip-short-covid-19-variation-analysis-of-artic-ont-data-2000-max-length

max 700 data:
https://usegalaxy.eu/u/brad_langhorst/h/varskip-short-atcc1986-example-history-ont

@wm75
Copy link
Contributor

wm75 commented Aug 1, 2021

Hmm, I need to think about this and discuss it with a few people. I do understand that a higher upper threshold leads to better coverage, but I'm less sure that this never decreases variant detection/consensus accuracy.

@bwlang
Copy link
Contributor Author

bwlang commented Aug 2, 2021

I did a bam diff (locally since the tool seems to be failing in usegalaxy.eu)

bam diff --noPhoneHome --in1 s06_500k_2000.bam --in2 s06_500k_700.bam --out 2000_vs_700.bam

produced: 2000_vs_700_only1_s06_500k_2000.bam
image

Looks like minimap handles these nicely.

e.g.

Read name = 2f2f2a20-d4a4-471d-bf09-cd52d467f0ea
Read length = 1,246bp
Flags = 0
----------------------
Mapping = Primary @ MAPQ 60
Reference span = NC_045512.2:47-559 (+) = 513bp
Cigar = 713S136M1I74M1D13M1D35M1D65M1I15M1D15M1D155M23S
Clipping = Left 713 soft; Right 23 soft
----------------------
SupplementaryAlignments
NC_045512.2:23-582 (-) = 559bp @MAPQ 60 NM29
----------------------
s1 = 467
s2 = 0
NM = 10
AS = 1048
de = 0.0178
rl = 0
cm = 77
nn = 0
tp = P
ms = 1048
Hidden tags: SA, XALocation = NC_045512.2:215
Base = A @ QV 18

@bwlang
Copy link
Contributor Author

bwlang commented Aug 2, 2021

I could imagine some mis-clipping at chimera boundaries, but I think these effects will eliminated by ivar trim.

@@ -103,7 +103,7 @@
"y": 710.5
},
"tool_id": null,
"tool_state": "{\"default\": 700, \"parameter_type\": \"integer\", \"optional\": true}",
"tool_state": "{\"default\": 2000, \"parameter_type\": \"integer\", \"optional\": true}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if there is some benefit in doing this we should probably make this a workflow parameter ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm fine with a workflow parameter, but I think it's good to also set the default based on this data (or other generate more data?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants