Enable longer reads by default for longer insert protocols #54

bwlang · 2021-08-01T19:18:41Z

No description provided.

bwlang · 2021-08-01T19:39:54Z

see: https://usegalaxy.eu/u/brad_langhorst/h/varskip-long-covid-19-variation-analysis-on-artic-ont-data-1 for example data

wm75 · 2021-08-01T19:52:47Z

Interesting! ... and good to know about the protocol.

The problem is, however, that this will worsen accuracy of the WF in the common ARTIC primer case. A length of 2000 nts would definitely be indicative of a chimeric read.

The current defaults are taken from https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html - the Read Filtering section in there (sorry no permalink to the chapter available). What would be cool to do in the WF would be to analyze the primer scheme and calculate the length limits from it (like suggested in that Read Filtering section). If you have time to think about an implementation for this, it would be a most welcome improvement.

Otherwise the only thing I think should/could be done here is explain in the WF param help what reasonable defaults are (you could just copy/paste that from the ARTIC page I guess, or come up with your own rule).

bwlang · 2021-08-01T20:31:33Z

chimeras are not very toxic since we have a reference genome to help here.
however - filtering out too may reads is very bad for contiguity.

I compared 700 vs 2000 max length on some real ARTIC ONT datasets.
2000 produces superior consensus sequences:

2000 max length filters out 148 reads with the 2000 max length vs. ~11k reads with a the 700 default.

coverage with upper limit 2000:

vs with upper limit 700:

very similar.
however comparing the consensus sequences we see material improvements at the ends (less ambiguity):

extreme 5' :

extreme 3':

otherwise the consensus sequences are identical.

I think think it's wise to increase this default for all protocols.

max 2000 data:
https://usegalaxy.eu/u/brad_langhorst/h/varskip-short-covid-19-variation-analysis-of-artic-ont-data-2000-max-length

max 700 data:
https://usegalaxy.eu/u/brad_langhorst/h/varskip-short-atcc1986-example-history-ont

wm75 · 2021-08-01T20:40:03Z

Hmm, I need to think about this and discuss it with a few people. I do understand that a higher upper threshold leads to better coverage, but I'm less sure that this never decreases variant detection/consensus accuracy.

bwlang · 2021-08-02T01:19:14Z

I did a bam diff (locally since the tool seems to be failing in usegalaxy.eu)

bam diff --noPhoneHome --in1 s06_500k_2000.bam --in2 s06_500k_700.bam --out 2000_vs_700.bam

produced: 2000_vs_700_only1_s06_500k_2000.bam

Looks like minimap handles these nicely.

e.g.

Read name = 2f2f2a20-d4a4-471d-bf09-cd52d467f0ea
Read length = 1,246bp
Flags = 0
----------------------
Mapping = Primary @ MAPQ 60
Reference span = NC_045512.2:47-559 (+) = 513bp
Cigar = 713S136M1I74M1D13M1D35M1D65M1I15M1D15M1D155M23S
Clipping = Left 713 soft; Right 23 soft
----------------------
SupplementaryAlignments
NC_045512.2:23-582 (-) = 559bp @MAPQ 60 NM29
----------------------
s1 = 467
s2 = 0
NM = 10
AS = 1048
de = 0.0178
rl = 0
cm = 77
nn = 0
tp = P
ms = 1048
Hidden tags: SA, XALocation = NC_045512.2:215
Base = A @ QV 18

bwlang · 2021-08-02T01:21:56Z

I could imagine some mis-clipping at chimera boundaries, but I think these effects will eliminated by ivar trim.

mvdbeek · 2021-09-01T14:21:39Z

...flows/sars-cov-2-variant-calling/sars-cov-2-ont-artic-variant-calling/ont-artic-variation.ga

@@ -103,7 +103,7 @@
                "y": 710.5
            },
            "tool_id": null,
-            "tool_state": "{\"default\": 700, \"parameter_type\": \"integer\", \"optional\": true}",
+            "tool_state": "{\"default\": 2000, \"parameter_type\": \"integer\", \"optional\": true}",


I think if there is some benefit in doing this we should probably make this a workflow parameter ?

i'm fine with a workflow parameter, but I think it's good to also set the default based on this data (or other generate more data?)

Enabel longer reads by default for longer insert protocols

177f5cb

bwlang requested review from wm75 and mvdbeek August 1, 2021 19:18

Actually 2000 is much better for this example data (1.4kb amplicons)

5b693e5

mvdbeek reviewed Sep 1, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable longer reads by default for longer insert protocols #54

Enable longer reads by default for longer insert protocols #54

bwlang commented Aug 1, 2021

bwlang commented Aug 1, 2021 •

edited

Loading

wm75 commented Aug 1, 2021

bwlang commented Aug 1, 2021

wm75 commented Aug 1, 2021

bwlang commented Aug 2, 2021

bwlang commented Aug 2, 2021

mvdbeek Sep 1, 2021

bwlang Sep 1, 2021

Enable longer reads by default for longer insert protocols #54

Are you sure you want to change the base?

Enable longer reads by default for longer insert protocols #54

Conversation

bwlang commented Aug 1, 2021

bwlang commented Aug 1, 2021 • edited Loading

wm75 commented Aug 1, 2021

bwlang commented Aug 1, 2021

wm75 commented Aug 1, 2021

bwlang commented Aug 2, 2021

bwlang commented Aug 2, 2021

mvdbeek Sep 1, 2021

Choose a reason for hiding this comment

bwlang Sep 1, 2021

Choose a reason for hiding this comment

bwlang commented Aug 1, 2021 •

edited

Loading