---
title: "**Nextflow Development - Metadata Propagation**"
output:
  html_document:
    toc: false
    toc_float: false
from: markdown+emoji
---

::: callout-tip
### Objectives{.unlisted}

- Gain an understanding of how to manipulate and propagate metadata
:::

## **Environment Setup**

Set up an interactive shell to run our Nextflow workflow:

```default
srun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash
```

Load the required modules to run Nextflow:

```default
module load nextflow/23.04.1
module load singularity/3.7.3
```

Set the singularity cache environment variable:

```default
export NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow
```

Singularity images downloaded by workflow executions will now be stored in this directory.

You may want to include these, or other environment variables, in your `.bashrc` file (or alternate) that is loaded when you log in, so you don't need to export variables every session. A complete list of environment variables can be found [here](https://www.nextflow.io/docs/latest/config.html#environment-variables).
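
For example, you could append the export to your `~/.bashrc` so it is set on every login (a minimal sketch; adjust to your own shell setup):

```default
# persist the Singularity cache location across sessions
echo 'export NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow' >> ~/.bashrc
```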

The training data can be cloned from:

```default
git clone https://github.com/nextflow-io/training.git
```

## **7.1 Metadata Parsing**

We have covered a few different methods of metadata parsing.

### **7.1.1 First Pass: `.fromFilePairs`**

A first-pass attempt at pulling these files into Nextflow might use the `fromFilePairs` method:

```default
workflow {
    Channel.fromFilePairs("/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz")
        .view()
}
```

Nextflow will pull out the first part of the fastq filename and return a channel of tuple elements, where the first element is the filename-derived ID and the second element is a list of two fastq files.
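
For instance, for a hypothetical pair of files named `gut_rep1_tumor_R1.fastq.gz` and `gut_rep1_tumor_R2.fastq.gz`, `.view()` would print something like:

```default
[gut_rep1_tumor, [/.../gut_rep1_tumor_R1.fastq.gz, /.../gut_rep1_tumor_R2.fastq.gz]]
```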

The ID is stored as a simple string. We'd like to move to using a map of key-value pairs, because we have more than one piece of metadata to track. In this example, we have sample, replicate, tumor/normal, and treatment. We could add extra elements to the tuple, but this changes the 'cardinality' of the elements in the channel, and adding extra elements would require updating all downstream processes. A map is a single object and is passed through Nextflow channels as one value, so adding extra metadata fields will not require us to change the cardinality of the downstream processes.
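
As a sketch of the difference (with assumed values), extra tuple elements change the shape of every channel item, while a meta map keeps each item at two elements no matter how many fields we track:

```default
// extra tuple elements: cardinality grows with each new field
['gut', 'rep1', 'tumor', [R1, R2]]

// meta map: always [meta, reads]
[[sample: 'gut', replicate: 'rep1', type: 'tumor'], [R1, R2]]
```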

There are a couple of different ways we can pull out the metadata.

We can use the `tokenize` method to split our ID. To sanity-check, we just pipe the result directly into the `view` operator:

```default
workflow {
    Channel.fromFilePairs("/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz")
        .map { id, reads ->
            tokens = id.tokenize("_")
        }
        .view()
}
```
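
Since the assignment is the last expression in the closure, `map` emits the token list; for a hypothetical ID like `gut_rep1_tumor`, `.view()` would print something like:

```default
[gut, rep1, tumor]
```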

If we are confident about the stability of the naming scheme, we can destructure the list returned by `tokenize` and assign the parts to variables directly:

```default
map { id, reads ->
    (sample, replicate, type) = id.tokenize("_")
    meta = [sample:sample, replicate:replicate, type:type]
    [meta, reads]
}
```

::: callout-note
Make sure that you're using a tuple with parentheses, e.g. `(one, two)`, rather than a List, e.g. `[one, two]`.
:::
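
A minimal Groovy illustration of the difference (a standalone example, not part of the workshop code):

```default
// multiple assignment needs parentheses on the left-hand side;
// a List literal like [one, two] is not valid there
def (one, two) = ["first", "second"]
println one   // prints: first
println two   // prints: second
```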

Moving back to the previous example, if we decide that the 'rep' prefix on the replicate should be removed, we can use regular expressions to simply "subtract" pieces of a string. Here we remove the 'rep' prefix from the replicate variable if the prefix is present:

```default
map { id, reads ->
    (sample, replicate, type) = id.tokenize("_")
    replicate -= ~/^rep/
    meta = [sample:sample, replicate:replicate, type:type]
    [meta, reads]
}
```

Setting up the meta map in our tuple with the format above allows us to access the values throughout our modules/configs, e.g. the sample name as `${meta.sample}`.
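
For example, a process might use the meta map in its tag and script block (a minimal sketch, assuming the `[meta, reads]` channel shape from above):

```default
process EXAMPLE {
    // label each task with its sample name in the run log
    tag "${meta.sample}"

    input:
    tuple val(meta), path(reads)

    output:
    path("*.txt")

    script:
    """
    echo "sample: ${meta.sample} replicate: ${meta.replicate} type: ${meta.type}" > ${meta.sample}.txt
    """
}
```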

### **7.1.2 Second Pass: `.splitCsv`**

We briefly touched on `.splitCsv` in the first week. As a quick overview, assume we have the samplesheet:

```default
sample_name,fastq1,fastq2
gut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq
liver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq
lung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq
```

We can set up a workflow to read in these files as:

```default
params.reads = "/.../rnaseq_samplesheet.csv"

reads_ch = Channel.fromPath(params.reads)
reads_ch.view()
reads_ch = reads_ch.splitCsv(header:true)
reads_ch.view()
```
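
The first `.view()` prints the samplesheet path itself; after `.splitCsv(header:true)`, the second `.view()` prints one map per row, keyed by the header, something like:

```default
[sample_name:gut_sample, fastq1:/.../gut_1.fq, fastq2:/.../gut_2.fq]
[sample_name:liver_sample, fastq1:/.../liver_1.fq, fastq2:/.../liver_2.fq]
[sample_name:lung_sample, fastq1:/.../lung_1.fq, fastq2:/.../lung_2.fq]
```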

::: callout-tip
## Challenge{.unlisted}

Using `.splitCsv` and `.map`, read in the samplesheet below:

`/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/samplesheet.csv`

Set the meta to contain the following keys from the header: `id`, `repeat` and `type`.
:::

:::{.callout-caution collapse="true"}
## Solution

```default
params.input = "/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/samplesheet.csv"

ch_sheet = Channel.fromPath(params.input)

ch_sheet.splitCsv(header:true)
    .map { row ->
        [ [id: row.id, repeat: row.repeat, type: row.type], row.fastq_1, row.fastq_2 ]
    }.view()
```
:::

## **7.2 Manipulating Metadata and Channels**

There are a number of use cases where we will be interested in manipulating our metadata and channels.

Here we will look at two use cases.

### **7.2.1 Matching input channels**

As we have seen in the examples and challenges in the operators section, it is important to ensure that the format of the channels you provide as inputs matches the process definition.

```default
params.reads = "/home/Shared/For_NF_Workshop/training/nf-training/data/ggal/*_{1,2}.fq"

process printNumLines {
    input:
    path(reads)

    output:
    path("*txt")

    script:
    """
    wc -l ${reads}
    """
}

workflow {
    ch_input = Channel.fromFilePairs("$params.reads")
    printNumLines( ch_input )
}
```

If the format does not match, you will see an error, or a stalled process, similar to the one below:

```default
[myeung@papr-res-compute204 lesson7.1test]$ nextflow run test.nf
N E X T F L O W  ~  version 23.04.1
Launching `test.nf` [agitated_faggin] DSL2 - revision: c210080493
[-        ] process > printNumLines -
```

or, if using the nf-core template:

```default
ERROR ~ Error executing process > 'PMCCCGTRC_UMIHYBCAP:UMIHYBCAP:PREPARE_GENOME:BEDTOOLS_SLOP'

Caused by:
  Not a valid path value type: java.util.LinkedHashMap ([id:genome_size])


Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
```

When encountering these errors, there are two ways to correct them:

1. Change the `input` definition in the process
2. Use channel operators to correct the format of your channel

There are cases where changing the `input` definition is impractical (e.g. when using nf-core modules/subworkflows); in those cases, reshaping the channel with operators is the way to go, as sketched below.
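
Reusing the `printNumLines` example above, a `.map` can reshape each `[id, reads]` tuple into the `[meta, reads]` form expected by an input declared as `tuple val(meta), path(reads)` (a sketch, assuming that input definition):

```default
workflow {
    ch_input = Channel.fromFilePairs("$params.reads")

    // [ id, [R1, R2] ]  ->  [ [id: id], [R1, R2] ]
    ch_samples = ch_input.map { id, reads -> [ [id: id], reads ] }
}
```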

Let's take a look at some select modules:

[`BEDTOOLS_SLOP`](https://github.com/nf-core/modules/blob/master/modules/nf-core/bedtools/slop/main.nf)

[`BEDTOOLS_INTERSECT`](https://github.com/nf-core/modules/blob/master/modules/nf-core/bedtools/intersect/main.nf)

::: callout-tip
## Challenge{.unlisted}

Assuming that you have the following inputs:

```default
ch_target = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals.bed")
ch_bait = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals2.bed").map { fn -> [ [id: fn.baseName ], fn ] }
ch_sizes = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/genome.sizes")
```

Write a mini workflow that:

1. Takes the `ch_target` bedfile and extends the bed by 20bp on both sides using `BEDTOOLS_SLOP` (you can use the config definition below as a helper, or write your own as an additional challenge)
2. Takes the output from `BEDTOOLS_SLOP` and provides it, together with `ch_bait`, to `BEDTOOLS_INTERSECT`

HINT: The modules can be imported from this location: `/home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools`

HINT: You will need the following operators to achieve this: `.map` and `.combine`
:::

::: {.callout-note collapse="true"}
## Config

```default
process {
    withName: 'BEDTOOLS_SLOP' {
        ext.args = "-b 20"
        ext.prefix = "extended.bed"
    }

    withName: 'BEDTOOLS_INTERSECT' {
        ext.prefix = "intersect.bed"
    }
}
```
:::

:::{.callout-caution collapse="true"}
## **Solution**

```default
include { BEDTOOLS_SLOP } from '/home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools/slop/main'
include { BEDTOOLS_INTERSECT } from '/home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools/intersect/main'

ch_target = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals.bed")
ch_bait = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals2.bed").map { fn -> [ [id: fn.baseName ], fn ] }
ch_sizes = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/genome.sizes")

workflow {
    BEDTOOLS_SLOP ( ch_target.map{ fn -> [ [id: fn.baseName], fn ] }, ch_sizes )

    target_bait_bed = BEDTOOLS_SLOP.out.bed.combine( ch_bait )
    BEDTOOLS_INTERSECT ( target_bait_bed, ch_sizes.map{ fn -> [ [id: fn.baseName], fn ] } )
}
```

This can be run with:

```default
nextflow run nfcoretest.nf -profile singularity -c test2.config --outdir nfcoretest
```
:::

## **7.3 Grouping with Metadata**

Earlier we introduced the `groupTuple` operator:

```default
ch_reads = Channel.fromFilePairs("/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz")
    .map { id, reads ->
        (sample, replicate, type) = id.tokenize("_")
        replicate -= ~/^rep/
        meta = [sample:sample, replicate:replicate, type:type]
        [meta, reads]
    }

// Assume that we want to drop replicate from the meta and combine fastqs
ch_reads
    .map { meta, reads ->
        [ meta - meta.subMap('replicate') + [data_type: 'fastq'], reads ]
    }
    .groupTuple()
    .view()
```
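
Assuming two replicates per sample/type combination, each grouped item would look something like (paths abbreviated):

```default
[[sample:gut, type:tumor, data_type:fastq], [[/.../gut_rep1_tumor_R1.fastq.gz, /.../gut_rep1_tumor_R2.fastq.gz], [/.../gut_rep2_tumor_R1.fastq.gz, /.../gut_rep2_tumor_R2.fastq.gz]]]
```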