# Project Organization
```{r include = FALSE}
knitr::opts_chunk$set(
engine.opts = list(bash = "-i"))
```
**Note: this .Rmd is run in bash because it contains several bash-specific commands (and no zsh-specific ones). If your default shell is zsh, type `bash` in your terminal to switch shells before you begin.**
## Getting your project started
Project organization is one of the most important parts of a sequencing project, and yet is often overlooked amidst the
excitement of getting a first look at new data. Of course, while it's best to get yourself organized before you even begin your analyses,
it's never too late to start, either.
You should approach your sequencing project as you would a biological experiment: ideally, it begins with experimental design. We're going to assume that you've already designed a beautiful
sequencing experiment to address your biological question, collected appropriate samples, and that you have
enough statistical power to answer the questions you're interested in asking. These
steps are all incredibly important, but beyond the scope of our course.
For all of those steps (collecting specimens, extracting DNA, prepping your samples)
you've likely kept a lab notebook that details how and why you did each step. However, the process of documentation doesn't stop at
the sequencer!
Genomics projects can quickly accumulate hundreds of files across
tens of folders. Every computational analysis you perform over the course of your project is going to create
many files, which can especially become a problem when you'll inevitably want to run some of those
analyses again. For instance, you might have made significant headway into your project, but then have to remember the PCR conditions
you used to create your sequencing library months prior.
Other questions might arise along the way:
- What were your best alignment results?
- Which folder were they in: Analysis1, AnalysisRedone, or AnalysisRedone2?
- Which quality cutoff did you use?
- What version of a given program did you implement your analysis in?
Good documentation is key to avoiding this issue, and luckily enough,
recording your computational experiments is even easier than recording lab data. Copy/Paste will become
your best friend, sensible file names will make your analysis understandable by you and your collaborators, and
writing the methods section for your next paper will be easy! Remember that in any given project of yours, it's worthwhile to consider
a future version of yourself as an entirely separate collaborator. The better your documentation is, the more this 'collaborator' will
feel indebted to you!
With this in mind, let's have a look at the best practices for
documenting your genomics project. Your future self will thank you.
In this exercise we will set up a file system for the project we will be working on during the variant calling section.
We will start by creating a directory for that project. First navigate to your home directory, then confirm that you are in the correct directory using the `pwd` command.
```{bash}
cd ~
pwd
```
You should see the output:
/Users/your username
## Tip
If you aren't in your home directory, the easiest way to get there is to enter the command `cd ~`, which
always returns you to home.
## Exercise
Use the `mkdir` command to make the following directories:
- `dc_workshop`
- `dc_workshop/docs`
- `dc_workshop/data`
- `dc_workshop/results`
**Note: if you've already downloaded the fastq files into ~/dc_workshop/data/untrimmed_fastq be careful not to remake the data directory**
## Solution
```{bash}
mkdir -p ~/dc_workshop
mkdir -p ~/dc_workshop/docs
mkdir -p ~/dc_workshop/data
mkdir -p ~/dc_workshop/results
```
Use `ls -R` to verify that you have created these directories. The `-R` option for `ls` stands for recursive. This option causes
`ls` to list the contents of the directory and of every subdirectory within it.
```{bash}
cd ~
ls -R dc_workshop
```
You should see the following output:
dc_workshop/:
data docs results
dc_workshop/data:
dc_workshop/docs:
dc_workshop/results:
## Organizing your files
Before beginning any analysis, it's important to save a copy of your
raw data. The raw data should never be changed. Regardless of how
sure you are that you want to carry out a particular data cleaning
step, there's always the chance that you'll change your mind later
or that there will be an error in carrying out the data cleaning and
you'll need to go back a step in the process. Having a raw copy of
your data that you never modify guarantees that you will always be
able to start over if something goes wrong with your analysis. When
starting any analysis, you can make a copy of your raw data file and
do your manipulations on that file, rather than the raw version. We
learned in [a previous episode](http://www.datacarpentry.org/shell-genomics/03-working-with-files/#file-permissions) how to prevent overwriting our raw data
files by setting restrictive file permissions.
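As a quick sketch of that idea, you can remove write permission from a raw data file with `chmod`; the file name below is hypothetical, and the example cleans up after itself:

```{bash}
touch example_raw.fastq          # stand-in for a raw data file
chmod a-w example_raw.fastq      # remove write permission for everyone
ls -l example_raw.fastq          # the permission string now shows no 'w' bits
chmod u+w example_raw.fastq      # restore write permission so we can clean up
rm example_raw.fastq
```

In practice you would run `chmod a-w` once on each raw file right after downloading it and leave it that way.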
You can store any results that are generated from your analysis in
the `results` folder. This guarantees that you won't confuse results
files and data files in six months or two years when you are looking
back through your files in preparation for publishing your study.
The `docs` folder is the place to store any written analysis of your
results, notes about how your analyses were carried out, and
documents related to your eventual publication.
## Documenting your activity on the project
When carrying out wet-lab analyses, most scientists work from a
written protocol and keep a hard copy of written notes in their lab
notebook, including any things they did differently from the
written protocol. This detailed
record-keeping process is just as important when doing computational
analyses. Luckily, it's even easier to record the steps you've
carried out computationally than it is when working at the bench.
The `history` command is a convenient way to document all the
commands you have used while analyzing and manipulating your project
files. Let's document the work we have done on our project so far.
View the commands that you have used so far during this session using `history`:
```{bash}
history | head -n 10
```
The history likely contains many more commands than you have used for the current project. Let's view just the
commands relevant to this project.
View the last n lines of your history, where n is roughly the number of recent lines you think are relevant. For our example, we will use the last 7:
```{bash}
history | tail -n 7
```
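If your history is long, you can also filter it for the commands you care about instead of counting lines. A sketch using `grep`; the pattern `dc_workshop` is just an example, and you can substitute any keyword:

```{bash, eval=FALSE, engine="sh"}
# show only the history entries that mention the project directory
history | grep "dc_workshop"
```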
## Exercise
Using your knowledge of the shell, use the append redirect `>>` to create a file called
`dc_workshop_log_XXXX_XX_XX.sh` (use the four-digit year, two-digit month, and two-digit day, e.g.
`dc_workshop_log_2022_11_27.sh`).
## Solution
```{bash}
history | tail -n 7 >> ~/dc_workshop_log_2022_11_27.sh
```
Note that we used the last 7 lines as an example; the number of relevant lines in your history may vary.
You may have noticed that your history contains the `history` command itself. To remove this redundancy
from our log, let's use the `nano` text editor to fix the file:
```{bash, eval=FALSE, engine="sh"}
nano dc_workshop_log_2022_11_27.sh
```
(Remember to replace the `XXXX_XX_XX` with your workshop date.)
From the `nano` screen, you can use your cursor to navigate, type, and delete any redundant lines.
## Navigating in Nano
Although `nano` is useful, it can be frustrating to edit documents, as you
can't use your mouse to navigate to the part of the document you would like to edit.
Here are some useful keyboard shortcuts for moving around within a text document in
`nano`. You can find more information by typing <kbd>Ctrl</kbd>-<kbd>G</kbd> within `nano`.
| key | action |
| ------- | ---------- |
| <kbd>Ctrl</kbd>-<kbd>Space</kbd> OR <kbd>Ctrl</kbd>-<kbd>→</kbd> | to move forward one word |
| <kbd>Alt</kbd>-<kbd>Space</kbd> OR <kbd>Esc</kbd>-<kbd>Space</kbd> OR <kbd>Ctrl</kbd>-<kbd>←</kbd> | to move back one word |
| <kbd>Ctrl</kbd>-<kbd>A</kbd> | to move to the beginning of the current line |
| <kbd>Ctrl</kbd>-<kbd>E</kbd> | to move to the end of the current line |
| <kbd>Ctrl</kbd>-<kbd>W</kbd> | to search |
Add a date line and a comment to the line where you created the directories. Recall that any
text on a line after a `#` is ignored by bash when evaluating the text as code. For example:
`#` 2022_11_27 Created sample directories for the Data Carpentry workshop
(You don't need the backticks; they are only here because otherwise R Markdown would interpret the `#` as a header.)
Next, remove any lines of the history that are not relevant by navigating to those lines and using your
delete key. Save your file and close `nano`.
Your file should look something like this:

    # 2022_11_27
    # Created sample directories for the Data Carpentry workshop
    cd ~
    mkdir -p dc_workshop
    mkdir -p dc_workshop/docs
    mkdir -p dc_workshop/data
    mkdir -p dc_workshop/results

**Note: the `-p` option means `mkdir` won't give an error if the directory already exists.**
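To see why `-p` matters here, note that plain `mkdir` fails on an existing directory while `mkdir -p` succeeds silently. A quick demonstration with a throwaway directory name (it cleans up after itself):

```{bash}
mkdir -p demo_dir   # first call creates the directory
mkdir -p demo_dir   # second call succeeds silently -- no error
mkdir demo_dir 2>/dev/null || echo "plain mkdir fails: directory exists"
rmdir demo_dir      # clean up
```

This is what makes the log file safe to re-run as a script even when some of the directories already exist.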
If you keep this file up to date, you can use it to re-do your work on your project if something happens to your results files. To demonstrate how this works, first delete
your `dc_workshop` directory and all of its subdirectories. Look at your directory
contents to verify the directory is gone.
**Note: I have this as eval=FALSE, because I have files saved there already, if you don't you should run it.**
```{bash, eval=FALSE, engine="sh"}
cd ~
rm -r dc_workshop
ls
```
Then run your workshop log file as a bash script. You should see the `dc_workshop`
directory and all of its subdirectories reappear.
**Note: I have this as eval=FALSE, but you'll want to run it in your terminal, unless you have files saved there already, then just modify your script accordingly.**
```{bash, eval=FALSE, engine="sh"}
cd ~
bash dc_workshop_log_2022_11_27.sh
ls
```
shell_data dc_workshop dc_workshop_log_2022_11_27.sh
It's important that we keep our workshop log file outside of our `dc_workshop` directory
if we want to use it to recreate our work. It's also important for us to keep it up to
date by regularly updating with the commands that we used to generate our results files.
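One low-effort way to keep the log current is to append the latest commands at the end of each work session. The count of 3 and the date in the file name below are just examples; adjust both to fit your session:

```{bash, eval=FALSE, engine="sh"}
# append the last 3 commands from this session to the project log
history | tail -n 3 >> ~/dc_workshop_log_2022_11_27.sh
```

Remember to open the file afterwards and strip out the `history` command itself, as we did above.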
Congratulations! You've finished your introduction to using the shell for genomics
projects. You now know how to navigate your file system, create, copy, move,
and remove files and directories, and automate repetitive tasks using scripts and
wildcards. With this solid foundation, you're ready to move on to apply all of these new
skills to carrying out more sophisticated bioinformatics
analysis work. Don't worry if everything doesn't feel perfectly comfortable yet. We're
going to have many more opportunities for practice as we move forward on our
bioinformatics journey!
**If you don't already have the files for wrangling-genomics, run the following commands in the TERMINAL.**
First make this directory in your home directory by typing
```{bash, eval=FALSE, engine="sh"}
mkdir -p ~/dc_workshop/data/untrimmed_fastq
```
Then download these files into that directory (this takes about 5 minutes to run):
**macOS**
```{bash, eval=FALSE, engine="sh"}
cd ~/dc_workshop/data/untrimmed_fastq
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz
```
**Windows**
```{bash, eval=FALSE, engine="sh"}
cd ~/dc_workshop/data/untrimmed_fastq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz
```
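After the downloads finish, it's worth confirming that all six files arrived and are non-empty before moving on. A quick check, assuming the directory used above:

```{bash, eval=FALSE, engine="sh"}
cd ~/dc_workshop/data/untrimmed_fastq
ls -lh *.fastq.gz          # each file should show a non-zero size
ls *.fastq.gz | wc -l      # should print 6
```

If the count is short or a file is zero bytes, re-run the matching `curl`/`wget` command for that file.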
We will also download a set of trimmed FASTQ files to work with. These are small subsets of our real trimmed data,
and will enable us to run our variant calling workflow quite quickly.
I couldn't get this download to work from the command line. If you have the same problem, download the `sub.tar.gz` file manually from the site in your browser, then run the commands below starting with `mkdir -p`.
**MacOS & Windows**
```{bash, eval=FALSE, engine="sh"}
cd ~/dc_workshop
curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
mkdir -p ~/dc_workshop/data/trimmed_fastq_small
tar -xvf sub.tar.gz
mv -v sub/* ~/dc_workshop/data/trimmed_fastq_small
rm -r sub
```
## References
[A Quick Guide to Organizing Computational Biology Projects](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424)