-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03-working-with-files.Rmd
549 lines (317 loc) · 16.5 KB
/
03-working-with-files.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
# Working with Files and Directories
```{r include = FALSE}
knitr::opts_chunk$set(
engine = list(bash = 'bash',zsh ='zsh'),
engine.path = list(zsh = '/bin/zsh',bash ='/bin/bash'),
engine.opts = list(bash = "-i", zsh = "-i"))
```
## Working with Files
### Our data set: FASTQ files
Now that we know how to navigate around our directory structure, let's
start working with our sequencing files. We did a sequencing experiment and
have two results files, which are stored in our `untrimmed_fastq` directory.
### Wildcards
Navigate to your `untrimmed_fastq` directory:
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
```
We are interested in looking at the FASTQ files in this directory. We can list
all files with the .fastq extension using the command:
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
ls *.fastq
```
The `*` character is a special type of character called a wildcard, which can be used to represent any number of any type of character.
Thus, `*.fastq` matches every file that ends with `.fastq`.
This command:
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
ls *977.fastq
```
lists only the file that ends with `977.fastq`.
This command:
```{zsh, engine.opts="-i"}
ls /usr/bin/*.sh
```
Lists every file in `/usr/bin` that ends in the characters `.sh`.
Note that the output displays __full__ paths to files, since
each result starts with `/`.
## Exercise
Do each of the following tasks from your current directory using a single
`ls` command for each:
1. List all of the files in `/usr/bin` that start with the letter 'c'.
2. List all of the files in `/usr/bin` that contain the letter 'a'.
3. List all of the files in `/usr/bin` that end with the letter 'o'.
Bonus: List all of the files in `/usr/bin` that contain the letter 'a' or the
letter 'c'.
Hint: The bonus question requires a Unix wildcard that we haven't talked about
yet. Try searching the internet for information about Unix wildcards to find
what you need to solve the bonus problem.
## Solution
## Exercise
`echo` is a built-in shell command that writes its arguments, like a line of text to standard output.
The `echo` command can also be used with pattern matching characters, such as wildcard characters.
Here we will use the `echo` command to see how the wildcard character is interpreted by the shell.
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
echo *.fastq
```
The `*` is expanded to include any file that ends with `.fastq`. We can see that the output of
`echo *.fastq` is the same as that of `ls *.fastq`.
What would the output look like if the wildcard could *not* be matched? Compare the outputs of
`echo *.missing` and `ls *.missing`.
## Solution
## Command History
If you want to repeat a command that you've run recently, you can access previous
commands using the up arrow on your keyboard to go back to the most recent
command. Likewise, the down arrow takes you forward in the command history.
A few more useful shortcuts:
- <kbdCtrl</kbd+<kbdC</kbdwill cancel the command you are writing, and give you a
fresh prompt.
- <kbdCtrl</kbd+<kbdR</kbd will do a reverse-search through your command history. This
is very useful.
- <kbdCtrl</kbd+<kbdL</kbdor the `clear` command will clear your screen.
You can also review your recent commands with the `history` command, by entering:
```{bash, engine.opts="-i"}
history 10
```
to see a numbered list of recent commands. You can reuse one of these commands
directly by referring to the number of that command.
For example, if your history looked like this:
259 ls *
260 ls /usr/bin/*.sh
261 ls *R1*fastq
then you could repeat command #0 by entering:
```{bash, eval=FALSE, engine="sh"}
!50
```
Type `!` (exclamation point) and then the number of the command from your history.
You will be glad you learned this when you need to re-run very complicated commands.
For more information on advanced usage of `history`, read section 9.3 of
[Bash manual](https://www.gnu.org/software/bash/manual/html_node/index.html).
## Exercise
Find the line number in your history for the command that listed all the .sh
files in `/usr/bin`. Rerun that command.
## Solution
Examining Files
We now know how to switch directories, run programs, and look at the
contents of directories, but how do we look at the contents of files?
One way to examine a file is to print out all of the
contents using the program `cat`. We don't want to do
that with a fastq file because they are too big. We can
use head with the option -n for the first n lines, e.g.
for lines 1 to 10, head -n 10. If we want to print specific
lines we can use sed
by sed with the option -n followed by the specific
line numbers you want to print, e.g. for lines 3 to 8 you
would write sed -n 3,8p
Enter the following command from within the `untrimmed_fastq` directory:
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
sed -n 3,8p SRR098026.fastq
```
This will print out the 3rd and 8th lines of the `SRR098026.fastq ` to the screen.
`cat` is a terrific program, but when the file is really big, it can
be annoying to use. The program, `less`, is useful for this
case. `less` opens the file as read only, and lets you navigate through it. The navigation commands
are identical to the `man` program.
Enter the following command:
```{zsh, eval=FALSE, engine="sh"}
cd ~/shell_data/untrimmed_fastq
less SRR097977.fastq
q
```
Some navigation commands in `less`:
| key | action |
| ------- | ---------- |
| <kbd>Space</kbd> | to go forward |
| <kbd>b</kbd> | to go backward |
| <kbd>g</kbd> | to go to the beginning |
| <kbd>G</kbd> | to go to the end |
| <kbd>q</kbd> | to quit |
`less` also gives you a way of searching through files. Use the
"/" key to begin a search. Enter the word you would like
to search for and press `enter`. The screen will jump to the next location where
that word is found.
**Shortcut:**
If you hit "/" then "enter", `less` will repeat
the previous search. `less` searches from the current location and
works its way forward. Scroll up a couple lines on your terminal to verify
you are at the beginning of the file. Note, if you are at the end of the file and search
for the sequence "CAA", `less` will not find it. You either need to go to the
beginning of the file (by typing `g`) and search again using `/` or you
can use `?` to search backwards in the same way you used `/` previously.
For instance, let's search forward for the sequence `TTTTT` in our file.
You can see that we go right to that sequence, what it looks like,
and where it is in the file. If you continue to type `/` and hit return, you will move
forward to the next instance of this sequence motif. If you instead type `?` and hit
return, you will search backwards and move up the file to previous examples of this motif.
## Exercise
What are the next three nucleotides (characters) after the first instance of the sequence quoted above?
## Solution
Remember, the `man` program actually uses `less` internally and
therefore uses the same commands, so you can search documentation
using "/" as well!
There's another way that we can look at files, and in this case, just
look at part of them. This can be particularly useful if we just want
to see the beginning or end of the file, or see how it's formatted.
The commands are `head` and `tail` and they let you look at
the beginning and end of a file, respectively.
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
head -n 4 SRR098026.fastq
```
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
tail -n 1 SRR098026.fastq
```
The `-n` option to either of these commands can be used to print the
first or last `n` lines of a file.
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
head -n 1 SRR098026.fastq
```
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
tail -n 1 SRR098026.fastq
```
## Details on the FASTQ format
Although it looks complicated (and it is), it's easy to understand the
[fastq](https://en.wikipedia.org/wiki/FASTQ_format) format with a little decoding. Some rules about the format
include...
|Line|Description|
|----|-----------|
|1|Always begins with '@' and then information about the read|
|2|The actual DNA sequence|
|3|Always begins with a '+' and sometimes the same info in line 1|
|4|Has a string of characters which represent the quality scores; must have same number of characters as line 2|
We can view the first complete read in one of the files in our dataset by using `head` to look at
the first four lines.
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
head -n 4 SRR098026.fastq
```
Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the
probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call
accuracy (e.g. 90%). To make it possible to line up each individual nucleotide with its quality
score, the numerical score is converted into a code where each individual character
represents the numerical quality score for an individual nucleotide.
The `#` character and each of the `!` characters represent the encoded quality for an
individual nucleotide. The numerical value assigned to each of these characters depends on the
sequencing platform that generated the reads. The sequencing machine used to generate our data
uses the standard Sanger quality PHRED score encoding, Illumina version 1.8 onwards.
Each character is assigned a quality score between 0 and 42 as shown in the chart below.
Quality encoding: !"#$%&'()*+,-./0123456789:;<=?@ABCDEFGHIJK
| | | | |
Quality score: 0........10........20........30........40..
Each quality score represents the probability that the corresponding nucleotide call is
incorrect. This quality score is logarithmically based, so a quality score of 10 reflects a
base call accuracy of 90%, but a quality score of 20 reflects a base call accuracy of 99%.
These probability values are the results from the base calling algorithm and dependent on how
much signal was captured for the base incorporation.
## Exercise
1. What is the last line of the `~/shell_data/untrimmed_fastq/SRR097977.fastq` file?
2. From your home directory, and without changing directories,
use one short command to print the contents of all of the files in
the `~/shell_data/untrimmed_fastq` directory into one file.
## Solution
## Creating, moving, copying, and removing
Now we can move around in the file structure, look at files, and search files. But what if we want to copy files or move
them around or get rid of them? Most of the time, you can do these sorts of file manipulations without the command line,
but there will be some cases where it will be
impossible. You'll also find that you may be working with hundreds of files and want to do similar manipulations to all
of those files. In cases like this, it's much faster to do these operations at the command line.
### Copying Files
When working with computational data, it's important to keep a safe copy of that data that can't be accidentally overwritten or deleted.
For this lesson, our raw data is our FASTQ files. We don't want to accidentally change the original files, so we'll make a copy of them
and change the file permissions so that we can read from, but not write to, the files.
First, let's make a copy of one of our FASTQ files using the `cp` command.
Navigate to the `shell_data/untrimmed_fastq` directory and enter:
cp SRR098026.fastq SRR098026-copy.fastq .fastq
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
cp SRR098026.fastq SRR098026-copy.fastq
```
We now have two copies of the `SRR098026.fastq ` file, one of them named `SRR098026-copy.fastq .fastq`. We'll move this file to a new directory
called `backup` where we'll store our backup data files.
### Creating Directories
The `mkdir` command is used to make a directory. Enter `mkdir`
followed by a space, then the directory name you want to create:
```{zsh, eval=FALSE, engine="sh"}
cd ~/shell_data/untrimmed_fastq
mkdir backup
```
### Moving / Renaming
We can now move our backup file to this directory. We can
move files around using the command `mv`:
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
cp SRR098026.fastq SRR098026-copy.fastq
mkdir -p backup
mv SRR098026-copy.fastq backup
ls backup
```
The `mv` command is also how you rename files. Let's rename this file to make it clear that this is a backup:
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq/backup
mv SRR098026-copy.fastq SRR098026-backup.fastq
ls
```
### File Permissions
We've now made a backup copy of our file, but just because we have two copies, it doesn't make us safe. We can still accidentally delete or
overwrite both copies. To make sure we can't accidentally mess up this backup file, we're going to change the permissions on the file so
that we're only allowed to read (i.e. view) the file, not write to it (i.e. make new changes).
View the current permissions on a file using the `-l` (long) flag for the `ls` command:
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
ls -l
```
The first part of the output for the `-l` flag gives you information about the file's current permissions. There are ten slots in the
permissions list. The first character in this list is related to file type, not permissions, so we'll ignore it for now. The next three
characters relate to the permissions that the file owner has, the next three relate to the permissions for group members, and the final
three characters specify what other users outside of your group can do with the file. We're going to concentrate on the three positions
that deal with your permissions (as the file owner).

Here the three positions that relate to the file owner are `rw-`. The `r` means that you have permission to read the file, the `w`
indicates that you have permission to write to (i.e. make changes to) the file, and the third position is a `-`, indicating that you
don't have permission to carry out the ability encoded by that space (this is the space where `x` or executable ability is stored, we'll
talk more about this in [a later lesson](http://www.datacarpentry.org/shell-genomics/05-writing-scripts/)).
Our goal for now is to change permissions on this file so that you no longer have `w` or write permissions. We can do this using the `chmod` (change mode) command and subtracting (`-`) the write permission `-w`.
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq/backup
chmod -w SRR098026-backup.fastq
ls -l
```
### Removing
To prove to ourselves that you no longer have the ability to modify this file, try deleting it with the `rm` command:
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq/backup
rm SRR098026-backup.fastq
```
You'll be asked if you want to override your file permissions, or it just won't remove the file.
If it does ask, you should enter `n` for no. If you enter `n` (for no), the file will not be deleted. If you enter `y`, you will delete the file. This gives us an extra
measure of security, as there is one more step between us and deleting our data files.
**Important**: The `rm` command permanently removes the file. Be careful with this command. It doesn't
just nicely put the files in the Trash. They're really gone.
By default, `rm` will not delete directories. You can tell `rm` to
delete a directory using the `-r` (recursive) option. Let's delete the backup directory
we just made.
Enter the following command:
The -r tells it to remove the directory...
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq/
rm -r backup
```
Now you'll be asked if you want to override your file permissions for backup/SRR098026-backup.fastq
This will delete not only the directory, but all files within the directory. If you have write-protected files in the directory,
you will be asked whether you want to override your permission settings.
I'm going to say yes because you're going to regenerate it in the next exercise!
## Exercise
Starting in the `shell_data/untrimmed_fastq/` directory, do the following:
1. Make sure that you have deleted your backup directory and all files it contains.
2. Create a backup of SRR098026.fastq and SRR097977.fastq using `cp`. (Note: You'll need to do this individually for each of the two FASTQ files. We haven't
learned yet how to do this
with a wildcard.)
3. Use a wildcard to move all of your backup files to a new backup directory.
4. Change the permissions on all of your backup files to be write-protected.
## Solution