-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05-writing-scripts.Rmd
300 lines (171 loc) · 10.9 KB
/
05-writing-scripts.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
# Writing Scripts and Working with Data
```{r include = FALSE}
knitr::opts_chunk$set(
engine = list('zsh'),
engine.path = list('/bin/zsh'),
engine.opts = list(zsh = "-i")
)
```
## Writing files
We've been able to do a lot of work with files that already exist, but what if we want to write our own files? We're not going to type in a FASTA file, but we'll see as we go through other tutorials, there are a lot of reasons we'll want to write a file, or edit an existing file.
To add text to files, we're going to use a text editor called Nano. We're going to create a file to take notes about what we've been doing with the data files in `~/shell_data/untrimmed_fastq`.
This is good practice when working in bioinformatics. We can create a file called `README.txt` that describes the data files in the directory or documents how the files in that directory were generated. As the name suggests, it's a file that we or others should read to understand the information in that directory.
Let's change our working directory to `~/shell_data/untrimmed_fastq` using `cd`,
then run `nano` to create a file called `README.txt`:
```{zsh, eval=FALSE, engine="sh"}
cd ~/shell_data/untrimmed_fastq
nano README.txt
```
You should see something like this:

The text at the bottom of the screen shows the keyboard shortcuts for performing various tasks in `nano`. We will talk more about how to interpret this information soon.
## Which Editor?
When we say, "`nano` is a text editor," we really do mean "text": `nano` can
only work with plain character data, not tables, images, or any other
human-friendly media. We use `nano` in examples because it is one of the
least complex text editors. However, because of this trait, `nano` may
not be powerful enough or flexible enough for the work you need to do
after this workshop. On Unix systems (such as Linux and Mac OS X),
many programmers use [Emacs](http://www.gnu.org/software/emacs/) or
[Vim](http://www.vim.org/) (both of which require more time to learn),
or a graphical editor such as
[Gedit](http://projects.gnome.org/gedit/). On Windows, you may wish to
use [Notepad++](http://notepad-plus-plus.org/). Windows also has a built-in
editor called `notepad` that can be run from the command line in the same
way as `nano` for the purposes of this lesson.
No matter what editor you use, you will need to know the default location where it searches
for files and where files are saved. If you start an editor from the shell, it will (probably)
use your current working directory as its default location. If you use
your computer's start menu, the editor may want to save files in your desktop or
documents directory instead. You can change this by navigating to
another directory the first time you "Save As..."
Let's type in a few lines of text. Describe what the files in this
directory are or what you've been doing with them.
Once we're happy with our text, we can press <kbd>Ctrl</kbd>-<kbd>O</kbd> (press the <kbd>Ctrl</kbd> or <kbd>Control</kbd> key and, while
holding it down, press the <kbd>O</kbd> key) to write our data to disk. You'll be asked what file we want to save this to:
press <kbd>Return</kbd> to accept the suggested default of `README.txt`.
Once our file is saved, we can use <kbd>Ctrl</kbd>-<kbd>X</kbd> to quit the `nano` editor and
return to the shell.
## Control, Ctrl, or ^ Key
The Control key is also called the "Ctrl" key. There are various ways
in which using the Control key may be described. For example, you may
see an instruction to press the <kbd>Ctrl</kbd> key and, while holding it down,
press the <kbd>X</kbd> key, described as any of:
*`Control-X`*
*`Control+X`*
*`Ctrl-X`*
*`Ctrl+X`*
*`^X`*
*`C-x`*
In `nano`, along the bottom of the screen you'll see `^G Get Help ^O WriteOut`.
This means that you can use <kbd>Ctrl</kbd>-<kbd>G</kbd> to get help and <kbd>Ctrl</kbd>-<kbd>O</kbd> to save your
file.
Now you've written a file. You can take a look at it with `less` or `cat`, or open it up again and edit it with `nano`.
## Exercise
Open `README.txt` and add the date to the top of the file and save the file.
## Solution
## Writing scripts
A really powerful thing about the command line is that you can write scripts. Scripts let you save commands to run them and also lets you put multiple commands together. Though writing scripts may require an additional time investment initially, this can save you time as you run them repeatedly. Scripts can also address the challenge of reproducibility: if you need to repeat an analysis, you retain a record of your command history within the script.
One thing we will commonly want to do with sequencing results is pull out bad reads and write them to a file to see if we can figure out what's going on with them. We're going to look for reads with long sequences of N's like we did before, but now we're going to write a script, so we can run it each time we get new sequences, rather than type the code in by hand each time.
We're going to create a new file to put this command in. We'll call it `bad-reads-script.sh`. The `sh` isn't required, but using that extension tells us that it's a shell script.
```{zsh, eval=FALSE, engine="sh"}
cd ~/shell_data/untrimmed_fastq
nano bad-reads-script.sh
```
Bad reads have a lot of N's, so we're going to look for `NNNNNNNNNN` with `grep`. We want the whole FASTQ record, so we're also going to get the one line above the sequence and the two lines below. We also want to look in all the files that end with `.fastq`, so we're going to use the `*` wildcard.
```{zsh, eval=FALSE, engine="sh"}
grep -B1 -A2 -h NNNNNNNNNN *.fastq | grep -v '^--' > scripted_bad_reads.txt
```
## Custom `grep` control
We introduced the `-v` option in [the previous episode](http://www.datacarpentry.org/shell-genomics/04-redirection/), now we
are using `-h` to "Suppress the prefixing of file names on output" according to the documentation shown by `man grep`.
Type your `grep` command into the file and save it as before. Be careful that you did not add the `$` at the beginning of the line.
Now comes the neat part. We can run this script. Type:
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
bash bad-reads-script.sh
```
It will look like nothing happened, but now if you look at `scripted_bad_reads.txt`, you can see that there are now reads in the file.
## Exercise
We want the script to tell us when it's done.
1. Open `bad-reads-script.sh` and add the line `echo "Script finished!"` after the `grep` command and save the file.
2. Run the updated script.
## Solution
## Making the script into a program
We had to type `zsh` because we needed to tell the computer what program to use to run this script. Instead, we can turn this script into its own program. We need to tell the computer that this script is a program by making the script file executable. We can do this by changing the file permissions. We talked about permissions in [an earlier episode](http://www.datacarpentry.org/shell-genomics/03-working-with-files/).
First, let's look at the current permissions.
```{zsh, eval=FALSE, engine="sh"}
cd ~/shell_data/untrimmed_fastq
ls -l bad-reads-script.sh
```
We see that it says `-rw-r--r--`. This shows that the file can be read by any user and written to by the file owner (you). We want to change these permissions so that the file can be executed as a program. We use the command `chmod` like we did earlier when we removed write permissions. Here we are adding (`+`) executable permissions (`+x`).
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
chmod +x bad-reads-script.sh
```
Now let's look at the permissions again.
```{zsh, engine.opts="-i", include=FALSE, echo = FALSE}
cd ~/shell_data/untrimmed_fastq
ls -lah bad-reads-script.sh
```
Now we see that it says `-rwxr-xr-x`. The `x`'s that are there now tell us we can run it as a program. So, let's try it! We'll need to put `./` at the beginning so the computer knows to look here in this directory for the program.
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
./bad-reads-script.sh
```
The script should run the same way as before, but now we've created our very own computer program!
You will learn more about writing scripts in [a later lesson](https://datacarpentry.org/wrangling-genomics/05-automation/index.html).
## Moving and Downloading Data
So far, we've worked with data that is pre-loaded on the instance in the cloud. Usually, however,
most analyses begin with moving data onto the instance. Below we'll show you some commands to
download data onto your instance, or to move data between your computer and the cloud.
### Getting data from the cloud
There are two programs that will download data from a remote server to your local
(or remote) machine: `wget` and `curl`. They were designed to do slightly different
tasks by default, so you'll need to give the programs somewhat different options to get
the same behaviour, but they are mostly interchangeable.
- `wget` is short for "world wide web get", and it's basic function is to *download*
web pages or data at a web address.
- `cURL` is a pun, it is supposed to be read as "see URL", so its basic function is
to *display* webpages or data at a web address.
Which one you need to use mostly depends on your operating system, as most computers will
only have one or the other installed by default.
Let's say you want to download some data from Ensembl. We're going to download a very small
tab-delimited file that just tells us what data is available on the Ensembl bacteria server.
Before we can start our download, we need to know whether we're using `curl` or `wget`.
To see which program you have, type:
```{zsh, eval=FALSE, engine="sh"}
cd ~/shell_data/untrimmed_fastq
which curl
which wget
```
`which` is a ZSH program that looks through everything you have
installed, and tells you what folder it is installed to. If it can't
find the program you asked for, it returns nothing, i.e. gives you no
results.
On Mac OSX, you'll likely get the following output:
```{zsh, engine.opts="-i"}
which curl
```
```{zsh, eval=FALSE, engine="sh"}
which wget
```
Once you know whether you have `curl` or `wget`, use one of the
following commands to download the file:
**Windows**
```{zsh, eval=FALSE, engine="sh"}
cd ~/shell_data/untrimmed_fastq
wget ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt
```
or
**MacOS**
```{zsh, engine.opts="-i"}
cd ~/shell_data/untrimmed_fastq
curl -O ftp://ftp.ensemblgenomes.org/pub/release-37/bacteria/species_EnsemblBacteria.txt
```
Since we wanted to *download* the file rather than just view it, we used `wget` without
any modifiers. With `curl` however, we had to use the -O flag, which simultaneously tells `curl` to
download the page instead of showing it to us **and** specifies that it should save the
file using the same name it had on the server: species_EnsemblBacteria.txt
**Tip:** If you are looking for another (or any really) text file in your home directory to use instead, try:
find ~ -name *.txt