-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03-bioconductor.Rmd
387 lines (281 loc) · 15.1 KB
/
03-bioconductor.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
```{r, eval=FALSE}
install.packages("knitr")
```
```{r, echo=FALSE}
library(knitr)
knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 40), tidy = TRUE)
```
# R Bioconductor
This Rmd file will walk you through the instructions for installing all of the packages you will need for the class. If you have a lot of trouble getting through this file successfully, I would recommend just installing the libraries as you need them. Rmarkdown code chunks for installing libraries with eval set to FALSE will be included for that purpose. R will usually tell you if you don't have a package necessary to run an Rmd file. The Rmd file for this lesson is available [here](https://github.com/gurinina/omic_sciences/blob/main/02-bioconducter.Rmd).
## Why R Bioconductor?
The majority of packages that we will install are part of R Bioconductor. R Bioconductor is based on [R](www.r-project.org).
The Bioconductor project provides access to powerful statistical and graphical methods for the analysis of genomic data. Analysis packages address workflows for analysis of oligonucleotide arrays, sequence analysis, flow cytometry. and other high-throughput genomic data.Three key reasons for using are:
- R is used by many statisticians and biostatisticians to create algorithms that advance our ability to understand complex experimental data.
- R is highly interoperable, and fosters reuse of software components written in other languages.
- R is portable to the key operating systems running on commodity computing equipment (Linux, MacOSX, Windows) and can be used immediately by beginners with access to any of these platforms.
In summary, R's ease-of-use and central role in statistics and "data science" make it a natural choice for a tool-set for use by biologists and statisticians confronting genome-scale experimental data. Since the Bioconductor project's inception in 2001, it has kept pace with growing volumes
and complexity of data emerging in genome-scale biology.
### Functional object-oriented programming
R combines functional and object-oriented programming paradigms.^[[Chambers 2014](https://projecteuclid.org/euclid.ss/1408368569)]
- In functional programming, notation and program activity mimic the
concept of function in mathematics. For example
```{r}
square = function(x) x^2
```
is valid R code that defines the symbol `square` as a function that
computes the second power of its input. The body of the function
is the program code `x^2`, in which `x` is a "free variable".
Once `square` has been defined in this way, `square(3)` has
value `9`.
```{r}
square(3)
```
We say the `square` function has been evaluated on
argument `3`. **In R, all computations proceed by evaluation of functions.**
## Installing Bioconductor
In order to install Bioconductor, copy the following two lines into your R console.
```{r, eval = FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
```
This will install the core Bioconductor packages. Further packages can be installed using the `BiocManager::install` function and specifying a character vector of which packages to install.
More information on installing and updating Bioconductor packages can be found at:
http://bioconductor.org/install/
88888888888888888888888888888888888888888888888888888888888
## Required packages
All of the pkgs required for the course are in a file called "_R_package_list.txt".
Again, you want to set eval = TRUE for the next six chunks here on first run:
```{r, eval = FALSE}
pkgs= readLines("_R_package_list.txt")
```
This is a list of all the pkgs you should need.
This next line compares this to all the packages you already have,
and subtracts them into the variable need.
```{r, eval = FALSE}
int <- installed.packages()[, 1]
need <- setdiff(pkgs, int)
git = readLines("_R_package_github.txt")
```
The required packages come in two flavors: those that are available through
Bioconductor, and those that are available through a github website.
I've written these out into two different textfiles: R_packages_github.txt and
_R_package_list.txt. To get the ones available through Bioconductor, we substract
git from need, and these should be all the one we need from Bioconductor,
and this will vary from person to person.
```{r, eval = FALSE}
# installing the R packages required for class after
# installing the latest versions of R, Rstudio
# and Bioconductor
need = setdiff(need,git)
```
Before you tell `BiocManager` to install these packages, check to make sure that there aren't any typos and that the formatting looks
correct.
```{r,eval=FALSE}
need
BiocManager::install(need)
```
Then to install the git packages:
```{r, eval = FALSE}
git = paste0("genomicsclass/",git)
devtools::install_github(git)
```
One final package that needs installation for the course is the following:
```{r, eval = FALSE}
devtools::install_github("gurinina/GOenrichment")
```
All of your needs should be met by need and git, i.e., you should have no "needs"
leftover! The `::` is a shortcut when you want to avoid loading a library, or when
you have a potential function conflict with another library that is loaded.
If that all went successfully you are in great shape! If you want to be ambitious you can try publishing the website for the book yourself.
```{r bookdown, eval = FALSE}
library(bookdown)
bookdown::render_book("index.Rmd")
```
If this process is successful, you can try publishing the book. First you need to create a [bookdown](https://bookdown.org/connect/) account. Then to publish to this account, simply run the following chunk:
```{r publish, eval = FALSE}
publish_book(account = "ggiaever")
```
R will walk you through setting up your account and you will have your very own copy of the course!
88888888888888888888888888888888888888888888888
## Why Bioconductor cont'd
Let's load the packages we need for this lesson.
```{r,echo=FALSE}
library(Biobase)
library(ph525x)
library(png)
```
The broad goals of the Bioconductor project are:
To provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
To facilitate the inclusion of biological metadata in the analysis of genomic data, e.g. literature data from PubMed, annotation data from Entrez genes.
To provide a common software platform that enables the rapid development and deployment of extensible, scalable, and interoperable software.
To further scientific understanding by producing high-quality documentation and reproducible research.
To train researchers on computational and statistical methods for the analysis of genomic data.
Pros
A powerful tool for Bioinformatics
1. free and open software
2. used for over two decades to analyze by genomic data
3. often don't need to write code de novo
4. huge user community (important for learning/help)
5. publication quality graphics
6. reproducible results
7. modularized and standardized
8. gene expression, Chip-seq, scRNA-seq,genetic screens, interaction networks and much much more
Cons
1. steep learning curve
2. no single 'right way' to do anything
3. historically, R was lacking in terms of user interface
### R packages, modularity, continuous integration
**Package structure**
We can perform object-oriented functional programming with R
by writing R code. A basic approach is to create "scripts" that
define all the steps underlying processes of data import and
analysis. When scripts are written in such a way that they
only define functions and data structures, it becomes possible to
*package* them for convenient distribution to other users
confronting similar data management and data analysis problems.
The R software [packaging protocol](https://cran.r-project.org/doc/manuals/r-release/R-exts.html)
specifies how source code in R and other languages can be organized
together with metadata and documentation to foster
convenient testing and redistribution. For example, an early
version of the package defining this document had the folder
layout given below:
```
├── DESCRIPTION (text file with metadata on provenance, licensing)
├── NAMESPACE (text file defining imports and exports)
├── R (folder for R source code)
├── README.md (optional for github face page)
├── data (folder for exemplary data)
├── man (folder for detailed documentation)
├── tests (folder for formal software testing code)
└── vignettes (folder for high-level documentation)
├── biocOv1.Rmd
├── biocOv1.html
```
The packaging protocol document "Writing R Extensions" provides
full details. The R command `R CMD build [foldername]` will operate on the
contents of a package folder to create an archive that can
be added to an R installation using `R CMD INSTALL [archivename]`.
The R studio system performs these tasks with GUI elements.
**Modularity and formal interdependence of packages**
The packaging protocol helps us to isolate software that
performs a limited set of operations, and to
identify the **version** of a program collection
that is inherently changing over time. There is
no objective way to determine whether
a set of operations is the right size for packaging.
Some very useful packages carry out only a small number of
tasks, while others have very broad scope. What is important
is that the package concept permits **modularization** of
software. This is important in two dimensions: scope and time.
Modularization of scope is important to allow parallel independent
development of software tools that address distinct problems.
Modularization in time is important to allow identification of
versions of software whose behavior is stable.
**Continuous integration: testing package correctness and interoperability**
The figure below is a snapshot of the [build report](http://bioconductor.org/checkResults/3.6/bioc-LATEST/) for the development branch of Bioconductor.
```{r lkci,fig=TRUE,echo=FALSE,fig.wide=TRUE, fig.cap="Continuous integration report for Bioconductor devel branch. All packages are checked every night on three major computing platforms."}
buildRep()
```
The six-column subtable in the upper half of the display
includes a column "Installed pkgs", with entry 1857 for
the linux platform. This number varies between platforms
and is generally increasing over time for the devel branch.
### Putting it together
Bioconductor's core developer group works hard to develop
data structures that allow users to work conveniently with
genomes and genome-scale data. Structures are devised to
support the main phases of experimentation in genome scale biology:
- Parse large-scale assay data as produced by microarray or sequencer flow-cell scanners.
- Preprocess the (relatively) raw data to support reliable statistical interpretation.
- Combine assay quantifications with sample-level data to test hypotheses about relationships between molecular processes and organism-level characteristics such as growth, disease state.
In this course we will review the objects and functions that
you can use to perform these and related tasks in your own
research.
**Example**
- In object-oriented programming, a strong focus is placed upon
formalizing data structure, and defining methods that take
advantage of guarantees established through the formalism. This
approach is quite natural but did not get well-established in
practical computer programming until the 1990s. As an
advanced example with Bioconductor, we will consider an
approach to defining an "object" representing on the genome
of *Homo sapiens*:
```{r lkho}
library(Homo.sapiens)
class(Homo.sapiens)
methods(class=class(Homo.sapiens))
```
We say that `Homo.sapiens` is an **instance** of the `OrganismDb`
**class**. Every instance of this class will respond meaningfully
to the methods
listed above. Each method is implemented as an R function.
What the function does depends upon the class of its arguments.
Of special note at this juncture are the methods
`genes`, `exons`, `transcripts` which will yield information about
fundamental components of genomes.
These methods will succeed for human and
for other model organisms such as *Mus musculus*, *S. cerevisiae*,
*C. elegans*, and others for which the Bioconductor project and its
contributors have defined `OrganismDb` representations.
88888888888888888888888888888888888888888888888888888888888
### Finding help
There are many ways to find help directly from within R. Typically, every function will have its own manual page which is accessible by typing a question mark ?, followed by the function name and hitting return.
```{r, eval=TRUE, tidy=FALSE}
?mean
?mad
example(mad)
example(boxplot)
```
The manual page contains a **description**, example **usage**, explanation of all **arguments**, further **details**, explanation of the returned **value**, **references**, **see also** linking to other functions, and **examples**.
If you have a question about a particular object in R, you might want to look up the help for the "class" of that object, which will tell you how to construct such an object and what methods are available for manipulating such objects. For example, we can find the name of the class of an object and look up help:
```{r, tidy=FALSE}
class(6)
?numeric
?"numeric-class"
```
Sometimes, the constructor function and the class name will point to the same help page, although this is not necessarily true for all packages.
```{r, tidy=FALSE}
library(Biobase)
?ExpressionSet
?"ExpressionSet-class"
```
A quick way to find out what methods are available for a given class:
```{r, eval=TRUE}
methods(class="ExpressionSet")
methods(class="lm")
```
Sometimes finding the methods available for a class is so useful that I wrote a couple of functions that I use all the time to access data when I get confused about objects:
```{r}
mypackclass = function(package,class){
int = intersect(sapply(strsplit(as.character(methods(class=class)), ","),
`[`, 1), ls(paste("package",package,sep = ":")))
int
}
mypackmeth = function(package,class){
func = sub("Function: (.*) \\(package .*\\)","\\1",grep("Function",showMethods(classes=class,
where=getNamespace(package), printTo=FALSE), value=TRUE))
func
}
```
These take as arguments the name of the package and the class, and output the
methods or functions available to access or act on those objects.
A quick way to look up functions in a given package is to write out the package name, two ":" symbols and then trying tab-completion to get a list of functions, exported or not OR use ls followed by the name of the package:
```{r, eval=FALSE, tidy=FALSE}
library(geneplotter)
geneplotter::
ls("package:geneplotter")
```
### Source code
You can find the source code for many functions by typing out the name of the function with () and pressing enter.
```{r}
mad
```
You might have to specify a particular class if you want source code for a method:
```{r}
library(DESeq2)
plotMA
showMethods("plotMA")
getMethod("plotMA","data.frame")
```
IF typing the package::function doesn't work, the other often more reliable way is to look up the function online by googling the package name, function and the website rdrr.io. This site contains the source function of packages.