---
title: "taxizedb"
author: Scott Chamberlain
date: "2021-01-14"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{taxizedb}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```r
library("taxizedb")
library("dplyr")
```
## Download DBs
ITIS
```r
db_download_itis()
```
The Plant List (TPL)
```r
db_download_tpl()
```
Catalogue of Life (COL)
```r
db_download_col()
```
## Connect to the DBs
By default `src_*` functions use a path to the cached database file.
You can alternatively pass in your own path if you've put it
somewhere else.
ITIS
```r
src_itis <- src_itis()
```
TPL
```r
src_tpl <- src_tpl()
```
COL
```r
src_col <- src_col()
```
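If the database file is stored outside the default cache, a connection can be opened roughly as follows. This is a minimal sketch: the file location `my_itis_path` is a hypothetical placeholder, and it assumes that `src_itis()` accepts the file location through a `path` argument and that `db_path()` reports the default cache location.

```r
library("taxizedb")

# Hypothetical location: replace with wherever your ITIS SQLite file lives
my_itis_path <- "/path/to/ITIS.sqlite"

# Default cache location that src_itis() would otherwise use
# (assumes db_path() is available in your taxizedb version)
db_path("itis")

# Connect only if the file is actually there
if (file.exists(my_itis_path)) {
  itis_custom <- src_itis(path = my_itis_path)
}
```

The same pattern should apply to `src_tpl()` and `src_col()` with their own database files.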
## Query with SQL syntax
```r
sql_collect(src_itis, "select * from hierarchy limit 5")
#> # A tibble: 5 x 5
#> hierarchy_string tsn parent_tsn level childrencount
#> * <chr> <int> <int> <int> <int>
#> 1 202422 202422 0 0 154282
#> 2 202422-846491 846491 202422 1 2666
#> 3 202422-846491-660046 660046 846491 2 2654
#> 4 202422-846491-660046-846497 846497 660046 3 7
#> 5 202422-846491-660046-846497-846508 846508 846497 4 6
```
```r
# or pipe the src to sql_collect
src_itis %>% sql_collect("select * from hierarchy limit 5")
```
## Use dplyr verbs
Get a `tbl`
```r
hiers <- src_itis %>% tbl("hierarchy")
hiers
#> # Source: table<hierarchy> [?? x 5]
#> # Database: postgres 9.6.0 [sacmac@localhost:5432/ITIS]
#> hierarchy_string tsn parent_tsn level childrencount
#> <chr> <int> <int> <int> <int>
#> 1 202422 202422 0 0 154282
#> 2 202422-846491 846491 202422 1 2666
#> 3 202422-846491-660046 660046 846491 2 2654
#> 4 202422-846491-660046-846497 846497 660046 3 7
#> 5 202422-846491-660046-846497-846508 846508 846497 4 6
#> 6 202422-846491-660046-846497-846508-846553 846553 846508 5 5
#> 7 202422-846491-660046-846497-846508-846553-954935 954935 846553 6 3
#> 8 202422-846491-660046-846497-846508-846553-954935-5549 5549 954935 7 2
#> 9 202422-846491-660046-846497-846508-846553-954935-5549-5550 5550 5549 8 0
#> 10 202422-846491-660046-846497-846508-846553-954936 954936 846553 6 0
#> # ... with more rows
```
Select certain fields
```r
hiers %>% select(TSN, level)
#> # Source: lazy query [?? x 2]
#> # Database: postgres 9.6.0 [sacmac@localhost:5432/ITIS]
#> tsn level
#> <int> <int>
#> 1 202422 0
#> 2 846491 1
#> 3 660046 2
#> 4 846497 3
#> 5 846508 4
#> 6 846553 5
#> 7 954935 6
#> 8 5549 7
#> 9 5550 8
#> 10 954936 6
#> # ... with more rows
```
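Other dplyr verbs can be chained in the same way. The sketch below counts the rows of the `hierarchy` table at each `level`. It assumes the ITIS database has already been downloaded, and it uses only the `level` column because the case of the other column names may differ between database versions. The query stays lazy until `collect()` is called, so the counting happens in the database rather than in R.

```r
library("taxizedb")
library("dplyr")

# Assumes db_download_itis() has already been run
itis <- src_itis()
hiers <- tbl(itis, "hierarchy")

# Build the query lazily: number of taxa at each depth of the hierarchy
level_query <- hiers %>%
  count(level) %>%
  arrange(level)

# Inspect the SQL that dplyr generated
show_query(level_query)

# Run the query and pull the result into R as a tibble
level_counts <- collect(level_query)
level_counts
```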
## Local versions of `taxize` functions
A few of the key functions from `taxize` have been ported to `taxizedb`.
Support is currently limited to the NCBI taxonomy database.
`children` returns the taxa immediately below a given taxon:
```r
children(3701, db='ncbi')
#> $`3701`
#> childtaxa_id childtaxa_name childtaxa_rank
#> 1 1837063 Arabidopsis thaliana x Arabidopsis halleri species
#> 2 1547872 Arabidopsis umezawana species
#> 3 1328956 (Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica species
#> 4 1240361 Arabidopsis thaliana x Arabidopsis arenosa species
#> 5 869750 Arabidopsis thaliana x Arabidopsis lyrata species
#> 6 412662 Arabidopsis pedemontana species
#> 7 378006 Arabidopsis arenosa x Arabidopsis thaliana species
#> 8 347883 Arabidopsis arenicola species
#> 9 302551 Arabidopsis petrogena species
#> 10 97980 Arabidopsis croatica species
#> 11 97979 Arabidopsis cebennensis species
#> 12 81970 Arabidopsis halleri species
#> 13 59690 Arabidopsis kamchatica species
#> 14 59689 Arabidopsis lyrata species
#> 15 45251 Arabidopsis neglecta species
#> 16 45249 Arabidopsis suecica species
#> 17 38785 Arabidopsis arenosa species
#> 18 3702 Arabidopsis thaliana species
#>
#> attr(,"class")
#> [1] "children"
#> attr(,"db")
#> [1] "ncbi"
```
`classification` finds the lineage of a taxon:
```r
classification(3702, db='ncbi')
#> $`3702`
#> name rank id
#> 1 cellular organisms no rank 131567
#> 2 Eukaryota superkingdom 2759
#> 3 Viridiplantae kingdom 33090
#> 4 Streptophyta phylum 35493
#> 5 Streptophytina subphylum 131221
#> 6 Embryophyta no rank 3193
#> 7 Tracheophyta no rank 58023
#> 8 Euphyllophyta no rank 78536
#> 9 Spermatophyta no rank 58024
#> 10 Magnoliophyta no rank 3398
#> 11 Mesangiospermae no rank 1437183
#> 12 eudicotyledons no rank 71240
#> 13 Gunneridae no rank 91827
#> 14 Pentapetalae no rank 1437201
#> 15 rosids subclass 71275
#> 16 malvids no rank 91836
#> 17 Brassicales order 3699
#> 18 Brassicaceae family 3700
#> 19 Camelineae tribe 980083
#> 20 Arabidopsis genus 3701
#> 21 Arabidopsis thaliana species 3702
#>
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "ncbi"
```
`downstream` finds all taxa descending from a taxon:
```r
downstream(3700, db='ncbi')
#> $`3700`
#> childtaxa_id childtaxa_name rank
#> 1 2071891 Draba taylorii species
#> 2 2071524 Rorippa tenerrima species
#> 3 2071523 Rorippa crystallina species
#> 4 2071509 Physaria calderi species
#> 5 2071468 Erysimum arenicola species
#> 6 2071452 Draba yukonensis species
#> 7 2071451 Draba thompsonii species
#> ...
#> 326 1492251 Erysimum lilacinum species
#> 327 1492250 Erysimum leucanthemum species
#> 328 1492249 Erysimum leptostylum species
#> 329 1492248 Erysimum leptophyllum species
#> 330 1492247 Erysimum leptocarpum species
#> 331 1492246 Erysimum ledebourii species
#> 332 1492245 Erysimum laxiflorum species
#> 333 1492244 Erysimum kurdicum species
#> [ reached getOption("max.print") -- omitted 2880 rows ]
#>
#> attr(,"class")
#> [1] "downstream"
#> attr(,"db")
#> [1] "ncbi"
```
These functions query the local database rather than a web API, so they run quickly. It takes only a few seconds to find all bacterial taxa (138,695 in the output below) and count them by rank:
```r
downstream(2, db='ncbi')[[1]] %>%
  dplyr::group_by(rank) %>%
  dplyr::count()
#> # A tibble: 18 x 2
#> # Groups: rank [18]
#> rank n
#> <chr> <int>
#> 1 class 83
#> 2 family 483
#> 3 forma 4
#> 4 genus 3497
#> 5 no rank 37140
#> 6 order 198
#> 7 phylum 134
#> 8 species 97031
#> 9 species group 68
#> 10 species subgroup 10
#> 11 subclass 3
#> 12 subfamily 1
#> 13 subgenus 1
#> 14 suborder 8
#> 15 subphylum 1
#> 16 subspecies 10
#> 17 tribe 2
#> 18 varietas 21
```
## Mapping functions
Several mapping functions are available for the NCBI taxonomy database:
```r
# Map scientific or common names to taxonomy IDs
name2taxid("pig")
#> [1] "9823"
# Map taxonomy IDs to scientific names
taxid2name(9823)
#> [1] "Sus scrofa"
# Map taxonomy IDs to rank
taxid2rank(2)
#> [1] "superkingdom"
```
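The mapping functions combine naturally with the functions in the previous section. As a minimal sketch, assuming the NCBI database is already cached locally (for example via `db_download_ncbi()`), a common name can be resolved to an ID and then passed to `classification`. The choice of ranks to keep is only an illustration.

```r
library("taxizedb")

# Resolve a common name to an NCBI taxonomy ID ("9823" in the example above)
pig_id <- name2taxid("pig")

# classification() returns a list with one data frame per ID
lineage <- classification(pig_id, db = "ncbi")[[1]]

# Keep only a few ranks of interest from the lineage
lineage[lineage$rank %in% c("order", "family", "genus", "species"), ]
```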