---
title: "A Simple Workflow"
author:
  - name: "Nora Schlenker"
    institute: [uwiscgeog]
    correspondence: false
    email: [email protected]
    orcid_id: 0000-0002-3693-5946
  - name: "Simon Goring"
    institute: [uwiscgeog,uwiscdsi]
    correspondence: true
    email: [email protected]
    orcid_id: 0000-0002-2700-4605
  - name: "Socorro Dominguez Vidaña"
    institute: [htdata]
    correspondence: false
    email: [email protected]
    orcid_id: 0000-0002-7926-4935
institute:
  - htdata:
      name: "HT Data"
  - uwiscgeog:
      name: "University of Wisconsin -- Madison: Department of Geography"
  - uwiscdsi:
      name: "University of Wisconsin -- Data Science Institute"
date: "`r Sys.Date()`"
output:
  html_document:
    code_folding: show
    fig_caption: yes
    keep_md: yes
    self_contained: yes
    theme: readable
    toc: yes
    toc_float: yes
    css: "text.css"
  pdf_document:
    pandoc_args: "-V geometry:vmargin=1in -V geometry:hmargin=1in"
csl: 'https://bit.ly/3khj0ZL'
---
```{r setup, echo=FALSE}
options(warn = -1)
pacman::p_load(neotoma2, dplyr, ggplot2, sf, geojsonsf, leaflet, DT, readr, stringr, rioja, tidyr)
```
## Introduction
This document is a primer for the new Neotoma R package, `neotoma2`, and is the companion to the [*Introduction to Neotoma* presentation](https://docs.google.com/presentation/d/1Fwp5yMAvIdgYpiC04xhgV7OQZ-olZIcUiGnLfyLPUt4/edit?usp=sharing). Some users may be working with this document as part of a workshop for which there is a Binder instance. The Binder instance will run RStudio in your browser, with all the required packages installed.

If you are using this workflow on its own, or want to use the package directly, [the `neotoma2` package](https://github.com/NeotomaDB/neotoma2) is available on CRAN by running:
```r
install.packages('neotoma2')
library(neotoma2)
```
Your version should be at or above `r packageVersion("neotoma2")`.
This workshop will also require other packages. To maintain the flow of this document we've placed instructions at the end of the document in the section labelled "[Installing packages on your own](#localinstall)". Please install these packages, and make sure they are at the latest version.
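
As a quick check before starting, a short sketch like the following lists any packages that are not yet installed. It uses only base R. The object names (`needed`, `installed`, `missing_pkgs`) are our own, and the package list simply mirrors the `p_load()` call in the setup chunk of this document.

```r
# Packages loaded in the setup chunk of this document.
needed <- c("neotoma2", "dplyr", "ggplot2", "sf", "geojsonsf", "leaflet",
            "DT", "readr", "stringr", "rioja", "tidyr")

# requireNamespace() returns FALSE (rather than stopping) when a package
# is not installed.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need installing (empty if none).
missing_pkgs <- needed[!installed]
missing_pkgs
```

Anything listed in `missing_pkgs` can then be installed following the instructions in the section linked above.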
## Learning Goals
In this tutorial you will learn how to:
1. [Site Searches](#3-site-searches): Search for sites using site names and geographic parameters.
2. [Filter Results](#33-filter-records-tabset): Filter results using temporal and spatial parameters.
3. [Explore Data](#34-pulling-in-sample-data): Obtain sample information for the selected datasets.
4. [Visualize Data](#4-simple-analytics): Perform basic stratigraphic plotting.
## Background
### Getting Help with Neotoma
If you're planning on working with Neotoma, please join us on [Slack](https://join.slack.com/t/neotomadb/shared_invite/zt-cvsv53ep-wjGeCTkq7IhP6eUNA9NxYQ) where we manage a channel specifically for questions about the R package (the *#it_r* channel, or *#it_r_es* for R help in Spanish and *#it_r_jp* in Japanese). You may also wish to join the Neotoma community through our Google Groups mailing lists; please [see the information on our website](https://www.neotomadb.org/about/join-the-neotoma-community) to be added.
### Understanding Data Structures in Neotoma
Data in the Neotoma database itself is structured as a set of linked relationships to express different elements of paleoecological analysis:

* space and time
    * Where is a sample located in latitude and longitude?
    * Where is a sample along a depth profile?
    * What is the estimated age of that sample?
    * What is the recorded age of elements within or adjacent to the sample?
* observations
    * What is being counted or measured?
    * What units are being used?
    * Who observed it?
* scientific methods
    * What statistical model was used to calculate age?
    * What uncertainty terms are used in describing an observation?
* conceptual data models
    * How do observations in one sample relate to other samples within the same collection?
    * How does an observation of a fossil relate to extant or extinct relatives?

These relationships can be complex because paleoecology is a broad and evolving discipline. As such, the database itself is highly structured, and normalized, to allow new relationships and facts to be added, while maintaining a stable central data model. If you want to better understand concepts within the database, you can read the [Neotoma Database Manual](https://open.neotomadb.org/manual), or take a look at [the database schema itself](https://open.neotomadb.org/dbschema).
In this workshop we want to highlight two key structural concepts:
1. The way data is structured conceptually within Neotoma (Sites, Collection Units and Datasets).
2. The way that this structure is adapted within the `neotoma2` R package.
#### Data Structure in the Neotoma Database
![**Figure**. *The structure of sites, collection units, samples, and datasets within Neotoma. A site contains one or more collection units. Chronologies are associated with collection units. Samples with data of a common type (pollen, diatoms, vertebrate fauna) are assigned to a dataset.*](images/site_collunit_dataset_rev.png){width=50%}
Data in Neotoma is associated with **sites** -- specific locations with latitude and longitude coordinates.
Within a **site**, there may be one or more [**collection units**](https://open.neotomadb.org/manual/dataset-collection-related-tables-1.html#CollectionUnits) -- locations at which samples are physically collected within the site:
* An archaeological **site** may have one or more **collection units** -- pits within a broader dig site.
* A pollen sampling **site** on a lake may have multiple **collection units** -- core sites within the lake basin.
* A bog sample **site** may have multiple **collection units** -- a transect of surface samples within the bog.
Collection units may have higher resolution GPS locations than the site location, but are considered to be part of the broader site.
Data within a **collection unit** is collected at various [**analysis units**](https://open.neotomadb.org/manual/sample-related-tables-1.html#AnalysisUnits).
* All sediment at 10cm depth in the depth profile of a cutbank (the collection unit) along an oxbow lake (the site) is one analysis unit.
* All material in a single surface sample (the collection unit) from a bog (the site) is an analysis unit.
* All fossil remains in a buried layer from a bone pile (the collection unit) in a cave (the site) make up one analysis unit.
Any data sampled within an analysis unit is grouped by the dataset type (charcoal, diatom, dinoflagellate, etc.) and aggregated into a [**sample**](https://open.neotomadb.org/manual/sample-related-tables-1.html#Samples). The set of samples for a collection unit of a particular dataset type is then assigned to a [**dataset**](https://open.neotomadb.org/manual/dataset-collection-related-tables-1.html#Datasets).
* A sample would be all diatoms (the dataset type) extracted from sediment at 12cm (the analysis unit) in a core (the collection unit) obtained from a lake (the site).
* A sample would be the record of a single mammoth bone (sample and analysis unit, dataset type is vertebrate fauna) embedded in a riverbank (here the site, and collection unit).
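
To make the hierarchy concrete, the sketch below represents one made-up site as nested R lists. All names and values are hypothetical, and this is not how `neotoma2` or the database stores data internally. It only illustrates that samples of a common type within a collection unit are grouped into a dataset.

```r
# A hypothetical lake site with a single core (the collection unit).
site <- list(
  sitename = "Example Lake",
  lat = 56.1, long = 8.6,
  collunits = list(
    list(
      handle = "CORE1",
      # Each row is a sample: an analysis unit (a depth in the core)
      # paired with a dataset type.
      samples = data.frame(
        depth = c(10, 10, 12, 12),
        datasettype = c("pollen", "diatom", "pollen", "diatom"),
        stringsAsFactors = FALSE
      )
    )
  )
)

# Samples of a common type within a collection unit form one dataset.
core <- site$collunits[[1]]
datasets <- split(core$samples, core$samples$datasettype)
names(datasets)
sapply(datasets, nrow)
```

Here the two analysis units (10cm and 12cm) each contribute one pollen sample and one diatom sample, so the core ends up with two datasets of two samples each.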
#### Data Structures in `neotoma2` {#222-data-structures-in-neotoma2}
![**Figure**. *Neotoma R Package UML diagram. Each box represents a data class within the package. Individual boxes show the class object, its name, its properties, and functions that can be applied to those objects. For example, a `sites` object has a property `sites`, that is a list. The function `plotLeaflet()` can be used on a `sites` object.*](images/neotomaUML_as.svg)
If we look at the [UML diagram](https://en.wikipedia.org/wiki/Unified_Modeling_Language) for the objects in the `neotoma2` R package we can see that the data structure generally mimics the structure within the database itself. As we will see in the [Site Searches section](#3-site-searches), we can search for these objects, and begin to manipulate them (in the [Simple Analysis section](#4-simple-analytics)).
It is important to note: *within the `neotoma2` R package, most objects are `sites` objects; they just contain more or less data*. There is a set of functions that can operate on `sites`. As we add to `sites` objects, using `get_datasets()` or `get_downloads()`, we are able to use more of these helper functions.
## Site Searches
### `get_sites()`
There are several ways to find sites in `neotoma2`, but we think of `sites` as being spatial objects primarily. They have names, locations, and are found within the context of geopolitical units, but within the API and the package, the site itself does not have associated information about taxa, dataset types or ages. It is simply the container into which we add that information. So, when we search for sites we can search by:
| Parameter | Description |
| --------- | ----------- |
| sitename | A valid site name (case insensitive) using `%` as a wildcard. |
| siteid | A unique numeric site id from the Neotoma Database |
| loc | A bounding box vector, geoJSON or WKT string. |
| altmin | Lower altitude bound for sites. |
| altmax | Upper altitude bound for site locations. |
| database | The constituent database from which the records are pulled. |
| datasettype | The kind of dataset (see `get_table("datasettypes")`) |
| datasetid | Unique numeric dataset identifier in Neotoma |
| doi | A valid dataset DOI in Neotoma |
| gpid | A unique numeric identifier, or text string identifying a geopolitical unit in Neotoma |
| keywords | Unique sample keywords for records in Neotoma. |
| contacts | A name or numeric id for individuals associated with sites. |
| taxa | Unique numeric identifiers or taxon names associated with sites. |
All sites in Neotoma contain one or more datasets. It's worth noting that the results of these search parameters may be slightly unexpected. For example, searching for sites by sitename, latitude, or altitude will return all of the datasets for the particular site. Searching for terms such as datasettype, datasetid or taxa will return the site, but the only datasets returned will be those matching the dataset-specific search terms. We'll see this later.
#### Site names: `sitename="%ø%"` {.tabset}
We may know exactly what site we're looking for ("Lake Solsø"), or have an approximate guess for the site name (for example, we know it's something like "Solsø", but we're not sure how it was entered specifically), or we may want to search all sites that have a specific term, for example, *ø*.
We use the general format: `get_sites(sitename="%ø%")` for searching by name.
PostgreSQL (and the API) uses the percent sign as a wildcard. So `"%ø%"` would pick up ["Lake Solsø"](https://data.neotomadb.org/4445) for us (and picks up "Isbenttjønn" and "Lake Flåfattjønna"). Note that the search query is also case insensitive.
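
To get a feel for how the `%` wildcard behaves, the sketch below mimics the matching locally with a regular expression. This is only an approximation of the database's behaviour, written in base R. The helper `like_match()` and the site names are hypothetical and are not part of `neotoma2` or Neotoma.

```r
# Hypothetical helper: emulate a case-insensitive match in which `%`
# stands for any run of characters (including none).
like_match <- function(pattern, x) {
  # Turn `%` into the glob wildcard `*`, then into a regular expression.
  # Caveat: glob2rx() also treats `?` as a single-character wildcard.
  rx <- utils::glob2rx(gsub("%", "*", pattern, fixed = TRUE))
  grepl(rx, x, ignore.case = TRUE)
}

# Made-up site names for illustration.
site_names <- c("Lake Example", "Example Bog", "Another lake")

like_match("%lake%", site_names)  # wildcard on both sides
like_match("lake%", site_names)   # name must start with "lake"
```

With a wildcard on both sides the term can appear anywhere in the name, while dropping the leading `%` restricts matches to names that begin with the term.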
##### Code
```{r sitename, eval=FALSE}
denmark_sites <- neotoma2::get_sites(sitename = "%ø%")
plotLeaflet(denmark_sites)
```
##### Result
```{r sitenamePlot, echo=FALSE}
denmark_sites <- neotoma2::get_sites(sitename = "%ø%")
plotLeaflet(denmark_sites)
```
#### Location: `loc=c()` {.tabset}
The original `neotoma` package used a bounding box for locations, structured as a vector of latitude and longitude values: `c(xmin, ymin, xmax, ymax)`. The `neotoma2` R package supports both this simple bounding box and more complex spatial objects, using the [`sf` package](https://r-spatial.github.io/sf/). Using the `sf` package allows us to more easily work with raster and polygon data in R, and to select sites from more complex spatial objects. The `loc` parameter works with the simple vector, [WKT](https://arthur-e.github.io/Wicket/sandbox-gmaps3.html), [geoJSON](http://geojson.io/#map=2/20.0/0.0) objects and native `sf` objects in R.
As an example of searching for sites using a location, we've created a rough representation of Denmark as a polygon. To work with this spatial object in R we also transformed the `geoJSON` element to an object for the `sf` package. There are many other tools to work with spatial objects in R. Regardless of how you get the data into R, `neotoma2` works with almost all objects in the `sf` package.
```{r boundingBox}
geoJSON <- '{"coordinates": [[
    [7.92, 55.02],
    [12.42, 54.42],
    [12.86, 55.98],
    [12.41, 56.44],
    [10.63, 57.96],
    [7.74, 57.48],
    [7.70, 57.09],
    [7.92, 55.02]
  ]],
  "type": "Polygon"}'

denmark_sf <- geojsonsf::geojson_sf(geoJSON)

# Note here we use the `all_data` flag to capture all the sites within the polygon.
# We're using `all_data` here because we know that the site information is relatively small
# for Denmark. If we were working in a new area or with a new search we would limit the
# search size.
denmark_sites <- neotoma2::get_sites(loc = denmark_sf, all_data = TRUE)
```
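
Since `loc` also accepts a simple bounding box vector or a WKT string, the same polygon can be expressed in those forms. The sketch below rebuilds the polygon with `sf` directly so that it runs on its own. The object names (`coords`, `denmark_poly`, `bbox_vec`, `wkt`) are ours, and the coordinate reference system (EPSG:4326) is an assumption.

```r
library(sf)

# The same rough outline of Denmark, as a two-column (long, lat) matrix.
coords <- rbind(c(7.92, 55.02), c(12.42, 54.42), c(12.86, 55.98),
                c(12.41, 56.44), c(10.63, 57.96), c(7.74, 57.48),
                c(7.70, 57.09), c(7.92, 55.02))
denmark_poly <- st_sfc(st_polygon(list(coords)), crs = 4326)

# Bounding box in the order c(xmin, ymin, xmax, ymax).
bbox_vec <- as.numeric(st_bbox(denmark_poly))
bbox_vec

# Well-known text (WKT) representation of the polygon.
wkt <- st_as_text(denmark_poly)
wkt
```

Either `bbox_vec` or `wkt` could then be supplied as the `loc` argument in place of `denmark_sf`. Note that the bounding box covers a larger area than the polygon, so it may return additional sites.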
You can always simply `plot()` the `sites` objects, but you will lose some of the geographic context. The `plotLeaflet()` function returns a `leaflet()` map, and allows you to further customize it, or add additional spatial data (like our original bounding polygon, `denmark_sf`, which works directly with the R `leaflet` package):
##### Code
Note the use of the `%>%` pipe here. If you are not familiar with this symbol, check our ["Piping in R" section](#piping-in-r) of the Appendix.
```{r plotL, eval=FALSE}
neotoma2::plotLeaflet(denmark_sites) %>%
  leaflet::addPolygons(map = .,
                       data = denmark_sf,
                       color = "green")
```
##### Result
```{r plotLeaf, echo=FALSE}
neotoma2::plotLeaflet(denmark_sites) %>%
  leaflet::addPolygons(map = .,
                       data = denmark_sf,
                       color = "green")
```
#### `site` Object Helpers {.tabset}
If we look at the [data structure diagram](#222-data-structures-in-neotoma2) for the objects in the `neotoma2` R package we can see that there is a set of functions that can operate on `sites`. As we retrieve more information for `sites` objects, using `get_datasets()` or `get_downloads()`, we are able to use more of these helper functions.
As it is, we can take advantage of functions like `summary()` to get a more complete sense of the types of data we have in `denmark_sites`. The following code gives the summary table. We do some R magic here to change the way the data is displayed (turning it into a [`DT::datatable()`](https://rstudio.github.io/DT/) object), but the main piece is the `summary()` call.
##### Code
```{r summary_sites, eval=FALSE}
# Give information about the sites themselves, site names &cetera.
neotoma2::summary(denmark_sites)
# Give the unique identifiers for sites, collection units and datasets found at those sites.
neotoma2::getids(denmark_sites)
```
##### Result
```{r summarySitesTable, eval=TRUE, echo=FALSE}
neotoma2::summary(denmark_sites) %>%
  DT::datatable(data = ., rownames = FALSE,
                options = list(scrollX = "100%", dom = 't'))
```
In this document we list only the first 10 records (there are more; you can use `length(datasets(denmark_sites))` to see how many datasets you've got). We can see that there are no chronologies associated with the `sites` object. This is because, at present, we have not pulled in the `dataset` information we need. In Neotoma, a chronology is associated with a collection unit (and that metadata is pulled by `get_datasets()` or `get_downloads()`). All we know from `get_sites()` are the kinds of datasets we have and the location of the sites that contain the datasets.
### `get_datasets()` {.tabset}
Within Neotoma, collection units and datasets are contained within sites. Similarly, a `sites` object contains `collectionunits` which contain `datasets`. From the table above (Result tab in Section 3.1.3.2) we can see that some of the sites we've looked at contain pollen records, some contain geochronologic data and some contain other dataset types. We could write something like this: `table(summary(denmark_sites)$types)` to see the different datasettypes and their counts.
With a `sites` object we can directly call `get_datasets()` to pull in more metadata about the datasets. The `get_datasets()` method also supports any of the search terms listed above in the [Site Search](#3-site-searches) section. At any time we can use `datasets()` to get more information about any datasets that a `sites` object may contain. Compare the output of `datasets(denmark_sites)` to the output of a similar call using the following:
#### Code
```{r datasetsFromSites, eval=FALSE}
# This may be slow, because there are a lot of sites!
# denmark_datasets <- neotoma2::get_datasets(denmark_sites, all_data = TRUE)
denmark_datasets <- neotoma2::get_datasets(loc = denmark_sf, datasettype = "pollen", all_data = TRUE)
datasets(denmark_datasets)
```
#### Result
```{r datasetsFromSitesResult, echo=FALSE, message=FALSE}
denmark_datasets <- neotoma2::get_datasets(loc = denmark_sf, datasettype = "pollen", all_data = TRUE)
datasets(denmark_datasets) %>%
  as.data.frame() %>%
  DT::datatable(data = .,
                options = list(scrollX = "100%", dom = 't'))
```
You can see that this provides information only about the specific dataset, not the site! For a more complete record we can join the site information from `summary()` to the dataset information from `datasets()` by way of `getids()`, which links each site to all the collection units and datasets it contains.
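
The join itself can be sketched with small made-up tables. The three data frames below are hypothetical stand-ins for the site summary, the dataset table and the table of identifiers. Their column names (`siteid`, `collunitid`, `datasetid`) are assumptions about what the real outputs contain, so check the actual column names and types before joining.

```r
library(dplyr)

# Made-up stand-in for site-level information.
site_info <- data.frame(siteid = c(1, 2),
                        sitename = c("Site A", "Site B"),
                        stringsAsFactors = FALSE)

# Made-up stand-in for dataset-level information.
dataset_info <- data.frame(datasetid = c(10, 11, 12),
                           datasettype = c("pollen", "pollen", "geochronologic"),
                           stringsAsFactors = FALSE)

# Made-up stand-in for the identifier table linking the two.
ids <- data.frame(siteid = c(1, 1, 2),
                  collunitid = c(100, 100, 200),
                  datasetid = c(10, 12, 11))

# Use the identifier table as the bridge between sites and datasets.
joined <- ids %>%
  inner_join(site_info, by = "siteid") %>%
  inner_join(dataset_info, by = "datasetid")
joined
```

Each row of `joined` now carries both the site name and the dataset type, one row per dataset.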
### `filter()` Records {.tabset}
If we choose to pull in information about only a single dataset type, or if there is additional filtering we want to do before we download the data, we can use the `filter()` function. For example, if we only want sedimentary pollen records (as opposed to pollen surface samples), and want records with known chronologies, we can filter by `datasettype` and by the presence of an `age_range_young`, which would indicate that there is a chronology that defines bounds for ages within the record.
#### Code
```{r downloads, eval=FALSE}
denmark_records <- denmark_datasets %>%
  neotoma2::filter(!is.na(age_range_young))
neotoma2::summary(denmark_records)
# We've removed records, so the new object should be shorter than the original.
length(denmark_records) < length(denmark_datasets)
```
#### Result
```{r downloadsCode, echo = FALSE}
denmark_records <- denmark_datasets %>%
  neotoma2::filter(!is.na(age_range_young))
neotoma2::summary(denmark_records) %>%
  DT::datatable(data = .,
                options = list(scrollX = "100%", dom = 't'))
```
We can see now that the data table looks different (comparing it to the [table above](#322-result)), and there are fewer total sites. Again, there is no explicit chronology for these records; for that we need to pull down the complete download. Still, we begin to get a sense of what kind of data we have.
### Pulling in `sample()` data
Because sample data adds a lot of overhead (for this pollen data, the object that includes the dataset with samples is 20 times larger than the `dataset` alone), we try to call `get_downloads()` only after we've done our preliminary filtering. After `get_datasets()` you have enough information to filter based on location, time bounds and dataset type. When we move to `get_downloads()` we can do more fine-tuned filtering at the analysis unit or taxon level.
The following call can take some time, but we've frozen the object as an RDS data file. You can run this command on your own, and let it run for a bit, or you can just load the object in.
```{r taxa}
## This line is commented out because we've already run it for you.
## denmark_dl <- denmark_records %>% get_downloads(all_data = TRUE)
## saveRDS(denmark_dl, "data/dkDownload.RDS")
denmark_dl <- readRDS("data/dkDownload.RDS")
```
Once the download is complete, we have information for each site about all the associated collection units, the datasets, and, for each dataset, all the associated samples. To extract the samples from all downloads we can call:
```{r allSamples}
allSamp <- samples(denmark_dl)
```
When we've done this, we get a `data.frame` that is `r nrow(allSamp)` rows long and `r ncol(allSamp)` columns wide. The reason the table is so wide is that we are returning data in a **long** format. Each row contains all the information you should need to properly interpret it:
```{r colNamesAllSamp, echo = FALSE}
colnames(allSamp)
```
For some dataset types or analyses, some of these columns may not be needed; for other dataset types, however, they may be critically important. To allow the `neotoma2` package to be as useful as possible for the community we've included as many as we can.
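
If the difference between **long** and **wide** tables is unfamiliar, the sketch below reshapes a small made-up long table into a wide one with `tidyr::pivot_wider()`. The data are invented for illustration. Only the column names `sampleid`, `age`, `variablename` and `value` are borrowed from this document.

```r
library(tidyr)

# A made-up long table: one row per taxon per sample.
long_tbl <- data.frame(
  sampleid = c(1, 1, 2, 2),
  age = c(100, 100, 250, 250),
  variablename = c("Pinus", "Betula", "Pinus", "Betula"),
  value = c(12, 30, 20, 15),
  stringsAsFactors = FALSE
)

# The wide equivalent: one row per sample, one column per taxon.
wide_tbl <- pivot_wider(long_tbl,
                        id_cols = c(sampleid, age),
                        names_from = variablename,
                        values_from = value,
                        values_fill = 0)
wide_tbl
```

The long form keeps every observation self-describing, which is why each row repeats the sample-level information. The wide form is the shape that many plotting and ordination tools expect.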
#### Extracting Taxa {.tabset}
If you want to know what taxa we have in the record you can use the helper function `taxa()` on the sites object. The `taxa()` function gives us not only the unique taxa, but two additional columns -- `sites` and `samples` -- that tell us how many sites the taxa appear in, and how many samples the taxa appear in, to help us better understand how common individual taxa are.
##### Code
```{r taxa2, eval=FALSE}
neotomatx <- neotoma2::taxa(denmark_dl)
```
##### Results
```{r taxaprint, echo=FALSE, message=FALSE}
neotomatx <- neotoma2::taxa(denmark_dl)
head(neotomatx, n = 20) %>%
  DT::datatable(data = ., rownames = FALSE,
                options = list(scrollX = "100%", dom = 't'))
```
#### Understanding Taxonomies in Neotoma {-}
Taxonomies in Neotoma are not as straightforward as we might expect. Taxonomic identification in paleoecology can be complex, impacted by the morphology of the object we are trying to identify, the condition of the palynomorph, the expertise of the analyst, and many other conditions. You can read more about concepts of taxonomy within Neotoma in the Neotoma Manual's [section on Taxonomic concepts](https://open.neotomadb.org/manual/database-design-concepts.html#taxonomy-and-synonymy).
We use the unique identifiers (*e.g.*, `taxonid`, `siteid`, `analysisunitid`) throughout the package, since they help us to link between records. The `taxonid` values returned by the `taxa()` call can be linked to the `taxonid` column in the `samples()` table. This allows us to build taxon harmonization tables if we choose to. You may also note that the `taxonname` is in the field `variablename`. Individual sample counts are reported in Neotoma as [`variables`](https://open.neotomadb.org/manual/taxonomy-related-tables-1.html#Variables). A "variable" may be either a species, something like laboratory measurements, or a non-organic proxy, like charcoal or XRF measurements, and includes the units of measurement and the value.
#### Simple Harmonization {.tabset}
Let's say we want all samples from which *Poaceae* (grass) taxa have been reported to be grouped together into one pseudo-taxon called *Poaceae-undiff*. **NOTE**: this may not be an ecologically useful grouping, but it is used here for illustration.

There are several ways of grouping taxa, either directly by exporting the file and editing each individual cell, or by creating an external "harmonization" table (which we did in the prior `neotoma` package). First, let's look at how many different ways *Poaceae* appears in these records. We can use the function `str_detect()` from the `stringr` package to look for patterns; it returns `TRUE` or `FALSE` depending on whether the pattern is detected:
```{r echo = FALSE}
# How many different identifications have "Poaceae" been given in these records?
neotomatx %>%
  filter(stringr::str_detect(variablename, "Poaceae"))
```
We can harmonize taxon by taxon in a number of different ways. One way would be to get every instance of a *Poaceae* taxon and just change them directly. Here we are taking the column `variablename` from the `allSamp` object (this is where the taxon names are stored). The square brackets select the rows we're changing, here only rows where we detect `"Poaceae"` in the variable name. For each of those rows, in that column, we assign the value `"Poaceae undiff."`:
```{r echo = FALSE, eval=FALSE}
# Don't run this!
allSamp$variablename[stringr::str_detect(allSamp$variablename, "Poaceae")] <- "Poaceae undiff."
```
There were originally `r sum(stringr::str_detect(neotomatx$variablename, "Poaceae.*"))` different taxa identified as being within the family *Poaceae* (including *Poaceae*, *Poaceae (>40 µm)*, and *Poaceae undiff. (<40 µm)*). The above code reduces them all to a single taxonomic group, *Poaceae undiff.*
Note that this changes *Poaceae* in the `allSamp` object _only_, not in any of the downloaded objects. If we were to call `samples()` again, the taxonomy would return to its original form.
A second way to harmonize taxa is to use an external table, which is especially useful if we want to have an artifact of our choices. For example, a table of pairs (what we want changed, and the name we want it replaced with) can be generated, and it can include regular expressions (if we choose):
| original | replacement |
| -------- | ----------- |
| Poaceae.* | Poaceae-undiff |
| Picea.* | Picea-undiff |
| Plantago.* | Plantago-undiff |
| Quercus.* | Quercus-undiff |
| ... | ... |
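
A table like the one above could be applied with a short loop over its rows. The sketch below is one possible approach in base R, not the method used by either Neotoma package. The harmonization table and taxon names are made up, and the `^` anchors in the patterns are our own addition so that each pattern only matches at the start of a name.

```r
# A made-up harmonization table of regular expressions and replacements.
harmonizer <- data.frame(
  original = c("^Poaceae.*", "^Picea.*"),
  replacement = c("Poaceae-undiff", "Picea-undiff"),
  stringsAsFactors = FALSE
)

# Made-up taxon names standing in for the `variablename` column.
taxon_names <- c("Poaceae", "Poaceae undiff.", "Picea abies", "Betula")

# Apply each row of the table in turn. Names that match no pattern are
# left unchanged.
harmonized <- taxon_names
for (i in seq_len(nrow(harmonizer))) {
  hits <- grepl(harmonizer$original[i], harmonized)
  harmonized[hits] <- harmonizer$replacement[i]
}
harmonized
```

Because the rows are applied in order, the result can depend on row order when patterns overlap, so it is worth keeping the patterns mutually exclusive.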
We can get the list of original names directly from the `taxa()` call, applied to a `sites` object that contains samples, and then export it using `readr::write_csv()`.
```{r countbySitesSamples, eval=FALSE}
taxaplots <- taxa(denmark_dl)
# Save the taxon list to file so we can edit it subsequently.
readr::write_csv(taxaplots, "data/mytaxontable.csv")
```
#### Looking at the Taxonomic Structure {.tabset}
The `taxa` function returns all our taxonomic information, and it provides some additional information, the columns `samples` and `sites` which record the number of samples across all datasets that contain the taxon, and the number of sites with the taxon. The plot below shows the relationship between samples and sites, which we would expect to be somewhat skewed, as it is.
This is effectively a rarefaction curve: the more sites a taxon is found at, the more samples it is found in.
##### Code
```{r PlotTaxonCountsFirst, fig.cap="**Figure**. *A plot of the number of sites a taxon appears in, against the number of samples a taxon appears in.*", eval=FALSE}
taxaplots <- taxa(denmark_dl)
ggplot(data = taxaplots, aes(x = sites, y = samples)) +
  geom_point() +
  stat_smooth(method = 'glm',
              method.args = list(family = 'poisson')) +
  xlab("Number of Sites") +
  ylab("Number of Samples") +
  theme_bw()
```
##### Result
```{r PlotTaxonCounts, echo=FALSE, fig.cap="**Figure**. *A plot of the number of sites a taxon appears in, against the number of samples a taxon appears in.*", message=FALSE}
taxaplots <- taxa(denmark_dl)
ggplot(data = taxaplots, aes(x = sites, y = samples)) +
  geom_point() +
  stat_smooth(method = 'glm',
              method.args = list(family = 'poisson')) +
  xlab("Number of Sites") +
  ylab("Number of Samples") +
  theme_bw()
```
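
The smoothed line in the plot comes from a Poisson generalized linear model of samples on sites. As a sketch, the same kind of model can be fitted directly with `glm()`. The data frame `toy_taxa` is made up for illustration and does not come from the Denmark records.

```r
# Made-up counts: number of sites and samples for six hypothetical taxa.
toy_taxa <- data.frame(sites = c(1, 2, 3, 5, 8, 12),
                       samples = c(2, 5, 9, 20, 41, 70))

# A Poisson GLM with a log link, the model family requested through
# `method.args` in the plotting code.
fit <- glm(samples ~ sites, family = poisson, data = toy_taxa)
coef(fit)

# Expected number of samples for a taxon found at 10 sites.
predict(fit, newdata = data.frame(sites = 10), type = "response")
```

Fitting the model directly gives access to the coefficients and predictions, which the plot alone does not expose.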
#### Editing the Taxonomy Table {-}
The plot (above) is mostly for illustration, but we can see, as a sanity check, that the relationship is as we'd expect. Here, each point represents a separate taxon, roughly, so there is a large density of points (taxa) that plot in the lower left section of the figure, and fewer points in the upper right. This means that there are a large number of taxa that are rarely present and then several that are quite common.
Exporting the taxon table to a `csv` file allows us to edit the table, filtering and selecting taxa based on contextual information such as the `ecologicalgroup` or `taxongroup`. Once you've cleaned up the translation table you can load it in (try to save it under a different file name!), and then apply the transformation:
```{r translationTable, message=FALSE, eval=FALSE}
translation <- readr::read_csv("data/taxontable.csv")
```
After editing the table by hand, we read it back in:
```{r translationDisplay, message=FALSE, echo = FALSE}
translation <- readr::read_csv("data/taxontable.csv")
DT::datatable(translation, rownames = FALSE,
options = list(scrollX = "100%", dom = 't'))
```
You can see we've changed some of the taxon names in the taxon table. To replace the names in the `samples()` output, we'll join the two tables using an `inner_join()` (meaning the `variablename` must appear in both tables for the result to be included), and then we're going to select only those elements of the sample tables that are relevant to our later analysis, using the `harmonizedname` column as our new name for the taxa:
```{r joinTranslation, eval = FALSE}
allSamp <- samples(denmark_dl)
allSamp <- allSamp %>%
  inner_join(translation, by = c("variablename" = "variablename")) %>%
  dplyr::select(!c("variablename")) %>%
  group_by(siteid, sitename, harmonizedname,
           sampleid, units, age,
           agetype, depth, datasetid,
           long, lat) %>%
  summarise(value = sum(value), .groups='keep')
```
```{r harmonizationTableOut, message = FALSE, echo=FALSE}
cleanSamp <- samples(denmark_dl) %>%
  inner_join(translation, by = c("variablename" = "variablename")) %>%
  dplyr::select(!c("variablename")) %>%
  group_by(siteid, sitename, harmonizedname,
           sampleid, units, age,
           agetype, depth, datasetid,
           long, lat) %>%
  summarise(value = sum(value), .groups='keep') %>%
  arrange(sitename, age, harmonizedname)
DT::datatable(head(cleanSamp, n = 50), rownames = FALSE,
              options = list(scrollX = "100%", dom = 't'))
```
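
To see what the join and the summing do, the sketch below runs the same steps on a few made-up rows. The tables `toy_samples` and `toy_translation` are hypothetical. Only the column names `variablename`, `harmonizedname`, `sampleid` and `value` follow the code above.

```r
library(dplyr)

# Made-up sample counts in long format.
toy_samples <- data.frame(
  sampleid = c(1, 1, 1, 2, 2),
  variablename = c("Poaceae", "Poaceae undiff.", "Sphagnum",
                   "Poaceae", "Picea abies"),
  value = c(5, 3, 7, 4, 10),
  stringsAsFactors = FALSE
)

# Made-up translation table; "Sphagnum" is deliberately left out.
toy_translation <- data.frame(
  variablename = c("Poaceae", "Poaceae undiff.", "Picea abies"),
  harmonizedname = c("Poaceae-undiff", "Poaceae-undiff", "Picea-undiff"),
  stringsAsFactors = FALSE
)

# inner_join() drops taxa that are missing from the translation table;
# summarise() then adds up counts that now share a harmonized name.
toy_clean <- toy_samples %>%
  inner_join(toy_translation, by = "variablename") %>%
  group_by(sampleid, harmonizedname) %>%
  summarise(value = sum(value), .groups = "drop")
toy_clean
```

In sample 1 the two *Poaceae* rows collapse into a single row with their counts summed, and *Sphagnum* disappears because it has no entry in the translation table. Any taxon left out of the translation table is silently removed in the same way, so it is worth checking the table for omissions before joining.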
We now have a cleaner set of taxon names compared to the original table, both because of harmonization, and because we cleared out many of the non-**TRSH** (trees and shrubs) taxa from the harmonization table. Plotting the same set of taxa with the new harmonized names results in this plot:
```{r origTableOut, message = FALSE, echo=FALSE, fig.cap="**Figure**. *The same site/sample plot as above, with the new harmonized taxonomy. Note that the distribution of points along the curve is smoother, as we remove some of the taxonomic issues.*"}
taxaplots <- samples(denmark_dl) %>%
inner_join(translation, by = c("variablename" = "variablename")) %>%
dplyr::select(!c("variablename")) %>%
group_by(harmonizedname) %>%
summarise(sites = length(unique(siteid)), samples = length(unique(sampleid)), .groups='keep')
ggplot(data = taxaplots, aes(x = sites, y = samples)) +
geom_point() +
stat_smooth(method = 'glm',
method.args = list(family = 'poisson')) +
xlab("Number of Sites") +
ylab("Number of Samples") +
theme_bw()
```
## Simple Analytics
### Stratigraphic Plotting {.tabset}
To plot a stratigraphic diagram we are only interested in one site and one dataset. Looking at the summary of downloads, we can see that Lake Solsø has two collection units that both have a pollen record. Let's look at the SOLSOE81 collection unit, which is the second download. We get the samples from just that one collection unit by specifying that we want only the samples from the second download.
We can use packages like `rioja` to do stratigraphic plotting for a single record, but first we need to do some different data management. Although we could harmonize again, we're going to simply take the taxa at a single site and plot them in a stratigraphic diagram. However, because you may want to plot multiple sites with harmonized taxa, we have provided examples of both approaches.
#### Raw Taxon
```{r stratiplotraw, message = FALSE}
# Get a particular site, in this case we are simply subsetting the
# `denmark_dl` object to get Lake Solsø:
plottingSite <- denmark_dl[[2]]
# Select only pollen measured using NISP and convert to a "wide"
# table, using proportions. The first column will be "age".
# This turns our "long" table into a "wide" table:
counts <- plottingSite %>%
samples() %>%
toWide(ecologicalgroup = c("TRSH"),
unit = c("NISP"),
elementtypes = c("pollen"),
groupby = "age",
operation = "prop") %>%
arrange(age)
counts <- counts[, colSums(counts > 0.01, na.rm = TRUE) > 5]
```
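The final line of the chunk above keeps only those columns whose value exceeds 0.01 (that is, 1%) in more than five samples. A small base R sketch on invented proportions shows the mechanics. The `toy` table is hypothetical; note that `age` passes the filter here simply because its values are themselves larger than 0.01.

```r
# Toy "wide" table: age plus three taxa as proportions (invented values).
toy <- data.frame(
  age    = seq(100, 700, by = 100),
  Pinus  = c(0.40, 0.35, 0.30, 0.45, 0.50, 0.42, 0.38),
  Betula = c(0.02, 0.00, 0.03, 0.05, 0.04, 0.06, 0.02),
  Tilia  = c(0.00, 0.00, 0.02, 0.00, 0.00, 0.005, 0.00)
)

# `toy > 0.01` is a logical matrix; colSums() counts the TRUE values, i.e.
# the number of samples in which each column exceeds 1%.
n_above <- colSums(toy > 0.01, na.rm = TRUE)
n_above

# Keep the columns that pass the threshold in more than five samples.
kept <- toy[, n_above > 5]
names(kept)
```

In this toy table `Tilia` is dropped, because it exceeds 1% in only one sample.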
#### With Harmonization
```{r stratiplotharm, message = FALSE}
# Get a particular site, in this case we are simply subsetting the
# `denmark_dl` object to get Lake Solsø:
plottingSite <- denmark_dl[[2]]
# Select only pollen measured using NISP and convert to a "wide"
# table, using proportions. The first column will be "age".
# This turns our "long" table into a "wide" table:
counts_harmonized <- plottingSite %>%
samples() %>%
toWide(ecologicalgroup = c("TRSH"),
unit = c("NISP"),
elementtypes = c("pollen"),
groupby = "age",
operation = "prop") %>%
arrange(age) %>%
pivot_longer(-age) %>%
inner_join(translation, by = c("name" = "variablename")) %>%
dplyr::select(!c("name", taxonid)) %>%
group_by(harmonizedname, age) %>%
summarise(value = sum(value), .groups='keep')%>%
pivot_wider(names_from = harmonizedname, values_from = value)
counts_harmonized <- counts_harmonized[, colSums(counts_harmonized > 0.01, na.rm = TRUE) > 5]
```
### {.tabset}
Hopefully the code is pretty straightforward. The `toWide()` function provides you with significant control over the taxa, units and other elements of your data before you get them into the wide matrix (`depth` by `taxon`) that most statistical tools such as the `vegan` package or `rioja` use.
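To make the long-to-wide step concrete, here is a rough sketch of the same idea written with `dplyr` and `tidyr` on an invented table. It is not how `toWide()` is implemented. It only illustrates the reshaping and the conversion to proportions, under the assumption that proportions are calculated within each sample (here, within each `age`).

```r
library(dplyr)
library(tidyr)

# Toy "long" table: one row per taxon per sample (invented counts).
long <- tibble::tibble(
  age          = c(100, 100, 200, 200),
  variablename = c("Pinus", "Betula", "Pinus", "Betula"),
  value        = c(30, 10, 20, 20)
)

wide <- long %>%
  group_by(age) %>%
  mutate(prop = value / sum(value)) %>%   # proportion within each age
  ungroup() %>%
  dplyr::select(age, variablename, prop) %>%
  # One column per taxon; taxa missing from a sample are filled with 0.
  pivot_wider(names_from = variablename, values_from = prop,
              values_fill = 0) %>%
  arrange(age)

wide
```

The result has one row per sample and one column per taxon, with `age` as the first column, which is the layout the plotting code expects.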
To plot the data we can use `rioja`'s `strat.plot()`, sorting the taxa using weighted averaging scores (`wa.order`). I've also added a CONISS plot to the edge of the plot, to show how the new *wide* data frame works with distance metric functions.
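As a small, self-contained illustration of the clustering step, the sketch below runs CONISS on an invented proportion table. The data and the choice of two zones are assumptions made for the example; `chclust()` comes from `rioja`, while `dist()` and `cutree()` are functions from the base `stats` package.

```r
library(rioja)

# Invented proportions for six samples, ordered by age: the first three
# samples are Pinus-dominated, the last three Betula-dominated.
props <- data.frame(
  Pinus  = c(0.80, 0.75, 0.70, 0.20, 0.25, 0.30),
  Betula = c(0.20, 0.25, 0.30, 0.80, 0.75, 0.70)
)

# Distance between samples on square-root transformed proportions;
# only the taxon columns go into the distance calculation.
d <- dist(sqrt(props))

# Constrained clustering: only stratigraphically adjacent samples can merge.
clust_toy <- chclust(d, method = "coniss")

# Assign each sample to one of two zones.
zones <- cutree(clust_toy, k = 2)
zones
```

Because the clustering is constrained by sample order, each zone is a contiguous block of samples, which is what lets us draw zone boundaries on a stratigraphic diagram.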
#### Raw Taxon
```{r plotStrigraphraw, message=FALSE, warning=FALSE, out.width='90%'}
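# Perform constrained clustering on the taxon columns only (the first
# column, `age`, is left out of the distance calculation):
clust <- rioja::chclust(dist(sqrt(counts[, -1])),
method = "coniss")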
# Plot the stratigraphic plot, converting proportions to percentages:
plot <- rioja::strat.plot(counts[,-1] * 100, yvar = counts$age,
title = plottingSite$sitename,
ylabel = "Calibrated Years BP",
xlabel = "Pollen (% of Trees and Shrubs)",
srt.xlabel = 70,
y.rev = TRUE,
clust = clust,
wa.order = "topleft",
scale.percent = TRUE)
rioja::addClustZone(plot, clust, 4, col = "red")
```
#### With Harmonization
```{r plotStrigraphharm, message=FALSE, warning=FALSE, out.width='90%'}
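# Perform constrained clustering on the taxon columns only (the first
# column, `age`, is left out of the distance calculation):
clust <- rioja::chclust(dist(sqrt(counts_harmonized[, -1])),
method = "coniss")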
# Plot the stratigraphic plot, converting proportions to percentages:
plot <- rioja::strat.plot(counts_harmonized[,-1] * 100, yvar = counts_harmonized$age,
title = plottingSite$sitename,
ylabel = "Calibrated Years BP",
xlabel = "Pollen (% of Trees and Shrubs)",
srt.xlabel = 70,
y.rev = TRUE,
clust = clust,
wa.order = "topleft",
scale.percent = TRUE)
rioja::addClustZone(plot, clust, 4, col = "red")
```
###
## Conclusion
So, we've done a lot in this example. We've (1) searched for sites using site names and geographic parameters, (2) filtered results using temporal and spatial parameters, (3) obtained sample information for the selected datasets and (4) performed basic analysis including the use of climate data from rasters. Hopefully you can use these examples as templates for your own future work, or as a building block for something new and cool!
## Appendix Sections
### Installing packages on your own {#localinstall}
We use several packages in this document, including `leaflet`, `sf` and others. We load the packages using the `pacman` package, which automatically installs any packages that are not already in your library.
```{r setupFake, eval=FALSE}
options(warn = -1)
pacman::p_load(neotoma2, dplyr, ggplot2, sf, geojsonsf, leaflet, terra, DT, readr, stringr, rioja)
```
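If you would rather not depend on `pacman`, the same install-if-missing behaviour can be approximated in base R. This is only a sketch: `load_or_install()` is a hypothetical helper written for this example, not part of `pacman` or any other package, and the package list is kept deliberately short.

```r
# Packages to load; a short list of base packages for the example.
pkgs <- c("stats", "utils")

# Hypothetical helper, not part of pacman.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # Install only when the package cannot be found locally.
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # Attach the package so its functions are available by name.
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

load_or_install(pkgs)
```

To use it for this document you would swap in the package names passed to `pacman::p_load()` above.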
Note that R is sensitive to the order in which packages are loaded: when two packages export a function with the same name, the one loaded last masks the other. Using `neotoma2::` tells R explicitly that you want to use the `neotoma2` package to run a particular function. So, for a function like `filter()`, which also exists in other packages such as `dplyr`, you may see an error that looks like:
```bash
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "sites"
```
In that case it's likely that the wrong package is trying to run `filter()`, so explicitly adding `dplyr::` or `neotoma2::` in front of the function name (e.g., `neotoma2::filter()`) is good practice.
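The same kind of name clash exists between `dplyr` and the base `stats` package, which makes for an easy demonstration that does not need any Neotoma data. In this sketch the `pkg::` prefix decides which `filter()` runs; the inputs (`1:5` and the built-in `mtcars` data frame) are chosen only for illustration.

```r
library(dplyr)

# stats::filter() applies a linear filter (here a 3-point moving average)
# to a numeric series.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed

# dplyr::filter() keeps the rows of a data frame that match a condition.
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)
```

Both calls work regardless of the order in which the packages were loaded, because the prefix removes any ambiguity.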
### Piping in `R` {.tabset}
Piping is a technique that simplifies the process of chaining multiple operations on a data object. It involves using either of these operators: `|>` or `%>%`. `|>` is a base R operator, while `%>%` comes from the `magrittr` package, which is part of the `tidyverse` ecosystem in R. In `neotoma2` we use `%>%`.
The pipe operator works like a real-life pipe, which carries water from one location to another. In programming, the output of the expression on the left-hand side of the pipe is passed as the first argument to the function on the right-hand side. This makes code easier to write and read. It also reduces the number of named intermediate objects you need to create and keep track of during data processing.
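Before turning to `neotoma2`, a tiny numeric sketch shows that nested calls and piped calls are the same computation written in a different order. The vector `x` is invented for the example, and the base R pipe `|>` assumes R 4.1.0 or later.

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested calls: read from the inside out.
nested <- sum(sqrt(x))

# The same computation with the magrittr pipe: read from left to right.
piped <- x %>% sqrt() %>% sum()

# And with the base R pipe.
native <- x |> sqrt() |> sum()

c(nested, piped, native)
```

All three values are identical; the pipe changes how the code reads, not what it computes.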
Without using pipes you can use the `neotoma2` R package to retrieve a site and then plot it by doing:
```r
# Retrieve the site
plot_site <- neotoma2::get_sites(sitename = "%ø%")
# Plot the site
neotoma2::plotLeaflet(object = plot_site)
```
This would create a variable `plot_site` that we will not need any more, but it was necessary so that we could pass it to the `plotLeaflet` function.
With the pipe (`%>%`) we do not need to create the variable; we can simply chain the two calls. Notice that `plotLeaflet()` doesn't need the `object` argument because the response of `get_sites(sitename = "%ø%")` gets passed directly into the function.
#### Code
```{r piping code, eval=FALSE}
# get_sites and pipe. The `object` parameter for plotLeaflet will be the
# result of the `get_sites()` function.
get_sites(sitename = "%ø%") %>%
plotLeaflet()
```
#### Result
```{r piping result, echo=FALSE}
# get_sites and pipe
get_sites(sitename = "%ø%") %>%
plotLeaflet()
```