-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathPedigree_verification.Rmd
486 lines (360 loc) · 15.5 KB
/
Pedigree_verification.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
---
title: "Selecting markers for pedigree verification"
author: Lindsay Clark, HPCBio, Roy J. Carver Biotechnology Center, University of Illinois,
Urbana-Champaign
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
```
In this workflow, we'll identify markers useful for identifying accessions and
verifying pedigrees, and export those markers for KASP marker design.
## Setup
Be sure you have completed the [setup](Setup.md) steps before working through
this pipeline. First we'll load all the needed packages and functions.
```{r libs, message = FALSE, warning = FALSE}
library(VariantAnnotation)
library(Rsamtools)
library(adegenet)
library(ggplot2)
library(ape)
library(dplyr)
source("src/marker_stats.R")
source("src/getNumGeno.R")
source("src/evaluate_marker_sets.R")
```
Next we'll point to our VCF and reference genome files. Edit these lines to
point to your dataset.
```{r files}
bg <- "data/331_new_data_vcf_IBRC.vcf.bgz"
rds <- "data/yam336.m2M2vsnps_missing0.9.recode.rds" # unfiltered markers; optional
refgenome <- FaFile("data/TDr96_F1_v2_PseudoChromosome.rev07.fasta")
```
## Population statistics
We'll import all the genotypes in numeric format, indicating number of copies
of the alternative allele.
```{r numeric_gen}
numgen <- getNumGeno(bg)
numgen[1:10,1:10]
```
### Check for clonal duplicates
It is good to check that there aren't any clonal duplicates in the dataset, as
these could bias downstream analysis. We'll calculate identity-by-state
and look at its distribution.
```{r echo = FALSE}
set.seed(1019) # to make report reproducible
```
```{r inddist}
myIBS <- interIndividualIBS(numgen)
hist(myIBS, xlab = "Identity-by-state", main = "Histogram of IBS")
```
```{r njtree, fig.height = 6, fig.width = 6}
plot(nj(1 - myIBS), type = "unrooted", show.tip.label = FALSE)
```
Here it looks like the most closely related individuals are family members, not
clones. If there were some values closer to one, however, clones could be
removed using the `removeClones` function.
```{r removeclones}
numgen <- removeClones(numgen, myIBS)
```
### Observed and expected heterozygosity
Let's look at the distribution of allele frequencies in this dataset.
```{r alfreq}
alfreq <- rowMeans(numgen, na.rm = TRUE) / 2
hist(alfreq, xlab = "Allele frequency", main = "Histogram of alternative allele frequency")
```
We can look at the relationship between observed and expected heterozygosity
( _Ho_ and _He_, respectively).
```{r hohe}
He <- Expected_het(alfreq)
Ho <- rowMeans(numgen == 1, na.rm = TRUE)
hist(Ho / He)
```
In this case the mode is a little above 1, suggesting the presence of hybrid
lines and/or polyploids in the dataset. The bump at 2 indicates likely paralogs.
In this case we'll set a cutoff at 1.5.
```{r hohe_vs_alfreq}
hohe_cutoff <- 1.5
ggplot(mapping = aes(x = alfreq, y = Ho/He)) +
geom_point(alpha = 0.05) +
geom_hline(yintercept = hohe_cutoff, color = "blue") +
labs(x = "Allele frequency") +
ggtitle("Filtering for putative paralogs")
```
We can see what proportion of markers we would lose at this cutoff. It's about 11%.
```{r mean_hohe}
mean(Ho/He > hohe_cutoff)
```
We'll do that filtering now, then start a data frame to track statistics about each SNP.
```{r filter_hohe}
keep1 <- which(Ho/He < hohe_cutoff)
numgen <- numgen[keep1,]
snpdf <- data.frame(row.names = rownames(numgen),
AlFreq = alfreq[keep1],
Ho = Ho[keep1],
He = He[keep1])
```
In this case there are no missing data, but I am going to add that calculation
in for the sake of reusing the workflow.
```{r missrate}
snpdf$MissingRate <- rowMeans(is.na(numgen))
summary(snpdf)
```
We can also look at the heterozygosity across individuals. The two outlier
individuals we see below are likely to be polyploids or inter-species hybrids.
We'll filter them out.
```{r filter_ind}
hetByInd <- colMeans(numgen == 1, na.rm = TRUE)
hist(hetByInd)
numgen <- numgen[, hetByInd < 0.5]
```
## Genetic groups
We could simply choose markers with high allele frequency at this point, but
instead I am going to identify genetic groups first. That way, I can test the
allele frequency within each genetic group. We generally want a variety of
markers so that we have some that are variable within each group, and some
that can distinguish groups. We'll use the Discriminant Analysis of Principal
Components (DAPC) protocol implemented in the `adegenet` package. In order
to knit this document I have to set `n.pca` and `n.da` manually, but for
your work I recommend a first pass where they are left as `NULL` so that you
can chose them based on where the graphs plateau.
```{r echo = FALSE}
set.seed(1016) # to ensure report reproducibility
```
```{r findclusters, eval = FALSE}
myclust <- find.clusters(t(numgen), n.pca = 100)
```
```{r loadclust, echo = FALSE}
load("results/myclust.RData")
plot(myclust$Kstat, xlab = "Number of clusters", ylab = "BIC",
main = "Value of BIC vs. number of clusters", type = "o", col = "blue")
```
Based on the BIC graph, we'll select ten clusters.
Check the size of the groups. If it is very uneven, you may want fewer groups.
```{r clustsize}
myclust$size
```
Now perform DAPC.
```{r dapc}
mydapc <- dapc(t(numgen), grp = myclust$grp, scale = TRUE, n.pca = 50, n.da = 2)
```
```{r dapcscatter}
scatter(mydapc, posi.da = "topright")
```
Group assignments are in `mydapc$assign`. You many want to inspect some of the
more differentiated groups to see if they are biologically meaningful.
```{r checkgroups}
colnames(numgen)[mydapc$assign == 9]
colnames(numgen)[mydapc$assign == 4]
```
Now we can get allele frequencies within each genetic group.
```{r freq_by_group}
grp_alfreq <- sapply(levels(mydapc$assign),
function(x) rowMeans(numgen[, mydapc$assign == x], na.rm = TRUE) / 2)
colnames(grp_alfreq) <- paste0("AlFreq_Grp", colnames(grp_alfreq))
head(grp_alfreq)
```
## Technical parameters
For this portion of the workflow, we'll import SNP metadata using the
`VariantAnnotation` package. If you are just working with one VCF,
use the following code:
```{r importvcf, eval = FALSE}
myvcf <- readVcf(bg,
param = ScanVcfParam(geno = NA))
```
Otherwise, if you have a larger VCF, with a bigger set of markers,
read it from the RDS made during setup:
```{r imporvcf2}
myvcf <- readRDS(rds)
rowRanges(myvcf)
```
Since we have quality scores, we will look at the distribution.
```{r qualhist}
hist(rowRanges(myvcf)$QUAL, xlab = "Quality score",
main = "Histogram of quality scores in large VCF")
```
This suggests filtering to only keep the highest scores is advisable.
We will also make sure to keep any SNPs that were in our smaller VCF.
```{r filtervcf}
temp <- paste(seqnames(myvcf), start(myvcf), sep = "_")
myvcf <- myvcf[rowRanges(myvcf)$QUAL > 900 |
temp %in% rownames(snpdf),]
```
### GC content
PCR tends to work best in regions that are 40-60% GC. We will test
the GC content for the 50-bp flanking region for each SNP in our
table.
```{r gccontent}
snpdf$GCcontent <- gcContent(myvcf, rownames(snpdf), refgenome)
hist(snpdf$GCcontent, xlab = "GC content", main = "Histogram of GC content")
mean(snpdf$GCcontent >= 0.4 & snpdf$GCcontent <= 0.6)
```
About 36.5% of markers are within the target range for GC content.
We can subset to just look at these.
```{r gcsubset}
snpdf2 <- filter(snpdf, GCcontent >= 0.4, GCcontent <= 0.6)
```
### Number of flanking SNPs
Although we'll annotate flanking SNPs, the fewer there are, the less likely
the marker will have technical problems.
```{r nflank}
snpdf2$Nflanking <- nFlankingSNPs(myvcf, rownames(snpdf2))
hist(snpdf2$Nflanking, xlab = "Number of flanking SNPs",
main = "Histogram of number of flanking SNPs")
table(snpdf2$Nflanking)
```
The cutoff is arbitrary, but let's keep markers with two or fewer flanking SNPs.
```{r flankfilt}
snpdf3 <- filter(snpdf2, Nflanking <= 2)
```
## Choosing a set of markers
Since we are selecting a small set of markers, we want to not only make sure
that each individual marker has a high minor allele frequency to maximize its
information content, but also that the markers complement each other well so
that they can distinguish individuals and genetic groups.
Two approaches are presented here. One is a Galaxy tool developed by Carlos
Ignacio for finding sets of markers that distinguish all accessions. The
second is an algorithm I developed to maximize diversity captured within and
between genetic groups.
### Galaxy tool
First, we have to export our set of suitable markers back to VCF. Some conversion
of marker and chromosome names is done here in order to get everything to match.
```{r subvcf, eval = FALSE}
rr3 <- rowRanges(myvcf)[match(rownames(snpdf3),
paste(seqnames(myvcf), start(myvcf), sep = "_"))]
rr3a <- GRanges(sub("chrom", "OM", seqnames(rr3)), ranges(rr3))
subvcf <- readVcf(bg, genome = seqinfo(rr3a),
param = ScanVcfParam(which = rr3a))
writeVcf(subvcf, filename = "results/marker_subset.vcf")
```
Now, on Galaxy, use the
[Purity tool](http://cropgalaxy.excellenceinbreeding.org/root?tool_id=purity_beta)
on that VCF file. I set it to select 50 markers after considering 10,000
solutions, and left the distance cutoff at 0.05. Two files were output, and I
downloaded the first to my `results` folder. Below I will import it to get
the marker set.
```{r}
markers_purity <- read.delim("results/Galaxy9-[Purity_(beta)_on_data_8].tabular")$Name[1:50]
sort(markers_purity)
```
### Simulated annealing algorithm
In the file [evaluate_marker_sets.R](src/evaluate_marker_sets.R) there is
a function to use a
[simulated annealing algorithm](https://en.wikipedia.org/wiki/Simulated_annealing)
to find optimal sets of markers that capture diversity within and between
genetic groups.
```{r echo = FALSE}
set.seed(1020) # for report reproducibility
```
```{r choosemarkers, message = FALSE}
markers_simanneal <- findMarkerSet(grp_alfreq[rownames(snpdf3),], nSNP = 50)$Set
sort(markers_simanneal)
```
### Comparing the results
How well do the markers distinguish individuals in the dataset?
```{r markerset_dist}
ibs_purity <- interIndividualIBS(numgen[markers_purity,])
hist(ibs_purity, xlab = "Identity-by-state",
main = "Histogram of IBS using Purity marker subset")
mean(ibs_purity == 1)
propUnique(numgen[markers_purity,])
ibs_simanneal <- interIndividualIBS(numgen[markers_simanneal,])
hist(ibs_simanneal, xlab = "Identity-by-state",
main = "Histogram of IBS using simulated annealing marker subset")
mean(ibs_simanneal == 1)
propUnique(numgen[markers_simanneal,])
```
If two individuals were selected at random, there is a 0.2% chance that they
would have identical genotypes across all markers using the simulated annealing
algorithm, as opposed to a 0.01% chance using markers from the Purity algorithm.
This corresponds to 75% of the individuals being unique with the simulated
annealing set and 97% of individuals being unique with the Purity set.
So, the Purity algorithm is somewhat better for distinguishing individuals.
We can also compare the geometric mean of expected heterozygosity in all of the
populations.
```{r markerset_div}
DivScore(grp_alfreq[markers_purity,])
DivScore(grp_alfreq[markers_simanneal,])
```
The simulated annealing algorithm gives higher expected heterozygosity.
We can also see the geometric mean of Jost's D, a differentiation statistic
among populations.
```{r markerset_diff}
DiffScore(grp_alfreq[markers_purity,])
DiffScore(grp_alfreq[markers_simanneal,])
```
The simulated annealing algorithm gives higher differentiation among populations.
## Exporting markers
Now we can get the flanking sequences for the markers that we selected.
```{r formatkasp}
markerseq1 <- formatKasp(myvcf, markers_purity, refgenome)
markerseq2 <- formatKasp(myvcf, markers_simanneal, refgenome)
head(markerseq1)
```
These can be exported to a spreadsheet.
```{r export}
write.csv(markerseq1, file = "results/pedigree_verification_markers_Purity.csv",
row.names = FALSE)
write.csv(markerseq2, file = "results/pedigree_verification_markers_simanneal.csv",
row.names = FALSE)
```
## Curve to determine optimal number of markers
An additional analysis was performed where the number of markers in the dataset was
varied, in order to determine the optimal number of markers.
```{r}
markers_purity10 <- read.delim("results/Galaxy11-[Purity_(beta)_on_data_8].tabular")$Name[1:10]
markers_purity20 <- read.delim("results/Galaxy13-[Purity_(beta)_on_data_8].tabular")$Name[1:20]
markers_purity30 <- read.delim("results/Galaxy15-[Purity_(beta)_on_data_8].tabular")$Name[1:30]
markers_purity40 <- read.delim("results/Galaxy17-[Purity_(beta)_on_data_8].tabular")$Name[1:40]
markers_purity75 <- read.delim("results/Galaxy19-[Purity_(beta)_on_data_8].tabular")$Name[1:75]
markers_purity100 <- read.delim("results/Galaxy21-[Purity_(beta)_on_data_8].tabular")$Name[1:100]
markers_purity_list <- list(markers_purity10, markers_purity20, markers_purity30, markers_purity40,
markers_purity, markers_purity75, markers_purity100)
curve_df_purity <- data.frame(N_markers = lengths(markers_purity_list),
Prop_unique = sapply(markers_purity_list,
function(x) propUnique(numgen[x,])),
DiffScore = sapply(markers_purity_list,
function(x) DiffScore(grp_alfreq[x,])),
DivScore = sapply(markers_purity_list,
function(x) DivScore(grp_alfreq[x,])))
```
```{r eval = FALSE}
set.seed(1117)
markers_simanneal_list <- lapply(c(10, 20, 30, 40, 50, 75, 100),
function(x) findMarkerSet(grp_alfreq[rownames(snpdf3),], nSNP = x)$Set)
curve_df_simanneal <- data.frame(N_markers = lengths(markers_simanneal_list),
Prop_unique = sapply(markers_simanneal_list,
function(x) propUnique(numgen[x,])),
DiffScore = sapply(markers_simanneal_list,
function(x) DiffScore(grp_alfreq[x,])),
DivScore = sapply(markers_simanneal_list,
function(x) DivScore(grp_alfreq[x,])))
save(markers_simanneal_list, curve_df_simanneal, file = "results/simanneal_curve.RData")
```
```{r echo = FALSE}
load("results/simanneal_curve.RData")
```
```{r}
curve_df_purity$Method <- "Purity"
curve_df_simanneal$Method <- "Simulated annealing"
curve_df_comb <- rbind(curve_df_purity, curve_df_simanneal)
ggplot(curve_df_comb, aes(x = N_markers, y = Prop_unique, color = Method)) +
geom_line()
```
With 75 or 100 markers, all accessions could be distinguished using markers
selected with the Purity method. The simulated annealing method still only
distinguished 88% of accessions with 100 markers, as it was trying to optimize
diversity rather than identification of accessions.
```{r}
ggplot(curve_df_comb, aes(x = N_markers, y = DiffScore, color = Method)) +
geom_line()
```
The simulated annealing algorithm was better than the Purity method at
differentiating populations. The number of markers included did not
have much of an impact on either method.
```{r}
ggplot(curve_df_comb, aes(x = N_markers, y = DivScore, color = Method)) +
geom_line()
```
As the number of markers increased, the Purity method captured more diversity
within populations, although not as much as the simulated annealing method.