-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathMarker_assisted_selection.Rmd
338 lines (248 loc) · 9.93 KB
/
Marker_assisted_selection.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
---
title: "Selecting markers for marker-assisted selection"
author: Lindsay Clark, HPCBio, Roy J. Carver Biotechnology Center, University of Illinois,
Urbana-Champaign
date: "October 2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Setup
Be sure you have completed the [setup](Setup.md) steps before working through
this pipeline. First we'll load all the needed packages and functions.
```{r libs, message = FALSE, warning = FALSE}
library(VariantAnnotation)
library(Rsamtools)
library(ggplot2)
library(viridis)
source("src/marker_stats.R")
source("src/getNumGeno.R")
source("src/qtl_markers.R")
```
Below are names of genomics files. Edit these to match your file names.
```{r files}
bg <- "data/unique_173_clones.recode.vcf.bgz"
rds <- "data/yam336.m2M2vsnps_missing0.9.recode.rds" # unfiltered markers; optional
refgenome <- FaFile("data/TDr96_F1_v2_PseudoChromosome.rev07.fasta")
```
## Data import
### Significant SNPs
We will import a spreadsheet listing markers of interest. I reformatted the
Excel file that was provided to make it more compatible with R. (I.e., deleted
all header rows aside from the top one, deleted empty rows, merged multiple
rows belonging to the same marker, and listed the trait in each row.)
```{r qtl}
yam_qtl <- read.csv("data/yam_qtl.csv", stringsAsFactors = FALSE)
str(yam_qtl)
```
We will make a chromosome column to match the chromosome names in the FASTA and
VCF files. We'll also make a marker name column with the allele trimmed off.
```{r qtlchr}
yam_qtl$Chromosome <- sub("_[[:digit:]]+_[ACGT]$", "", yam_qtl$Marker)
yam_qtl$Marker_short <- sub("_[ACGT]$", "", yam_qtl$Marker)
head(yam_qtl)
```
### Phenotypes
We can read in phenotype data so that we can see how well SNPs predict phenotypes.
```{r importpheno}
pheno <- read.csv("data/pheno_data all 174.csv")
head(pheno)
```
The column names should be made to match the QTL spreadsheet.
```{r matchtraits}
traits <- unique(yam_qtl$Trait)
names(pheno)
names(pheno) <- gsub("_201[78]", "", names(pheno))
names(pheno) <- gsub("_", " ", names(pheno))
names(pheno)
setdiff(traits, names(pheno)) # traits from the QTL file that haven't been matched in the phenotype file
setdiff(names(pheno), traits) # traits from the phenotype file that haven't been matched in the QTL file
names(pheno)[names(pheno) == "Spines on tuber"] <- "Spines on tuber surface"
names(pheno)[names(pheno) == "No of tubers"] <- "Number of tubers per plant"
names(pheno)[names(pheno) == "Yield plant"] <- "Yield per plant"
all(traits %in% names(pheno)) # should be TRUE
```
### Genotypes and annotations from VCF
We will specify ranges in which we wish to look at SNPs for KASP marker design.
Let's look within 100 kb of each significant SNP.
```{r qtlranges}
search_distance <- 1e5
qtl_ranges <- GRanges(yam_qtl$Chromosome,
IRanges(start = yam_qtl$Position - search_distance,
end = yam_qtl$Position + search_distance))
names(qtl_ranges) <- yam_qtl$Marker_short
qtl_ranges$Trait <- yam_qtl$Trait
```
We will import numeric genotypes just within these ranges.
```{r numgen}
numgen <- getNumGeno(bg, ranges = qtl_ranges)
str(numgen)
```
There are 5684 markers across 173 individuals, and genotypes are coded from
zero to two. We will change the accession names to match the phenotype spreadsheet.
```{r matchaccessions}
if(all(sub("_", "", colnames(numgen)) %in% pheno$DRS)){
colnames(numgen) <- sub("_", "", colnames(numgen))
}
all(colnames(numgen) %in% pheno$DRS) # should be TRUE
```
We will also import SNP metadata within these ranges.
```{r importvcf}
myvcf <- readVcf(bg,
param = ScanVcfParam(geno = NA, which = qtl_ranges))
rowRanges(myvcf)
```
We can see that the `paramRangeID` column indicates which original marker each
SNP is near. Since there were some significant SNPs close to each other, that
also means we have some duplicates in both `numgen` and `myvcf`.
```{r dupcheck}
identical(rownames(numgen), names(rowRanges(myvcf)))
as.logical(anyDuplicated(rownames(numgen)))
```
### Unfiltered VCF
In this case we had a much larger VCF with rarer SNPs, so we will import that too.
```{r imporvcf2}
bigvcf <- readRDS(rds)
rowRanges(bigvcf)
```
Since we have quality scores, we will look at the distribution.
```{r qualhist}
hist(rowRanges(bigvcf)$QUAL, xlab = "Quality score",
main = "Histogram of quality scores in large VCF")
```
This suggests filtering to only keep the highest scores is advisable.
We will also make sure to keep any SNPs that were in our smaller VCF.
```{r filtervcf}
temp <- paste(seqnames(bigvcf), start(bigvcf), sep = "_")
bigvcf <- bigvcf[rowRanges(bigvcf)$QUAL > 900 |
temp %in% names(myvcf),]
rm(temp)
```
Lastly, we will filter the VCF to only contain SNPs in our QTL ranges.
```{r subsetvcf}
bigvcf <- subsetByOverlaps(bigvcf, qtl_ranges)
```
## Technical parameters for marker design from significant SNPs
Ideally, we would like to design markers directly from the significant hits.
We should check that they will make reasonably good KASP markers first,
however.
### GC content
PCR will work best when GC content is 40-60%.
```{r gccontent}
yam_qtl$GCcontent <- gcContent(myvcf, yam_qtl$Marker_short, refgenome)
hist(yam_qtl$GCcontent, xlab = "GC content",
main = "GC content flanking significant SNPs", breaks = 20)
```
Many are below the desired range, so we may see if there are any nearby
SNPs in LD that have better GC content.
### Number of flanking SNPs
A few flanking SNPs are okay, but we want to make sure none of these
have an excessive amount.
```{r flankingsnps}
yam_qtl$Nflanking <- nFlankingSNPs(bigvcf, yam_qtl$Marker_short)
hist(yam_qtl$Nflanking, xlab = "Number of flanking SNPs",
main = "Number of flanking SNPs within 50 bp of significant SNPs")
table(yam_qtl$Nflanking)
```
For those with three or more, we might see if there are better markers.
## Evaluating nearby markers
### Finding markers in linkage disequilibrium (LD)
Below is a function that uses that information to estimate LD of every SNP
within range with the significant SNP. We will reorder the results to match
the table of significant SNPs.
```{r ld}
ld <- LD(numgen, myvcf)
ld <- ld[yam_qtl$Marker_short]
```
Let's also extract start positions for the sake of plotting.
```{r grangeslist}
snplist <- split(rowRanges(myvcf), rowRanges(myvcf)$paramRangeID)
snplist <- snplist[yam_qtl$Marker_short]
positions <- start(snplist)
```
```{r plotld}
i <- 1
ggplot(mapping = aes(x = positions[[i]], y = ld[[i]])) +
geom_point(alpha = 0.3) +
labs(x = "Position", y = "R-squared",
title = paste("Linkage disequilibrium with", names(snplist)[i]))
```
The actual SNP of interest shows up as 100% LD in the middle of the range. There
are a few nearby around 50% LD, which is not great but we might consider those if
the SNP of interest is in a very low GC region, for example.
### Finding markers correlating with the trait
For all markers nearby to our significant SNPs, let's also look at the R-squared
with the corresponding trait.
```{r rsq, warning = FALSE}
phen_rsq <- phenoCorr(numgen, myvcf, yam_qtl$Marker_short, yam_qtl$Trait,
pheno) ^ 2
phen_rsq <- phen_rsq[yam_qtl$Marker_short]
```
```{r plotrsq}
ggplot(mapping = aes(x = positions[[i]], y = phen_rsq[[i]],
color = ld[[i]])) +
geom_point(alpha = 0.7) +
labs(x = "Position", y = "R-squared",
title = paste("Association with", yam_qtl$Trait[i]),
color = "LD with hit") +
scale_color_viridis()
```
## Choosing markers to output
We may want to output a table of the best markers, along with some statistics so that
we can manually choose among them. Let's take the top ten R-squared values for
association with the trait of interest, and also make sure to get the significant
SNP itself.
```{r top10}
n <- 10 # edit this number if you want to keep a different number of markers
top10 <- lapply(phen_rsq, function(x){
x1 <- sort(x, decreasing = TRUE)
if(length(x1) > n){
x1 <- x1[1:n]
}
return(names(x1))
} )
top10tab <- utils::stack(top10)
colnames(top10tab) <- c("SNP_ID", "QTL")
# Add in any QTL that weren't in top 10 associated SNPs
toadd <- setdiff(yam_qtl$Marker_short, top10tab$SNP_ID)
top10tab <- rbind(top10tab,
data.frame(SNP_ID = toadd, QTL = toadd))
top10tab <- top10tab[order(factor(top10tab$QTL, levels = yam_qtl$Marker_short)),]
```
Now we'll get the KASP-formatted sequence for these markers.
```{r formatkasp}
outtab <- formatKasp(bigvcf, top10tab$SNP_ID, refgenome)
outtab <- cbind(outtab, top10tab[,"QTL", drop = FALSE])
```
We'll add the trait name for each QTL.
```{r qtltrait}
outtab$Trait <- yam_qtl$Trait[match(outtab$QTL, yam_qtl$Marker_short)]
```
We'll add in linkage disequilibrium and trait association data, and also mark
which allele was positively associated with the trait.
```{r ldrsq, warning = FALSE}
extr <- function(x, y, lst){
return(lst[[x]][y])
}
outtab$LD_with_QTL <- mapply(extr, outtab$QTL, outtab$SNP_ID, MoreArgs = list(ld))
outtab$R2_with_trait <- mapply(extr, outtab$QTL, outtab$SNP_ID, MoreArgs = list(phen_rsq))
als <- whichAlleleList(numgen, myvcf, yam_qtl$Marker_short, yam_qtl$Trait,
pheno)
als <- als[yam_qtl$Marker_short]
outtab$Pos_allele <- mapply(function(x, y, ...) as.character(extr(x, y, ...)),
outtab$QTL, outtab$SNP_ID, MoreArgs = list(als))
```
We will also add the GC content and number of flanking SNPs.
```{r gcouttab}
outtab$GC_content <- gcContent(myvcf, outtab$SNP_ID, refgenome)
outtab$N_flanking <- nFlankingSNPs(bigvcf, outtab$SNP_ID)
head(outtab)
```
Now we have a data frame that we can export to a spreadsheet, and we can
manually select SNPs for development as KASP markers. I recommend selecting
the SNP matching the QTL, plus one or two more SNPs that are similarly
associated with the trait and have as close to optimal GC content as possible.
```{r export}
write.csv(outtab, file = "results/mas_markers.csv", row.names = FALSE)
```