Marker_assisted_selection.Rmd

---
title: "Selecting markers for marker-assisted selection"
author: Lindsay Clark, HPCBio, Roy J. Carver Biotechnology Center, University of Illinois,
  Urbana-Champaign
date: "October 2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Setup

Be sure you have completed the [setup](Setup.md) steps before working through
this pipeline.  First we'll load all the needed packages and functions.

```{r libs, message = FALSE, warning = FALSE}
library(VariantAnnotation)
library(Rsamtools)
library(ggplot2)
library(viridis)
source("src/marker_stats.R")
source("src/getNumGeno.R")
source("src/qtl_markers.R")
```

Below are names of genomics files.  Edit these to match your file names.

```{r files}
bg <- "data/unique_173_clones.recode.vcf.bgz"
rds <- "data/yam336.m2M2vsnps_missing0.9.recode.rds" # unfiltered markers; optional
refgenome <- FaFile("data/TDr96_F1_v2_PseudoChromosome.rev07.fasta")
```

## Data import

### Significant SNPs

We will import a spreadsheet listing markers of interest.  I reformatted the
Excel file that was provided to make it more compatible with R. (I.e., deleted
all header rows aside from the top one, deleted empty rows, merged multiple
rows belonging to the same marker, and listed the trait in each row.)

```{r qtl}
yam_qtl <- read.csv("data/yam_qtl.csv", stringsAsFactors = FALSE)

str(yam_qtl)
```

We will make a chromosome column to match the chromosome names in the FASTA and
VCF files.  We'll also make a marker name column with the allele trimmed off.

```{r qtlchr}
yam_qtl$Chromosome <- sub("_[[:digit:]]+_[ACGT]$", "", yam_qtl$Marker)
yam_qtl$Marker_short <- sub("_[ACGT]$", "", yam_qtl$Marker)
head(yam_qtl)
```

### Phenotypes

We can read in phenotype data so that we can see how well SNPs predict phenotypes.

```{r importpheno}
pheno <- read.csv("data/pheno_data all 174.csv")
head(pheno)
```

The column names should be made to match the QTL spreadsheet.

```{r matchtraits}
traits <- unique(yam_qtl$Trait)
names(pheno)

names(pheno) <- gsub("_201[78]", "", names(pheno))
names(pheno) <- gsub("_", " ", names(pheno))
names(pheno)

setdiff(traits, names(pheno)) # traits from the QTL file that haven't been matched in the phenotype file
setdiff(names(pheno), traits) # traits from the phenotype file that haven't been matched in the QTL file

names(pheno)[names(pheno) == "Spines on tuber"] <- "Spines on tuber surface"
names(pheno)[names(pheno) == "No of tubers"] <- "Number of tubers per plant"
names(pheno)[names(pheno) == "Yield plant"] <- "Yield per plant"

all(traits %in% names(pheno)) # should be TRUE
```

### Genotypes and annotations from VCF

We will specify ranges in which we wish to look at SNPs for KASP marker design.
Let's look within 100 kb of each significant SNP.

```{r qtlranges}
search_distance <- 1e5
qtl_ranges <- GRanges(yam_qtl$Chromosome,
                      IRanges(start = yam_qtl$Position - search_distance,
                              end = yam_qtl$Position + search_distance))
names(qtl_ranges) <- yam_qtl$Marker_short
qtl_ranges$Trait <- yam_qtl$Trait
```

We will import numeric genotypes just within these ranges.

```{r numgen}
numgen <- getNumGeno(bg, ranges = qtl_ranges)
str(numgen)
```

There are 5684 markers across 173 individuals, and genotypes are coded from
zero to two.  We will change the accession names to match the phenotype spreadsheet.

```{r matchaccessions}
if(all(sub("_", "", colnames(numgen)) %in% pheno$DRS)){
  colnames(numgen) <- sub("_", "", colnames(numgen))
}

all(colnames(numgen) %in% pheno$DRS) # should be TRUE
```

We will also import SNP metadata within these ranges.

```{r importvcf}
myvcf <- readVcf(bg,
                 param = ScanVcfParam(geno = NA, which = qtl_ranges))

rowRanges(myvcf)
```

We can see that the `paramRangeID` column indicates which original marker each
SNP is near.  Since there were some significant SNPs close to each other, that
also means we have some duplicates in both `numgen` and `myvcf`.

```{r dupcheck}
identical(rownames(numgen), names(rowRanges(myvcf)))

as.logical(anyDuplicated(rownames(numgen)))
```

### Unfiltered VCF

In this case we had a much larger VCF with rarer SNPs, so we will import that too.

```{r imporvcf2}
bigvcf <- readRDS(rds)

rowRanges(bigvcf)
```

Since we have quality scores, we will look at the distribution.

```{r qualhist}
hist(rowRanges(bigvcf)$QUAL, xlab = "Quality score",
     main = "Histogram of quality scores in large VCF")
```

This suggests filtering to only keep the highest scores is advisable.
We will also make sure to keep any SNPs that were in our smaller VCF.

```{r filtervcf}
temp <- paste(seqnames(bigvcf), start(bigvcf), sep = "_")

bigvcf <- bigvcf[rowRanges(bigvcf)$QUAL > 900 | 
                temp %in% names(myvcf),]
rm(temp)
```

Lastly, we will filter the VCF to only contain SNPs in our QTL ranges.

```{r subsetvcf}
bigvcf <- subsetByOverlaps(bigvcf, qtl_ranges)
```

## Technical parameters for marker design from significant SNPs

Ideally, we would like to design markers directly from the significant hits.
We should check that they will make reasonably good KASP markers first,
however.

### GC content

PCR will work best when GC content is 40-60%.

```{r gccontent}
yam_qtl$GCcontent <- gcContent(myvcf, yam_qtl$Marker_short, refgenome)

hist(yam_qtl$GCcontent, xlab = "GC content",
     main = "GC content flanking significant SNPs", breaks = 20)
```

Many are below the desired range, so we may see if there are any nearby
SNPs in LD that have better GC content.

### Number of flanking SNPs

A few flanking SNPs are okay, but we want to make sure none of these
have an excessive amount.

```{r flankingsnps}
yam_qtl$Nflanking <- nFlankingSNPs(bigvcf, yam_qtl$Marker_short)

hist(yam_qtl$Nflanking, xlab = "Number of flanking SNPs",
     main = "Number of flanking SNPs within 50 bp of significant SNPs")
table(yam_qtl$Nflanking)
```

For those with three or more, we might see if there are better markers.

## Evaluating nearby markers

### Finding markers in linkage disequilibrium (LD)

Below is a function that uses that information to estimate LD of every SNP
within range with the significant SNP.  We will reorder the results to match
the table of significant SNPs.

```{r ld}
ld <- LD(numgen, myvcf)
ld <- ld[yam_qtl$Marker_short]
```

Let's also extract start positions for the sake of plotting.

```{r grangeslist}
snplist <- split(rowRanges(myvcf), rowRanges(myvcf)$paramRangeID)
snplist <- snplist[yam_qtl$Marker_short]
positions <- start(snplist)
```

```{r plotld}
i <- 1
ggplot(mapping = aes(x = positions[[i]], y = ld[[i]])) +
  geom_point(alpha = 0.3) +
  labs(x = "Position", y = "R-squared",
       title = paste("Linkage disequilibrium with", names(snplist)[i]))
```

The actual SNP of interest shows up as 100% LD in the middle of the range.  There
are a few nearby around 50% LD, which is not great but we might consider those if
the SNP of interest is in a very low GC region, for example.

### Finding markers correlating with the trait

For all markers nearby to our significant SNPs, let's also look at the R-squared
with the corresponding trait.

```{r rsq, warning = FALSE}
phen_rsq <- phenoCorr(numgen, myvcf, yam_qtl$Marker_short, yam_qtl$Trait,
                      pheno) ^ 2
phen_rsq <- phen_rsq[yam_qtl$Marker_short]
```

```{r plotrsq}
ggplot(mapping = aes(x = positions[[i]], y = phen_rsq[[i]],
                     color = ld[[i]])) +
  geom_point(alpha = 0.7) +
  labs(x = "Position", y = "R-squared",
       title = paste("Association with", yam_qtl$Trait[i]),
       color = "LD with hit") +
  scale_color_viridis()
```

## Choosing markers to output

We may want to output a table of the best markers, along with some statistics so that
we can manually choose among them.  Let's take the top ten R-squared values for
association with the trait of interest, and also make sure to get the significant
SNP itself.

```{r top10}
n <- 10 # edit this number if you want to keep a different number of markers

top10 <- lapply(phen_rsq, function(x){
  x1 <- sort(x, decreasing = TRUE)
  if(length(x1) > n){
    x1 <- x1[1:n]
  }
  return(names(x1))
} )

top10tab <- utils::stack(top10)
colnames(top10tab) <- c("SNP_ID", "QTL")

# Add in any QTL that weren't in top 10 associated SNPs
toadd <- setdiff(yam_qtl$Marker_short, top10tab$SNP_ID)
top10tab <- rbind(top10tab,
                  data.frame(SNP_ID = toadd, QTL = toadd))
top10tab <- top10tab[order(factor(top10tab$QTL, levels = yam_qtl$Marker_short)),]
```

Now we'll get the KASP-formatted sequence for these markers.

```{r formatkasp}
outtab <- formatKasp(bigvcf, top10tab$SNP_ID, refgenome)
outtab <- cbind(outtab, top10tab[,"QTL", drop = FALSE])
```

We'll add the trait name for each QTL.

```{r qtltrait}
outtab$Trait <- yam_qtl$Trait[match(outtab$QTL, yam_qtl$Marker_short)]
```

We'll add in linkage disequilibrium and trait association data, and also mark
which allele was positively associated with the trait.

```{r ldrsq, warning = FALSE}
extr <- function(x, y, lst){
  return(lst[[x]][y])
}
outtab$LD_with_QTL <- mapply(extr, outtab$QTL, outtab$SNP_ID, MoreArgs = list(ld))
outtab$R2_with_trait <- mapply(extr, outtab$QTL, outtab$SNP_ID, MoreArgs = list(phen_rsq))

als <- whichAlleleList(numgen, myvcf, yam_qtl$Marker_short, yam_qtl$Trait,
                       pheno)
als <- als[yam_qtl$Marker_short]
outtab$Pos_allele <- mapply(function(x, y, ...) as.character(extr(x, y, ...)),
                            outtab$QTL, outtab$SNP_ID, MoreArgs = list(als))
```

We will also add the GC content and number of flanking SNPs.

```{r gcouttab}
outtab$GC_content <- gcContent(myvcf, outtab$SNP_ID, refgenome)
outtab$N_flanking <- nFlankingSNPs(bigvcf, outtab$SNP_ID)

head(outtab)
```

Now we have a data frame that we can export to a spreadsheet, and we can
manually select SNPs for development as KASP markers.  I recommend selecting
the SNP matching the QTL, plus one or two more SNPs that are similarly
associated with the trait and have as close to optimal GC content as possible.

```{r export}
write.csv(outtab, file = "results/mas_markers.csv", row.names = FALSE)
```