Pedigree_verification.Rmd

---
title: "Selecting markers for pedigree verification"
author: Lindsay Clark, HPCBio, Roy J. Carver Biotechnology Center, University of Illinois,
  Urbana-Champaign
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
```

In this workflow, we'll identify markers useful for identifying accessions and
verifying pedigrees, and export those markers for KASP marker design.

## Setup

Be sure you have completed the [setup](Setup.md) steps before working through
this pipeline.  First we'll load all the needed packages and functions.

```{r libs, message = FALSE, warning = FALSE}
library(VariantAnnotation)
library(Rsamtools)
library(adegenet)
library(ggplot2)
library(ape)
library(dplyr)
source("src/marker_stats.R")
source("src/getNumGeno.R")
source("src/evaluate_marker_sets.R")
```

Next we'll point to our VCF and reference genome files. Edit these lines to
point to your dataset.

```{r files}
bg <- "data/331_new_data_vcf_IBRC.vcf.bgz"
rds <- "data/yam336.m2M2vsnps_missing0.9.recode.rds" # unfiltered markers; optional
refgenome <- FaFile("data/TDr96_F1_v2_PseudoChromosome.rev07.fasta")
```

## Population statistics

We'll import all the genotypes in numeric format, indicating number of copies
of the alternative allele.

```{r numeric_gen}
numgen <- getNumGeno(bg)
numgen[1:10,1:10]
```

### Check for clonal duplicates

It is good to check that there aren't any clonal duplicates in the dataset, as
these could bias downstream analysis.  We'll calculate identity-by-state
and look at its distribution.

```{r echo = FALSE}
set.seed(1019) # to make report reproducible
```

```{r inddist}
myIBS <- interIndividualIBS(numgen)

hist(myIBS, xlab = "Identity-by-state", main = "Histogram of IBS")
```

```{r njtree, fig.height = 6, fig.width = 6}
plot(nj(1 - myIBS), type = "unrooted", show.tip.label = FALSE)
```

Here it looks like the most closely related individuals are family members, not
clones.  If there were some values closer to one, however, clones could be
removed using the `removeClones` function.

```{r removeclones}
numgen <- removeClones(numgen, myIBS)
```

### Observed and expected heterozygosity

Let's look at the distribution of allele frequencies in this dataset.

```{r alfreq}
alfreq <- rowMeans(numgen, na.rm = TRUE) / 2
hist(alfreq, xlab = "Allele frequency", main = "Histogram of alternative allele frequency")
```

We can look at the relationship between observed and expected heterozygosity
( _Ho_ and _He_, respectively).

```{r hohe}
He <- Expected_het(alfreq)
Ho <- rowMeans(numgen == 1, na.rm = TRUE)
hist(Ho / He)
```

In this case the mode is a little above 1, suggesting the presence of hybrid
lines and/or polyploids in the dataset.  The bump at 2 indicates likely paralogs.
In this case we'll set a cutoff at 1.5.

```{r hohe_vs_alfreq}
hohe_cutoff <- 1.5

ggplot(mapping = aes(x = alfreq, y = Ho/He)) +
  geom_point(alpha = 0.05) +
  geom_hline(yintercept = hohe_cutoff, color = "blue") +
  labs(x = "Allele frequency") +
  ggtitle("Filtering for putative paralogs")
```

We can see what proportion of markers we would lose at this cutoff.  It's about 11%.

```{r mean_hohe}
mean(Ho/He > hohe_cutoff)
```

We'll do that filtering now, then start a data frame to track statistics about each SNP.

```{r filter_hohe}
keep1 <- which(Ho/He < hohe_cutoff)

numgen <- numgen[keep1,]
snpdf <- data.frame(row.names = rownames(numgen),
                    AlFreq = alfreq[keep1],
                    Ho = Ho[keep1],
                    He = He[keep1])
```

In this case there are no missing data, but I am going to add that calculation
in for the sake of reusing the workflow.

```{r missrate}
snpdf$MissingRate <- rowMeans(is.na(numgen))
summary(snpdf)
```

We can also look at the heterozygosity across individuals.  The two outlier
individuals we see below are likely to be polyploids or inter-species hybrids.
We'll filter them out.

```{r filter_ind}
hetByInd <- colMeans(numgen == 1, na.rm = TRUE)
hist(hetByInd)

numgen <- numgen[, hetByInd < 0.5]
```

## Genetic groups

We could simply choose markers with high allele frequency at this point, but
instead I am going to identify genetic groups first.  That way, I can test the
allele frequency within each genetic group.  We generally want a variety of
markers so that we have some that are variable within each group, and some
that can distinguish groups.  We'll use the Discriminant Analysis of Principal
Components (DAPC) protocol implemented in the `adegenet` package.  In order
to knit this document I have to set `n.pca` and `n.da` manually, but for
your work I recommend a first pass where they are left as `NULL` so that you
can chose them based on where the graphs plateau.

```{r echo = FALSE}
set.seed(1016) # to ensure report reproducibility
```

```{r findclusters, eval = FALSE}
myclust <- find.clusters(t(numgen), n.pca = 100)
```

```{r loadclust, echo = FALSE}
load("results/myclust.RData")
plot(myclust$Kstat, xlab = "Number of clusters", ylab = "BIC",
     main = "Value of BIC vs. number of clusters", type = "o", col = "blue")
```

Based on the BIC graph, we'll select ten clusters.

Check the size of the groups.  If it is very uneven, you may want fewer groups.

```{r clustsize}
myclust$size
```

Now perform DAPC.

```{r dapc}
mydapc <- dapc(t(numgen), grp = myclust$grp, scale = TRUE, n.pca = 50, n.da = 2)
```

```{r dapcscatter}
scatter(mydapc, posi.da = "topright")
```

Group assignments are in `mydapc$assign`.  You many want to inspect some of the
more differentiated groups to see if they are biologically meaningful.

```{r checkgroups}
colnames(numgen)[mydapc$assign == 9]
colnames(numgen)[mydapc$assign == 4]
```

Now we can get allele frequencies within each genetic group.

```{r freq_by_group}
grp_alfreq <- sapply(levels(mydapc$assign),
                     function(x) rowMeans(numgen[, mydapc$assign == x], na.rm = TRUE) / 2)
colnames(grp_alfreq) <- paste0("AlFreq_Grp", colnames(grp_alfreq))
head(grp_alfreq)
```

## Technical parameters

For this portion of the workflow, we'll import SNP metadata using the
`VariantAnnotation` package.  If you are just working with one VCF,
use the following code:

```{r importvcf, eval = FALSE}
myvcf <- readVcf(bg,
                 param = ScanVcfParam(geno = NA))
```

Otherwise, if you have a larger VCF, with a bigger set of markers,
read it from the RDS made during setup:

```{r imporvcf2}
myvcf <- readRDS(rds)

rowRanges(myvcf)
```

Since we have quality scores, we will look at the distribution.

```{r qualhist}
hist(rowRanges(myvcf)$QUAL, xlab = "Quality score",
     main = "Histogram of quality scores in large VCF")
```

This suggests filtering to only keep the highest scores is advisable.
We will also make sure to keep any SNPs that were in our smaller VCF.

```{r filtervcf}
temp <- paste(seqnames(myvcf), start(myvcf), sep = "_")

myvcf <- myvcf[rowRanges(myvcf)$QUAL > 900 | 
                 temp %in% rownames(snpdf),]
```

### GC content

PCR tends to work best in regions that are 40-60% GC.  We will test
the GC content for the 50-bp flanking region for each SNP in our
table.

```{r gccontent}
snpdf$GCcontent <- gcContent(myvcf, rownames(snpdf), refgenome)

hist(snpdf$GCcontent, xlab = "GC content", main = "Histogram of GC content")

mean(snpdf$GCcontent >= 0.4 & snpdf$GCcontent <= 0.6)
```

About 36.5% of markers are within the target range for GC content.
We can subset to just look at these.

```{r gcsubset}
snpdf2 <- filter(snpdf, GCcontent >= 0.4, GCcontent <= 0.6)
```

### Number of flanking SNPs

Although we'll annotate flanking SNPs, the fewer there are, the less likely
the marker will have technical problems.

```{r nflank}
snpdf2$Nflanking <- nFlankingSNPs(myvcf, rownames(snpdf2))

hist(snpdf2$Nflanking, xlab = "Number of flanking SNPs",
     main = "Histogram of number of flanking SNPs")
table(snpdf2$Nflanking)
```

The cutoff is arbitrary, but let's keep markers with two or fewer flanking SNPs.

```{r flankfilt}
snpdf3 <- filter(snpdf2, Nflanking <= 2)
```

## Choosing a set of markers

Since we are selecting a small set of markers, we want to not only make sure
that each individual marker has a high minor allele frequency to maximize its
information content, but also that the markers complement each other well so
that they can distinguish individuals and genetic groups.

Two approaches are presented here.  One is a Galaxy tool developed by Carlos
Ignacio for finding sets of markers that distinguish all accessions.  The
second is an algorithm I developed to maximize diversity captured within and
between genetic groups.

### Galaxy tool

First, we have to export our set of suitable markers back to VCF.  Some conversion
of marker and chromosome names is done here in order to get everything to match.

```{r subvcf, eval = FALSE}
rr3 <- rowRanges(myvcf)[match(rownames(snpdf3),
                              paste(seqnames(myvcf), start(myvcf), sep = "_"))]
rr3a <- GRanges(sub("chrom", "OM", seqnames(rr3)), ranges(rr3))
subvcf <- readVcf(bg, genome = seqinfo(rr3a),
                  param = ScanVcfParam(which = rr3a))

writeVcf(subvcf, filename = "results/marker_subset.vcf")
```

Now, on Galaxy, use the
[Purity tool](http://cropgalaxy.excellenceinbreeding.org/root?tool_id=purity_beta)
on that VCF file.  I set it to select 50 markers after considering 10,000
solutions, and left the distance cutoff at 0.05.  Two files were output, and I
downloaded the first to my `results` folder.  Below I will import it to get
the marker set.

```{r}
markers_purity <- read.delim("results/Galaxy9-[Purity_(beta)_on_data_8].tabular")$Name[1:50]

sort(markers_purity)
```

### Simulated annealing algorithm

In the file [evaluate_marker_sets.R](src/evaluate_marker_sets.R) there is
a function to use a
[simulated annealing algorithm](https://en.wikipedia.org/wiki/Simulated_annealing)
to find optimal sets of markers that capture diversity within and between
genetic groups.

```{r echo = FALSE}
set.seed(1020) # for report reproducibility
```

```{r choosemarkers, message = FALSE}
markers_simanneal <- findMarkerSet(grp_alfreq[rownames(snpdf3),], nSNP = 50)$Set

sort(markers_simanneal)
```

### Comparing the results

How well do the markers distinguish individuals in the dataset?

```{r markerset_dist}
ibs_purity <- interIndividualIBS(numgen[markers_purity,])

hist(ibs_purity, xlab = "Identity-by-state",
     main = "Histogram of IBS using Purity marker subset")

mean(ibs_purity == 1)
propUnique(numgen[markers_purity,])

ibs_simanneal <- interIndividualIBS(numgen[markers_simanneal,])

hist(ibs_simanneal, xlab = "Identity-by-state",
     main = "Histogram of IBS using simulated annealing marker subset")

mean(ibs_simanneal == 1)
propUnique(numgen[markers_simanneal,])
```

If two individuals were selected at random, there is a 0.2% chance that they
would have identical genotypes across all markers using the simulated annealing
algorithm, as opposed to a 0.01% chance using markers from the Purity algorithm.
This corresponds to 75% of the individuals being unique with the simulated 
annealing set and 97% of individuals being unique with the Purity set.
So, the Purity algorithm is somewhat better for distinguishing individuals.

We can also compare the geometric mean of expected heterozygosity in all of the
populations.

```{r markerset_div}
DivScore(grp_alfreq[markers_purity,])
DivScore(grp_alfreq[markers_simanneal,])
```

The simulated annealing algorithm gives higher expected heterozygosity.

We can also see the geometric mean of Jost's D, a differentiation statistic
among populations.

```{r  markerset_diff}
DiffScore(grp_alfreq[markers_purity,])
DiffScore(grp_alfreq[markers_simanneal,])
```

The simulated annealing algorithm gives higher differentiation among populations.

## Exporting markers

Now we can get the flanking sequences for the markers that we selected.

```{r formatkasp}
markerseq1 <- formatKasp(myvcf, markers_purity, refgenome)
markerseq2 <- formatKasp(myvcf, markers_simanneal, refgenome)

head(markerseq1)
```

These can be exported to a spreadsheet.

```{r export}
write.csv(markerseq1, file = "results/pedigree_verification_markers_Purity.csv",
          row.names = FALSE)
write.csv(markerseq2, file = "results/pedigree_verification_markers_simanneal.csv",
          row.names = FALSE)
```

## Curve to determine optimal number of markers

An additional analysis was performed where the number of markers in the dataset was
varied, in order to determine the optimal number of markers.

```{r}
markers_purity10 <- read.delim("results/Galaxy11-[Purity_(beta)_on_data_8].tabular")$Name[1:10]
markers_purity20 <- read.delim("results/Galaxy13-[Purity_(beta)_on_data_8].tabular")$Name[1:20]
markers_purity30 <- read.delim("results/Galaxy15-[Purity_(beta)_on_data_8].tabular")$Name[1:30]
markers_purity40 <- read.delim("results/Galaxy17-[Purity_(beta)_on_data_8].tabular")$Name[1:40]
markers_purity75 <- read.delim("results/Galaxy19-[Purity_(beta)_on_data_8].tabular")$Name[1:75]
markers_purity100 <- read.delim("results/Galaxy21-[Purity_(beta)_on_data_8].tabular")$Name[1:100]

markers_purity_list <- list(markers_purity10, markers_purity20, markers_purity30, markers_purity40,
                            markers_purity, markers_purity75, markers_purity100)

curve_df_purity <- data.frame(N_markers = lengths(markers_purity_list),
                              Prop_unique = sapply(markers_purity_list,
                                                   function(x) propUnique(numgen[x,])),
                              DiffScore = sapply(markers_purity_list,
                                                    function(x) DiffScore(grp_alfreq[x,])),
                              DivScore = sapply(markers_purity_list,
                                                    function(x) DivScore(grp_alfreq[x,])))
```

```{r eval = FALSE}
set.seed(1117)
markers_simanneal_list <- lapply(c(10, 20, 30, 40, 50, 75, 100),
                                 function(x) findMarkerSet(grp_alfreq[rownames(snpdf3),], nSNP = x)$Set)

curve_df_simanneal <- data.frame(N_markers = lengths(markers_simanneal_list),
                                 Prop_unique = sapply(markers_simanneal_list,
                                                   function(x) propUnique(numgen[x,])),
                                 DiffScore = sapply(markers_simanneal_list,
                                                    function(x) DiffScore(grp_alfreq[x,])),
                                 DivScore = sapply(markers_simanneal_list,
                                                    function(x) DivScore(grp_alfreq[x,])))
save(markers_simanneal_list, curve_df_simanneal, file = "results/simanneal_curve.RData")
```

```{r echo = FALSE}
load("results/simanneal_curve.RData")
```

```{r}
curve_df_purity$Method <- "Purity"
curve_df_simanneal$Method <- "Simulated annealing"
curve_df_comb <- rbind(curve_df_purity, curve_df_simanneal)

ggplot(curve_df_comb, aes(x = N_markers, y = Prop_unique, color = Method)) +
  geom_line()
```

With 75 or 100 markers, all accessions could be distinguished using markers
selected with the Purity method.  The simulated annealing method still only
distinguished 88% of accessions with 100 markers, as it was trying to optimize
diversity rather than identification of accessions.

```{r}
ggplot(curve_df_comb, aes(x = N_markers, y = DiffScore, color = Method)) +
  geom_line()
```

The simulated annealing algorithm was better than the Purity method at
differentiating populations.  The number of markers included did not
have much of an impact on either method.

```{r}
ggplot(curve_df_comb, aes(x = N_markers, y = DivScore, color = Method)) +
  geom_line()
```

As the number of markers increased, the Purity method captured more diversity
within populations, although not as much as the simulated annealing method.