Setup.Rmd

---
title: "Setup for marker design"
author: Lindsay Clark, HPCBio, Roy J. Carver Biotechnology Center, University of Illinois,
  Urbana-Champaign
output:
  html_document:
    df_print: paged
---

```{r include = FALSE}
options(Biostrings.coloring = FALSE)
```

This document contains setup steps that you should only have to do once.
If you haven't already, you'll need to clone this repository to your computer
and open it as an RStudio project as described in the [README](README.md).

## Packages to install

Certain Bioconductor and CRAN packages must be installed for these tutorials to work.
They can be installed with the following code.  It is best to run this
installation code as soon as R is opened, before doing any other work.

```{r eval = FALSE}
install.packages(c("tidyverse", "ape", "adegenet", "viridis"))
BiocManager::install(c("VariantAnnotation", "Rsamtools"))
```

If you are asked to update packages, I recommend answering "a" for "all"
just to make sure everything is up-to-date.

## Data

For convenience, this repository has a `data` folder where you can put
your VCF, reference FASTA, and any other input files.  You don't have
to put them here, but if they are somewhere else you will need to
specify the full path to them.

In this case I have three VCFs for two different projects, but you might just
have one.

```{r}
vcf1 <- "data/331_new_data_vcf_IBRC.vcf"
vcf2 <- "data/unique_173_clones.recode.vcf"
vcf3 <- "data/yam336.m2M2vsnps_missing0.9.recode.vcf"
genfasta <- "data/TDr96_F1_v2_PseudoChromosome.rev07.fasta"
```

## Building indices

An index is a type of file that enables bioinformatics software to quickly
find any genomic region of interest within a file.  For example, if you
needed the sequence from a region of chromosome 5, an index of your reference
FASTA file will help the software to jump directly to that region, rather
than having to read all of chromosomes 1 through 4 first.

```{r message = FALSE, warning = FALSE}
library(VariantAnnotation)
library(Rsamtools)
library(dplyr)
```

First we'll index the reference genome.

```{r eval = FALSE}
indexFa(genfasta)
```

You should now have a file called `r paste0(genfasta, ".fai")`.
If this worked, you'll be able to load the reference genome into R:

```{r}
refgenome <- FaFile(genfasta)
seqinfo(refgenome)
```

Next, we'll index the VCFs.  We will also compress them first into a special
format for Bioconductor/Samtools.

```{r eval = FALSE}
bg1 <- bgzip(vcf1)
bg2 <- bgzip(vcf2)
indexTabix(bg1, format = "vcf")
indexTabix(bg2, format = "vcf")
```

The code below gets those file names back without redoing the compression.

```{r}
bg1 <- paste0(vcf1, ".bgz")
bg2 <- paste0(vcf2, ".bgz")
```

You should see ".bgz" and ".bgz.tbi" files in your data folder now.
If your compressing and indexing worked, you should be able to preview the VCF
now.

```{r}
hdr1 <- scanVcfHeader(bg1)
hdr1
hdr2 <- scanVcfHeader(bg2)
hdr2
```

## Importing SNP metadata only from a larger VCF

In some cases, you may have a much larger, unfiltered VCF.  Although you don't
plan to mine markers from the VCF, you still need to know the location and
alleles of those SNPs so they can be annotated in the flanking sequence that you
will use for KASP design. The script `snp_positions_from_vcf.R` reads the VCF
and saves the SNP metadata to an R object.  The script is designed to be run
on a bioinformatics server or cluster from the terminal like so:

``` bash
Rscript snp_positions_from_vcf.R --args yam336.m2M2vsnps_missing0.9.recode.vcf yam336.m2M2vsnps_missing0.9.recode.rds
```

After successfully running it, you can download the `.rds` file to your `data`
directory.

## Quality control

There are a few things you might want to check over in your VCF. We'll run through
a similar set of checks for the two files, but they will differ slightly depending
on how the files were generated.

### TASSEL

Let's look at the first 50 sample names.  These should match the samples that
were in your study.

```{r}
samples(hdr1) %>% head(50)
```

You can also take a look at the variants themselves.  Here we'll read the
SNP metadata without reading the genotypes.

```{r}
vcf1 <- readVcf(bg1, genome = genfasta,
               param = ScanVcfParam(geno = NA))
rowRanges(vcf1)
```

In the `seqnames` column, we see chromosome names that have been shortened
by TASSEL.  These don't match the names in `seqinfo(refgenome)`, so we'll
have to keep that in mind.  There are functions in the `src` directory to help
with name conversion.  If your VCF is different and you need to do some
conversion other than "OM" to "chrom", you might modify the `fixTasselNames`
function.

```{r}
source("src/marker_stats.R")

ranges1a <- rowRanges_correctedSeqnames(vcf1)
ranges1a
```

Next, we will use the DNA sequence to make sure that the reference genome
matches the VCF.  First, we will retrieve the nucleotide at each SNP position
from the reference genome.

```{r}
refcheck1 <- scanFa(refgenome, ranges1a)
refcheck1
```

Does that match the reference allele in the VCF?

```{r}
mean(refcheck1 == rowRanges(vcf1)$REF)
```

It only matches 85% of the time.  However, that is because TASSEL sometimes
lists the common allele as the reference allele, rather than the allele that
matches the reference.  So, we'll also check the alternative allele.

```{r}
mean(rowRanges(vcf1)$REF == refcheck1 | unlist(rowRanges(vcf1)$ALT) == refcheck1)
```

Now we see a 100% match, so everything looks good.

### Samtools

Let's look at the first 10 sample names.  In this case they contain the full
path to the BAM file, so we might want to keep that in mind if we need to
look up genotypes by sample later.

```{r}
samples(hdr2) %>% head(10)
```

You can also take a look at the variants themselves.  Here we'll read the
SNP metadata without reading the genotypes.

```{r}
vcf2 <- readVcf(bg2, genome = genfasta,
               param = ScanVcfParam(geno = NA))
rowRanges(vcf2)
```

We can see if the reference allele is what we would expect, given the sequence
from our reference genome.

```{r}
refcheck2 <- scanFa(refgenome, rowRanges(vcf2))
mean(rowRanges(vcf2)$REF == refcheck2) # this should return 1
```

The reference allele matches the reference genome 100% of the time.

We can check with our larger VCF as well.

```{r}
vcf3 <- readRDS(sub("vcf$", "rds", vcf3))

refcheck3 <- scanFa(refgenome, rowRanges(vcf3))
mean(rowRanges(vcf3)$REF == refcheck3) # this should return 1
```