-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathSetup.Rmd
220 lines (165 loc) · 6.29 KB
/
Setup.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
---
title: "Setup for marker design"
author: Lindsay Clark, HPCBio, Roy J. Carver Biotechnology Center, University of Illinois,
Urbana-Champaign
output:
html_document:
df_print: paged
---
```{r include = FALSE}
options(Biostrings.coloring = FALSE)
```
This document contains setup steps that you should only have to do once.
If you haven't already, you'll need to clone this repository to your computer
and open it as an RStudio project as described in the [README](README.md).
## Packages to install
Certain Bioconductor and CRAN packages must be installed for these tutorials to work.
They can be installed with the following code. It is best to run this
installation code as soon as R is opened, before doing any other work.
```{r eval = FALSE}
install.packages(c("tidyverse", "ape", "adegenet", "viridis"))
BiocManager::install(c("VariantAnnotation", "Rsamtools"))
```
If you are asked to update packages, I recommend answering "a" for "all"
just to make sure everything is up-to-date.
## Data
For convenience, this repository has a `data` folder where you can put
your VCF, reference FASTA, and any other input files. You don't have
to put them here, but if they are somewhere else you will need to
specify the full path to them.
In this case I have three VCFs for two different projects, but you might just
have one.
```{r}
vcf1 <- "data/331_new_data_vcf_IBRC.vcf"
vcf2 <- "data/unique_173_clones.recode.vcf"
vcf3 <- "data/yam336.m2M2vsnps_missing0.9.recode.vcf"
genfasta <- "data/TDr96_F1_v2_PseudoChromosome.rev07.fasta"
```
## Building indices
An index is a type of file that enables bioinformatics software to quickly
find any genomic region of interest within a file. For example, if you
needed the sequence from a region of chromosome 5, an index of your reference
FASTA file will help the software to jump directly to that region, rather
than having to read all of chromosomes 1 through 4 first.
```{r message = FALSE, warning = FALSE}
library(VariantAnnotation)
library(Rsamtools)
library(dplyr)
```
First we'll index the reference genome.
```{r eval = FALSE}
indexFa(genfasta)
```
You should now have a file called `r paste0(genfasta, ".fai")`.
If this worked, you'll be able to load the reference genome into R:
```{r}
refgenome <- FaFile(genfasta)
seqinfo(refgenome)
```
Next, we'll index the VCFs. We will also compress them first into a special
format for Bioconductor/Samtools.
```{r eval = FALSE}
bg1 <- bgzip(vcf1)
bg2 <- bgzip(vcf2)
indexTabix(bg1, format = "vcf")
indexTabix(bg2, format = "vcf")
```
The code below gets those file names back without redoing the compression.
```{r}
bg1 <- paste0(vcf1, ".bgz")
bg2 <- paste0(vcf2, ".bgz")
```
You should see ".bgz" and ".bgz.tbi" files in your data folder now.
If your compressing and indexing worked, you should be able to preview the VCF
now.
```{r}
hdr1 <- scanVcfHeader(bg1)
hdr1
hdr2 <- scanVcfHeader(bg2)
hdr2
```
## Importing SNP metadata only from a larger VCF
In some cases, you may have a much larger, unfiltered VCF. Although you don't
plan to mine markers from the VCF, you still need to know the location and
alleles of those SNPs so they can be annotated in the flanking sequence that you
will use for KASP design. The script `snp_positions_from_vcf.R` reads the VCF
and saves the SNP metadata to an R object. The script is designed to be run
on a bioinformatics server or cluster from the terminal like so:
``` bash
Rscript snp_positions_from_vcf.R --args yam336.m2M2vsnps_missing0.9.recode.vcf yam336.m2M2vsnps_missing0.9.recode.rds
```
After successfully running it, you can download the `.rds` file to your `data`
directory.
## Quality control
There are a few things you might want to check over in your VCF. We'll run through
a similar set of checks for the two files, but they will differ slightly depending
on how the files were generated.
### TASSEL
Let's look at the first 50 sample names. These should match the samples that
were in your study.
```{r}
samples(hdr1) %>% head(50)
```
You can also take a look at the variants themselves. Here we'll read the
SNP metadata without reading the genotypes.
```{r}
vcf1 <- readVcf(bg1, genome = genfasta,
param = ScanVcfParam(geno = NA))
rowRanges(vcf1)
```
In the `seqnames` column, we see chromosome names that have been shortened
by TASSEL. These don't match the names in `seqinfo(refgenome)`, so we'll
have to keep that in mind. There are functions in the `src` directory to help
with name conversion. If your VCF is different and you need to do some
conversion other than "OM" to "chrom", you might modify the `fixTasselNames`
function.
```{r}
source("src/marker_stats.R")
ranges1a <- rowRanges_correctedSeqnames(vcf1)
ranges1a
```
Next, we will use the DNA sequence to make sure that the reference genome
matches the VCF. First, we will retrieve the nucleotide at each SNP position
from the reference genome.
```{r}
refcheck1 <- scanFa(refgenome, ranges1a)
refcheck1
```
Does that match the reference allele in the VCF?
```{r}
mean(refcheck1 == rowRanges(vcf1)$REF)
```
It only matches 85% of the time. However, that is because TASSEL sometimes
lists the common allele as the reference allele, rather than the allele that
matches the reference. So, we'll also check the alternative allele.
```{r}
mean(rowRanges(vcf1)$REF == refcheck1 | unlist(rowRanges(vcf1)$ALT) == refcheck1)
```
Now we see a 100% match, so everything looks good.
### Samtools
Let's look at the first 10 sample names. In this case they contain the full
path to the BAM file, so we might want to keep that in mind if we need to
look up genotypes by sample later.
```{r}
samples(hdr2) %>% head(10)
```
You can also take a look at the variants themselves. Here we'll read the
SNP metadata without reading the genotypes.
```{r}
vcf2 <- readVcf(bg2, genome = genfasta,
param = ScanVcfParam(geno = NA))
rowRanges(vcf2)
```
We can see if the reference allele is what we would expect, given the sequence
from our reference genome.
```{r}
refcheck2 <- scanFa(refgenome, rowRanges(vcf2))
mean(rowRanges(vcf2)$REF == refcheck2) # this should return 1
```
The reference allele matches the reference genome 100% of the time.
We can check with our larger VCF as well.
```{r}
vcf3 <- readRDS(sub("vcf$", "rds", vcf3))
refcheck3 <- scanFa(refgenome, rowRanges(vcf3))
mean(rowRanges(vcf3)$REF == refcheck3) # this should return 1
```