forked from IsoldevR/R_session_PopGen
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathR_PopGen.Rmd
537 lines (373 loc) · 16 KB
/
R_PopGen.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
---
title: "R Session Population Genetics"
output:
html_document:
df_print: paged
editor_options:
chunk_output_type: inline
---
Install the following packages.
For the sequence data manipulation and missingness plotting:
```{r, message=FALSE, warning=FALSE}
library(seqinr)
library(gplots)
library(RColorBrewer)
library(graphics)
```
For the population structure visualisation in pophelper:
```{r, message=FALSE, warning=FALSE}
library(Cairo)
library(ggplot2)
library(gridExtra)
library(gtable)
library(tidyr)
library(pophelper)
```
For the population genetics statistics:
```{r, message=FALSE, warning=FALSE}
library(adegenet)
library(pegas)
library(poppr)
library(hierfstat)
```
# FILE FORMATS
## FASTA (`seqinr`)
With the package `seqinr` you can read in a fasta file.
Below you see an example where I used an aligned fasta file with triplicated samples (library in triplicate, treated as separate samples during assembly). I want to know whether these sequences have differences between replicates of individuals, or whether differences could be attritbuted to missing data.
First we read in the data:
```{r}
original<-read.fasta("RAD_Bufo_replicated_individuals.fasta", as.string=TRUE, set.attributes = FALSE)
```
Now I am making a for loop to look at differences caused by missing data (- and n), or by alternative genotype calling (c, t, g, a or IUPAC codes).
First I make empty matrices to fill during the for loop.
```{r}
Out_Perc_diff<-matrix(nrow=length(original), ncol=0)
Out_Perc_no_worries<-matrix(nrow=length(original), ncol=0)
```
Then I loop trough the fasta file.
```{r}
for (i in 1:length(original)){
#Make empty vectors for second loop
col_diff<-c()
col_no_worries<-c()
Seq_1<-getSequence(original[[i]])
for (j in 1:length(original)){
#Get raw percentage of difference between sequences, includes ALL differences
Seq_2<-getSequence(original[[j]])
Same<-sum(Seq_1==Seq_2, na.rm=TRUE) #All positions which are equal
Diff<-sum(!Seq_1==Seq_2, na.rm=TRUE) #All positions which are different
stopifnot((Same+Diff)==length(Seq_1)) #Check
Perc_diff<-(Diff/length(Seq_1))
#Get percentage of difference because of N
Seq_1_N<-Seq_1=="n"
Seq_2_N<-Seq_2=="n"
#Get percentage of difference because of -
Seq_1_I<-Seq_1=="-"
Seq_2_I<-Seq_2=="-"
#Get the exceptions, the differences we are not afraid of (missing data: N and -)
#The - (indel) in paired end is very common when not sequencing the whole selected fragment
Diff_no_w_1<-sum(!Seq_1_N==Seq_2_N, na.rm=TRUE)
Diff_no_w_2<-sum(!Seq_1_I==Seq_2_I, na.rm=TRUE)
Diff_no_w_3<-sum(Seq_1_N==TRUE & Seq_1_N==Seq_2_I | Seq_1_I==TRUE & Seq_1_I==Seq_2_N, na.rm=TRUE)
Perc_no_worries<-(((Diff_no_w_1+Diff_no_w_2)-Diff_no_w_3)/length(Seq_1))
#Add the numbers to make a column
col_diff<-c(col_diff, Perc_diff)
col_no_worries<-c(col_no_worries, Perc_no_worries)
}
#Add the columns to the matrix
Out_Perc_diff<-cbind(Out_Perc_diff, col_diff)
Out_Perc_no_worries<-cbind(Out_Perc_no_worries, col_no_worries)
}
```
Then I add rownames to the matrices, mainly to be able to produce figures.
```{r}
rownames(Out_Perc_diff)<-names(original)
colnames(Out_Perc_diff)<-names(original)
rownames(Out_Perc_no_worries)<-names(original)
colnames(Out_Perc_no_worries)<-names(original)
```
Then I calculate what we came for: the actual number of 'wrongly' called SNPs!
```{r}
Diff_worry<-Out_Perc_diff-Out_Perc_no_worries
```
I order them in an order that I prefer, as I want to use this for visualisation.
```{r}
All_diff<-c(Diff_worry['1572_A','1572_B'], Diff_worry['1572_B','1572_C'],Diff_worry['1572_A','1572_C'],
Diff_worry['1573_A','1573_B'], Diff_worry['1573_B','1573_C'],Diff_worry['1573_A','1573_C'],
Diff_worry['1574_A','1574_B'], Diff_worry['1574_B','1574_C'],Diff_worry['1574_A','1574_C'],
Diff_worry['4112_A','4112_B'], Diff_worry['4112_B','4112_C'],Diff_worry['4112_A','4112_C'],
Diff_worry['4113_A','4113_B'], Diff_worry['4113_B','4113_C'],Diff_worry['4113_A','4113_C'],
Diff_worry['4114_A','4114_B'], Diff_worry['4114_B','4114_C'],Diff_worry['4114_A','4114_C']
)
```
And then I take the mean of all the differences that worried me!
```{r}
mean(All_diff)
```
Good, we can see that only a very small portion (<1%) of the differences between the replicates of individuals is due to an actually different nucleotide.
Now I want to visualise this with "heat tables" to compare between the influence of real mistakes vs missing data.
First, I made a custom palette to colour my "heat table".
```{r}
my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)
```
And I define the color breaks manually for a "skewed" color transition, because I know some values are not included.
```{r}
col_breaks = c(seq(0,.3,length=100), # for green
seq(0.3001,0.6,length=100), # for yellow
seq(.6001,.9,length=100)) # for red
```
And I make Figure 1: to only see the influence of missing data.
First I need to set the dimensions of the "image" I am creating.
```{r}
dim1 <- ncol(Out_Perc_no_worries)
dim2 <- ncol(Diff_worry)
```
Then I make the actual image, give it axes, and fill it with the numbers I calculated above.
```{r}
image(1:dim1, 1:dim1, Out_Perc_no_worries, axes = FALSE, xlab="", ylab="", col=my_palette, breaks=col_breaks)+
axis(1, 1:dim1, colnames(Out_Perc_no_worries), cex.axis = 0.7, las=3)+
axis(2, 1:dim1, colnames(Out_Perc_no_worries), cex.axis = 0.7, las=1)+
text(expand.grid(1:dim1, 1:dim1), sprintf("%0.2f", Out_Perc_no_worries), cex=0.54)
```
# Question
- Why do we observe the two square blocks in the table?
Now we also want to see this figure for the "really concerning" differences between the sequences.
```{r}
image(1:dim2, 1:dim2, Diff_worry, axes = FALSE, xlab="", ylab="", col=my_palette, breaks=col_breaks)+
axis(1, 1:dim2, colnames(Diff_worry), cex.axis = 0.7, las=3)+
axis(2, 1:dim2, colnames(Diff_worry), cex.axis = 0.7, las=1)+
text(expand.grid(1:dim2, 1:dim2), sprintf("%0.2f", Diff_worry), cex=0.54)
```
Cool! Compared to figure 1, figure 2 shows really less differences between the replicates between individuals.
There are other packages available, which you can use to explore your data, also for other file formats. We are not going trough all of them, but here are a few examples of packages you can use.
## VCF
### `vcfR`
To read in and manipulate a vcf file, one can use the R package vcfR:
https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html
## BAM/SAM
### `scanBam`
To import, count, index, filter, sort, and merge BAM files:
https://rdrr.io/bioc/Rsamtools/man/scanBam.html
### `Rsamtools`
To read in and manipulate a BAM/SAM file, one can use the R package Rsamtools:
https://kasperdanielhansen.github.io/genbioconductor/html/Rsamtools.html
# Data filtering
It is nice to be able to filter your data manually. This allows you to understand better what you are working with. Keep in mind that with large SNP data this is probably not possible. We here will only discuss what to do with smaller datasets. For larger datasets we may set up another session!
Examples of SNPs that you might want to filter out are singletons. Or you might want to work with only the di-allelic SNPs in your dataset, because many softwares only utilise di-allelic SNPs. In my case, I wanted to see which SNPs were present as one variant in one of my species, and as another in another species. With these SNPs I would be able to analyse the hybrids between both species. In the next code, I filtered my data for such 'species specific' SNPs based on my assumed 'pure' populations.
![](Fig1_Maps.png)
First I read in my SNPs.
```{r}
SNPtable<-as.data.frame(read.table("RAD_Bufo_SNPtable.txt", header=TRUE))
knitr::kable(SNPtable[1:5,1:5])
```
Then I remove columns where one species only has NA (possibly null aleles). I used the maximum amount of individuals with "NA" by using "<=54" and "<=30".
```{r}
SNPtable_1<-SNPtable[,colSums(is.na(SNPtable[1:74,])) <= 54]
SNPtable_2<-SNPtable_1[,colSums(is.na(SNPtable_1[75:114,])) <= 30]
```
Then I check my dimensions.
```{r}
nrow(SNPtable_2)
```
```{r}
ncol(SNPtable_2)
```
And I split the table up in two tables for each species that I have: B_bufo and B_spino.
```{r}
B_bufo<-SNPtable_2[1:74,]
B_spino<-SNPtable_2[75:114,]
```
One final check, are the first columns the right populations?
```{r}
knitr::kable(unique(B_bufo[,1]))
```
```{r}
knitr::kable(unique(B_spino[,1]))
```
Now I can start comparing the two tables
Again I prepare empty matrices, but I do not know how many SNPs of each type I will get.
```{r}
Not_diagnostic_old<-matrix(NA,nrow=nrow(SNPtable_2), ncol=0)
Diagnostic_old<-matrix(NA,nrow=nrow(SNPtable_2),ncol=0)
Not_diagnostic<-matrix(NA,nrow=nrow(SNPtable_2), ncol=0)
Diagnostic<-matrix(NA,nrow=nrow(SNPtable_2),ncol=0)
```
And now I loop through all SNPs (columns) in this table:
```{r}
for (i in 3:ncol(SNPtable_2)) {
x_tot<-B_bufo[,i]
y_tot<-B_spino[,i]
#compare the two sets of numbers, returns a block with true/false
ans<-vapply(x_tot, function(x_tot) x_tot==y_tot, logical(length(y_tot)))
#if true occurs save in table with not_diagnostic markers
if(any(ans, na.rm=TRUE)|(all(is.na(y_tot))|all(is.na(x_tot)))){
Not_diagnostic<-cbind(Not_diagnostic_old, SNPtable_2[,i, drop=FALSE])
#if only false or NA save in table with diagnostic markers
} else {
Diagnostic<-cbind(Diagnostic_old, SNPtable_2[,i, drop=FALSE])
}
Not_diagnostic_old<-Not_diagnostic
Diagnostic_old<-Diagnostic
}
```
Awesome, now I want to summarize and check if all went well!
```{r}
Tot_original<-ncol(SNPtable)-2
Tot_original_B_bufo<-ncol(B_bufo)-2
Tot_original_B_spino<-ncol(B_spino)-2
Tot_not_diagnostic<-ncol(Not_diagnostic_old)
Tot_diagnostic<-ncol(Diagnostic_old)
Tot_not_diagnostic+Tot_diagnostic
Perc_diagnostic<-Tot_diagnostic/Tot_original*100
```
And finally I can write the diagnostic snps to a file.
```{r}
write.table(Diagnostic_old, file="RAD_Bufo_Diagnostic.txt", quote=FALSE, sep=" ", col.names=TRUE)
```
# Questions
- How many diagnostic markers do I have?
- How many non-diagnostic markers do I have?
- Does it add up to the total of markers in the input file?
- Where did the other markers go?
# Challenge
Write a code that takes one SNP per RAD locus randomly.
## Structure
You can run structure-like analysis in R (includes some nice mapping options):
http://membres-timc.imag.fr/Olivier.Francois/tutoRstructure.pdf
Or:
https://github.com/Tom-Jenkins/admixture_pie_chart_map_tutorial/blob/master/pie_chart_admixture_map_tutorial.R
I love the possibilities with visualisation with Pophelper, but other packages are available.
First, read in some data. These are the q scores for the nuclear structure run, and frequencies for mtDNA.
```{r}
Struct_table<-read.table("Structure_31SNPs_qscores_nuclear.txt", header = TRUE)
Struct_table_mtDNA<-read.table("Structure_31SNPs_qscores_mtDNA.txt", header = TRUE)
```
Now make the data structure appropriate for PopHelper, for each structure run you want to visualise, you need to make a list with the clusters assigned. For example, if want to visualise k=3 for your data, you would need to extend the list with "Cluster3" and the values.
```{r}
Q1<-list(structure_1=data.frame(Cluster1=c(Struct_table[,3]),
Cluster2=c(Struct_table[,4])))
Q2<-list(mtDNA=data.frame(Cluster1=c(Struct_table_mtDNA[,3]),
Cluster2=c(Struct_table_mtDNA[,4])))
```
Then I make the population labels appropriate for PopHelper.
```{r}
Pop_label<-Struct_table[,1, drop = FALSE]
Pop_label$Pop <- as.character(Pop_label$Pop)
```
And I check whether this now includes all my populations I want to visualise.
```{r}
nrow(Pop_label)
```
And the master-piece plot! It is only one line of code!! b(^-^)d
```{r, message=FALSE, warning=FALSE, include=FALSE}
plotQ(qlist=c(Q2[1], Q1[1]), imgoutput='join', clustercol=c("blue", "red"), grplab=Pop_label, grplabsize = 1, showlegend=T, divtype= 1, legendlab=c("Bufo bufo","Bufo spinosus"), splab=c("mtDNA", "Structure"), linecol="white", pointtype=NA, grplabpos=1, imgtype="png", outputfilename = "Structure_plot_bufo")
```
![](Structure_plot_bufo.png)
Look at the many options available for your population structure plotting in R with the PopHelper package here:
http://www.royfrancis.com/pophelper/articles/index.html#plotq
## Population genetic analyses in `adegenet`
### `adegenet`
Read in the data & replace missing data with the mean frequency (there are other methods available).
Other file formats are possible, but here we use a structure file.
First, we read in the data.
```{r, include=FALSE}
obj1<-read.structure("RAD_Bufo_alldata.str",n.ind = 373,n.loc = 4836,col.lab = 1,col.pop = 2,col.others = NA,row.marknames = 0,onerowperind = F)
```
And I check the populations in this new "obj1" to see if I have all my data now.
```{r}
levels(obj1$pop)
```
There is an option to replace missing data with the mean frequency.
```{r}
NA_mean<-tab(obj1, NA.method="mean")
```
## Principal Component Analysis
We can perform a PCA with adegenet.
```{r}
pca1<-dudi.pca(NA_mean, scannf=F, scale=F,nf=3)
```
And we make a nice plot.
```{r}
s.class(pca1$li, pop(obj1), axesel=FALSE, cstar=1, cpoint=1, col=c(rep("black", 7),
rep("blue", 4),
rep("black", 2),
rep("red", 5), "blue",
"black", "black",
rep("black", 12)))
```
This looks good.
We can also ask for summary of this PCA.
```{r}
summary(pca1)
```
# Question
- What is the variance explained per axis?
## Summaries
You can get basic information on your data using summaries, which returns both a visible output and an invisible list. This is the summary for population 11 in transect 1.
```{r}
summary(obj1[obj1$pop=="T1_11",1:100])
```
## Hardy Weinberg equilibrium
You can check whether your data is out of Hardy Weinberg equilibrium with adegenet, too.
For example, we can do the Hardy Weinberg test without the Monte Carlo test, for population 11.
```{r, warning=FALSE}
Test_pop_T1_11<-hw.test(obj1[obj1@pop=="T1_11",1:100], B=0)
```
And we can check the output.
```{r}
knitr::kable(Test_pop_T1_11[1:10,])
```
# Questions
- Why do we have rows with no output?
- Why do we have only 43 loci?
## F-statistics
You can also look for population structure with F-statistics. You can calculate Fst (), pop/totalFit (ind/total), and Fis (ind/pop) by locus.
First select only a couple of the populations that we know are diverse (hybrid), this will also make the F-statistics analysis run faster. A good package to know is poppr.
```{r}
obj2<-poppr::popsub(obj1, sublist = c("T1_11", "T1_12", "T1_13", "T1_14"))
```
Then we only want to use a couple of loci.
```{r}
obj3<-obj2[loc=c(1:10)]
```
And we need to convert this to a loci object.
```{r}
loc1<-as.loci(obj3)
```
Now our loc1 object has 10 loci for 4 populations.
```{r}
length(loc1)
```
```{r}
length(unique(loc1$population))
```
And we are able to calculate the different types of F-statistics for each of our loci
```{r}
Fst(loc1, pop=loc1$population)
```
And we can caluclate the pairwise Fst for each population (needs a gene object again):
```{r}
pairwise.fst(obj3, pop=obj3$pop, res.type = "matrix")
```
## Inbreeding
We can check for inbreeding, which is a real common problem in amphibian populations:
```{r}
temp <- inbreeding(obj3, N=10)
```
Then we can look for the results for the first individual in our dataset (1480_S104).
```{r}
knitr::kable(temp$`1480_S104`)
```
We can then take the mean for each individual.
```{r}
MeanFst <- sapply(temp, mean)
```
And plot it.
```{r}
hist(MeanFst)
```
Adegenet has many other options:
DAPC & Compoplot
With adegenet:
http://adegenet.r-forge.r-project.org/files/tutorial-dapc.pdf
## THE END ~ HAPPY CODING!!