-
Notifications
You must be signed in to change notification settings - Fork 66
/
Copy pathREADME.Rmd
916 lines (656 loc) · 28.4 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
---
title: "PCAtools: everything Principal Component Analysis"
author: "Kevin Blighe, Aaron Lun"
date: "`r Sys.Date()`"
package: "`r packageVersion('PCAtools')`"
output:
github_document:
toc: true
toc_depth: 3
fig_width: 7
bibliography: library.bib
vignette: >
%\VignetteIndexEntry{PCAtools: everything Principal Component Analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\usepackage[utf8]{inputenc}
---
# Introduction
Principal Component Analysis (PCA) is a very powerful technique that has wide applicability in data science, bioinformatics, and further afield. It was initially developed to analyse large volumes of data in order to tease out the differences/relationships between the logical entities being analysed. It extracts the fundamental structure of the data without the need to build any model to represent it. This 'summary' of the data is arrived at through a process of reduction that can transform the large number of variables into a lesser number that are uncorrelated (i.e. the ‘principal components'), while at the same time being capable of easy interpretation on the original data [@PCAtools] [@BligheK].
*PCAtools* provides functions for data exploration via PCA, and allows the user to generate publication-ready figures. PCA is performed via *BiocSingular* [@Lun] - users can also identify optimal number of principal components via different metrics, such as elbow method and Horn's parallel analysis [@Horn] [@Buja], which has relevance for data reduction in single-cell RNA-seq (scRNA-seq) and high dimensional mass cytometry data.
```{r VROOM_CONNECTION_SIZE, eval = TRUE, echo = FALSE}
Sys.setenv(VROOM_CONNECTION_SIZE='512000')
```
# Installation
## 1. Download the package from Bioconductor
```{r getPackage, eval = FALSE}
if (!requireNamespace('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install('PCAtools')
```
Note: to install development version direct from GitHub:
```{r getPackageDevel, eval = FALSE}
if (!requireNamespace('remotes', quietly = TRUE))
install.packages('remotes')
remotes::install_github('kevinblighe/PCAtools')
```
## 2. Load the package into R session
```{r Load, message = FALSE}
library(PCAtools)
```
# Quick start: *DESeq2*
For this example, we will follow the tutorial (from Section 3.1) of [RNA-seq workflow: gene-level
exploratory analysis and differential expression](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html). Specifically, we will load the 'airway' data, where different airway smooth muscle cells were treated with dexamethasone.
```{r message = FALSE}
library(airway)
library(magrittr)
data('airway')
airway$dex %<>% relevel('untrt')
```
Annotate the Ensembl gene IDs to gene symbols:
```{r message = FALSE}
ens <- rownames(airway)
library(org.Hs.eg.db)
symbols <- mapIds(org.Hs.eg.db, keys = ens,
column = c('SYMBOL'), keytype = 'ENSEMBL')
symbols <- symbols[!is.na(symbols)]
symbols <- symbols[match(rownames(airway), names(symbols))]
rownames(airway) <- symbols
keep <- !is.na(rownames(airway))
airway <- airway[keep,]
```
Normalise the data and transform the normalised counts to variance-stabilised expression levels:
```{r message = FALSE, warning = FALSE}
library('DESeq2')
dds <- DESeqDataSet(airway, design = ~ cell + dex)
dds <- DESeq(dds)
vst <- assay(vst(dds))
```
## Conduct principal component analysis (PCA):
```{r}
p <- pca(vst, metadata = colData(airway), removeVar = 0.1)
```
## A scree plot
```{r ex1, warning = FALSE, fig.height = 7, fig.width = 6, fig.cap = 'Figure 1: A scree plot'}
screeplot(p, axisLabSize = 18, titleLabSize = 22)
```
## A bi-plot
Different interpretations of the biplot exist. In the OMICs era, for most
general users, a biplot is a simple representation of samples in a
2-dimensional space, usually focusing on just the first two PCs:
```{r ex2a, eval = FALSE}
biplot(p)
```
However, the original definition of a biplot by Gabriel KR [@Gabriel] is a
plot that plots both variables and observations (samples) in the same space.
The variables are indicated by arrows drawn from the origin, which indicate
their 'weight' in different directions. We touch on this later via the
*plotLoadings* function.
```{r ex2b, fig.height = 7, fig.width = 7, fig.cap = 'Figure 2: A bi-plot'}
biplot(p, showLoadings = TRUE,
labSize = 5, pointSize = 5, sizeLoadingsNames = 5)
```
# Quick start: Gene Expression Omnibus (GEO)
Here, we will instead start with data from [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/).
We will load breast cancer gene expression data with recurrence free survival (RFS) from
[Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2990).
First, let's read in and prepare the data:
```{r, message = FALSE}
library(Biobase)
library(GEOquery)
# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix = TRUE, getGPL = FALSE)
mat <- exprs(gset[[1]])
# remove Affymetrix control probes
mat <- mat[-grep('^AFFX', rownames(mat)),]
# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
c('relation', 'age:ch1', 'distant rfs:ch1', 'er:ch1',
'ggi:ch1', 'grade:ch1', 'size:ch1',
'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
row.names = rownames(pData(gset[[1]])))
# tidy column names
colnames(metadata) <- c('Study', 'Age', 'Distant.RFS', 'ER', 'GGI', 'Grade',
'Size', 'Time.RFS')
# prepare certain phenotypes of interest
metadata$Study <- gsub('Reanalyzed by: ', '', as.character(metadata$Study))
metadata$Age <- as.numeric(gsub('^KJ', NA, as.character(metadata$Age)))
metadata$Distant.RFS <- factor(metadata$Distant.RFS,
levels = c(0,1))
metadata$ER <- factor(gsub('\\?', NA, as.character(metadata$ER)),
levels = c(0,1))
metadata$ER <- factor(ifelse(metadata$ER == 1, 'ER+', 'ER-'),
levels = c('ER-', 'ER+'))
metadata$GGI <- as.numeric(as.character(metadata$GGI))
metadata$Grade <- factor(gsub('\\?', NA, as.character(metadata$Grade)),
levels = c(1,2,3))
metadata$Grade <- gsub(1, 'Grade 1', gsub(2, 'Grade 2', gsub(3, 'Grade 3', metadata$Grade)))
metadata$Grade <- factor(metadata$Grade, levels = c('Grade 1', 'Grade 2', 'Grade 3'))
metadata$Size <- as.numeric(as.character(metadata$Size))
metadata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', NA, metadata$Time.RFS))
# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) any(is.na(x)))
metadata <- metadata[!discard,]
# filter the expression data to match the samples in our pdata
mat <- mat[,which(colnames(mat) %in% rownames(metadata))]
# check that sample names match exactly between pdata and expression data
all(colnames(mat) == rownames(metadata))
```
Conduct principal component analysis (PCA):
```{r}
p <- pca(mat, metadata = metadata, removeVar = 0.1)
```
## A bi-plot
```{r ex3a, eval = FALSE}
biplot(p)
```
```{r ex3b, fig.height = 6, fig.width = 7, fig.cap = 'Figure 3: A bi-plot'}
biplot(p, showLoadings = TRUE, lab = NULL)
```
One of the probes pointing downward is *205225_at*, which targets the *ESR1*
gene. This is already a useful validation, as the oestrogen receptor, which
is in part encoded by *ESR1*, is strongly represented by PC2 (y-axis), with
negative-to-positive receptor status going from top-to-bottom.
More on this later in this vignette.
## A pairs plot
```{r ex4, message = FALSE, fig.height = 10, fig.width = 10, fig.cap = 'Figure 4: A pairs plot'}
pairsplot(p)
```
## A loadings plot
If the biplot was previously generated with *showLoadings = TRUE*, check how
this loadings plot corresponds to the biplot loadings - they should match up
for the top hits.
```{r ex5, fig.height = 6, fig.width = 8, fig.cap = 'Figure 5: A loadings plot'}
plotloadings(p, labSize = 3)
```
## An eigencor plot
```{r ex6, warning = FALSE, fig.height = 4, fig.width = 8, fig.cap = 'Figure 6: An eigencor plot'}
eigencorplot(p,
metavars = c('Study','Age','Distant.RFS','ER',
'GGI','Grade','Size','Time.RFS'))
```
## Access the internal data
The rotated data that represents the observations / samples is stored in *rotated*,
while the variable loadings are stored in *loadings*
```{r}
p$rotated[1:5,1:5]
p$loadings[1:5,1:5]
```
# Advanced features
All functions in *PCAtools* are highly configurable and should cover virtually
all basic and advanced user requirements. The following sections take a look
at some of these advanced features, and form a somewhat practical example of
how one can use *PCAtools* to make a clinical interpretation of data.
First, let's sort out the gene annotation by mapping the probe IDs to gene
symbols. The array used for this study was the Affymetrix U133a, so let's use
the *hgu133a.db* Bioconductor package:
```{r, messages = FALSE}
suppressMessages(require(hgu133a.db))
newnames <- mapIds(hgu133a.db,
keys = rownames(p$loadings),
column = c('SYMBOL'),
keytype = 'PROBEID')
# tidy up for NULL mappings and duplicated gene symbols
newnames <- ifelse(is.na(newnames) | duplicated(newnames),
names(newnames), newnames)
rownames(p$loadings) <- newnames
```
## Determine optimum number of PCs to retain
A scree plot on its own just shows the accumulative proportion of explained
variation, but how can we determine the optimum number of PCs to retain?
*PCAtools* provides four metrics for this purpose:
* Elbow method (`findElbowPoint()`)
* Horn's parallel analysis [@Horn] [@Buja] (`parallelPCA()`)
* Marchenko-Pastur limit (`chooseMarchenkoPastur()`)
* Gavish-Donoho method (`chooseGavishDonoho()`)
Let's perform Horn's parallel analysis first:
```{r, warning = FALSE}
horn <- parallelPCA(mat)
horn$n
```
Now the elbow method:
```{r}
elbow <- findElbowPoint(p$variance)
elbow
```
In most cases, the identified values will disagree. This is because finding the
correct number of PCs is a difficult task and is akin to finding the 'correct'
number of clusters in a dataset - there is no correct answer.
Taking these values, we can produce a new scree plot and mark these:
```{r ex7, fig.height = 7, fig.width = 9, fig.cap = 'Figure 7: Advanced scree plot illustrating optimum number of PCs'}
library(ggplot2)
screeplot(p,
components = getComponents(p, 1:20),
vline = c(horn$n, elbow)) +
geom_label(aes(x = horn$n + 1, y = 50,
label = 'Horn\'s', vjust = -1, size = 8)) +
geom_label(aes(x = elbow + 1, y = 50,
label = 'Elbow method', vjust = -1, size = 8))
```
If all else fails, one can simply take the number of PCs that contributes to
a pre-selected total of explained variation, e.g., in this case, 27 PCs account
for >80% explained variation.
```{r}
which(cumsum(p$variance) > 80)[1]
```
## Modify bi-plots
The bi-plot comparing PC1 versus PC2 is the most characteristic plot of PCA.
However, PCA is much more than the bi-plot and much more than PC1 and PC2. This
said, PC1 and PC2, by the very nature of PCA, are indeed usually the most
important parts of a PCA analysis.
In a bi-plot, we can shade the points by different groups and add many more features.
### Colour by a metadata factor, use a custom label, add lines through origin, and add legend
```{r ex8, fig.height = 7, fig.width = 7.5, fig.cap = 'Figure 8: Colour by a metadata factor, use a custom label, add lines through origin, and add legend'}
biplot(p,
lab = paste0(p$metadata$Age, ' años'),
colby = 'ER',
hline = 0, vline = 0,
legendPosition = 'right')
```
### Supply custom colours and encircle variables by group
The encircle functionality literally draws a polygon around
each group specified by *colby*. It says nothing about any statistic
pertaining to each group.
```{r ex9, message = FALSE, fig.height = 7.5, fig.width = 7, fig.cap = 'Figure 9: Supply custom colours and encircle variables by group'}
biplot(p,
colby = 'ER', colkey = c('ER+' = 'forestgreen', 'ER-' = 'purple'),
colLegendTitle = 'ER-\nstatus',
# encircle config
encircle = TRUE,
encircleFill = TRUE,
hline = 0, vline = c(-25, 0, 25),
legendPosition = 'top', legendLabSize = 16, legendIconSize = 8.0)
biplot(p,
colby = 'ER', colkey = c('ER+' = 'forestgreen', 'ER-' = 'purple'),
colLegendTitle = 'ER-\nstatus',
# encircle config
encircle = TRUE, encircleFill = FALSE,
encircleAlpha = 1, encircleLineSize = 5,
hline = 0, vline = c(-25, 0, 25),
legendPosition = 'top', legendLabSize = 16, legendIconSize = 8.0)
```
### Stat ellipses
Stat ellipses are also drawn around each group but have a greater statistical
meaning and can be used, for example, as a strict determination of outlier
samples. Here, we draw ellipses around each group at the 95% confidence level:
```{r ex10, message = FALSE, fig.height = 7.5, fig.width = 7, fig.cap = 'Figure 10: Stat ellipses'}
biplot(p,
colby = 'ER', colkey = c('ER+' = 'forestgreen', 'ER-' = 'purple'),
# ellipse config
ellipse = TRUE,
ellipseType = 't',
ellipseLevel = 0.95,
ellipseFill = TRUE,
ellipseAlpha = 1/4,
ellipseLineSize = 1.0,
xlim = c(-125,125), ylim = c(-50, 80),
hline = 0, vline = c(-25, 0, 25),
legendPosition = 'top', legendLabSize = 16, legendIconSize = 8.0)
biplot(p,
colby = 'ER', colkey = c('ER+' = 'forestgreen', 'ER-' = 'purple'),
# ellipse config
ellipse = TRUE,
ellipseType = 't',
ellipseLevel = 0.95,
ellipseFill = TRUE,
ellipseAlpha = 1/4,
ellipseLineSize = 0,
ellipseFillKey = c('ER+' = 'yellow', 'ER-' = 'pink'),
xlim = c(-125,125), ylim = c(-50, 80),
hline = 0, vline = c(-25, 0, 25),
legendPosition = 'top', legendLabSize = 16, legendIconSize = 8.0)
```
### Change shape based on tumour grade, remove connectors, and add titles
```{r ex11, message = FALSE, eval = FALSE}
biplot(p,
colby = 'ER', colkey = c('ER+' = 'forestgreen', 'ER-' = 'purple'),
hline = c(-25, 0, 25), vline = c(-25, 0, 25),
legendPosition = 'top', legendLabSize = 13, legendIconSize = 8.0,
shape = 'Grade', shapekey = c('Grade 1' = 15, 'Grade 2' = 17, 'Grade 3' = 8),
drawConnectors = FALSE,
title = 'PCA bi-plot',
subtitle = 'PC1 versus PC2',
caption = '27 PCs ≈ 80%')
```
### Modify line types, remove gridlines, and increase point size
```{r ex12a}
biplot(p,
lab = NULL,
colby = 'ER', colkey = c('ER+'='royalblue', 'ER-'='red3'),
hline = c(-25, 0, 25), vline = c(-25, 0, 25),
vlineType = c('dotdash', 'solid', 'dashed'),
gridlines.major = FALSE, gridlines.minor = FALSE,
pointSize = 5,
legendPosition = 'left', legendLabSize = 14, legendIconSize = 8.0,
shape = 'Grade', shapekey = c('Grade 1'=15, 'Grade 2'=17, 'Grade 3'=8),
drawConnectors = FALSE,
title = 'PCA bi-plot',
subtitle = 'PC1 versus PC2',
caption = '27 PCs ≈ 80%')
```
Let's plot the same as above but with loadings:
```{r ex12b, message = FALSE, fig.height = 5.5, fig.width = 11, fig.cap = 'Figure 11: Modify line types, remove gridlines, and increase point size'}
biplot(p,
# loadings parameters
showLoadings = TRUE,
lengthLoadingsArrowsFactor = 1.5,
sizeLoadingsNames = 4,
colLoadingsNames = 'red4',
# other parameters
lab = NULL,
colby = 'ER', colkey = c('ER+'='royalblue', 'ER-'='red3'),
hline = 0, vline = c(-25, 0, 25),
vlineType = c('dotdash', 'solid', 'dashed'),
gridlines.major = FALSE, gridlines.minor = FALSE,
pointSize = 5,
legendPosition = 'left', legendLabSize = 14, legendIconSize = 8.0,
shape = 'Grade', shapekey = c('Grade 1'=15, 'Grade 2'=17, 'Grade 3'=8),
drawConnectors = FALSE,
title = 'PCA bi-plot',
subtitle = 'PC1 versus PC2',
caption = '27 PCs ≈ 80%')
```
### Colour by a continuous variable and plot other PCs
There are two ways to colour by a continuous variable. In the first way, we simply
'add on' a continuous colour scale via *scale_colour_gradient*:
```{r ex13a, fig.height = 5.5, fig.width = 6, fig.cap = 'Figure 12: Colour by a continuous variable and plot other PCs'}
# add ESR1 gene expression to the metadata
p$metadata$ESR1 <- mat['205225_at',]
biplot(p,
x = 'PC2', y = 'PC3',
lab = NULL,
colby = 'ESR1',
shape = 'ER',
hline = 0, vline = 0,
legendPosition = 'right') +
scale_colour_gradient(low = 'gold', high = 'red2')
```
We can also just permit that the internal *ggplot2* engine picks the colour
scheme - here, we also plot PC10 versus PC50:
```{r ex13b, eval = FALSE}
# was always eval = FALSE
biplot(p, x = 'PC10', y = 'PC50',
lab = NULL,
colby = 'Age',
hline = 0, vline = 0,
hlineWidth = 1.0, vlineWidth = 1.0,
gridlines.major = FALSE, gridlines.minor = TRUE,
pointSize = 5,
legendPosition = 'left', legendLabSize = 16, legendIconSize = 8.0,
shape = 'Grade', shapekey = c('Grade 1'=15, 'Grade 2'=17, 'Grade 3'=8),
drawConnectors = FALSE,
title = 'PCA bi-plot',
subtitle = 'PC10 versus PC50',
caption = '27 PCs ≈ 80%')
```
## Quickly explore potentially informative PCs via a pairs plot
The pairs plot in PCA unfortunately suffers from a lack of use; however, for
those who love exploring data and squeezing every last ounce of information out
of data, a pairs plot provides for a relatively quick way to explore useful
leads for other downstream analyses.
As the number of pairwise plots increases, however, space becomes limited. We
can shut off titles and axis labeling to save space. Reducing point size and
colouring by a variable of interest can additionally help us to rapidly skim
over the data.
```{r ex14, message = FALSE, fig.height = 8, fig.width = 7, fig.cap = 'Figure 13: Quickly explore potentially informative PCs via a pairs plot'}
pairsplot(p,
components = getComponents(p, c(1:10)),
triangle = TRUE, trianglelabSize = 12,
hline = 0, vline = 0,
pointSize = 0.4,
gridlines.major = FALSE, gridlines.minor = FALSE,
colby = 'Grade',
title = 'Pairs plot', plotaxes = FALSE,
margingaps = unit(c(-0.01, -0.01, -0.01, -0.01), 'cm'))
```
We can arrange these in a way that makes better use of the screen space by
setting 'triangle = FALSE'. In this case, we can further control the layout
with the 'ncol' and 'nrow' parameters, although, the function will
automatically determine these based on your input data.
```{r ex15, fig.height = 5.5, fig.width = 6, fig.cap = 'Figure 14: arranging a pairs plot horizontally'}
pairsplot(p,
components = getComponents(p, c(4,33,11,1)),
triangle = FALSE,
hline = 0, vline = 0,
pointSize = 0.8,
gridlines.major = FALSE, gridlines.minor = FALSE,
colby = 'ER',
title = 'Pairs plot', titleLabSize = 22,
axisLabSize = 14, plotaxes = TRUE,
margingaps = unit(c(0.1, 0.1, 0.1, 0.1), 'cm'))
```
## Determine the variables that drive variation among each PC
If, on the bi-plot or pairs plot, we encounter evidence that 1 or more PCs
are segregating a factor of interest, we can explore further the genes that
are driving these differences along each PC.
For each PC of interest, 'plotloadings' determines the variables falling within
the top/bottom 5% of the loadings range, and then creates a final consensus
list of these. These variables are then plotted.
The loadings plot, like all others, is highly configurable. To modify the
cut-off for inclusion / exclusion of variables, we use *rangeRetain*, where
0.01 equates to the top/bottom 1% of the loadings range per PC.
```{r ex16, fig.height = 5.5, fig.width = 7, fig.cap = 'Figure 15: Determine the variables that drive variation among each PC'}
plotloadings(p,
rangeRetain = 0.01,
labSize = 4.0,
title = 'Loadings plot',
subtitle = 'PC1, PC2, PC3, PC4, PC5',
caption = 'Top 1% variables',
shape = 24,
col = c('limegreen', 'black', 'red3'),
drawConnectors = TRUE)
```
At least one interesting finding is *205225_at* / *ESR1*, which is by far the gene
most responsible for variation along PC2. The previous bi-plots showed that
this PC also segregated ER+ from ER- patients. The other results could be
explored. Also, from the biplots with loadings that we have already generated,
this result is also verified in these.
With the loadings plot, in addition, we can instead plot absolute values and
modify the point sizes to be proportional to the loadings. We can also switch
off the line connectors and plot the loadings for any PCs
```{r ex17a, fig.height = 7, fig.width = 9, fig.cap = 'Figure 16: plotting absolute component loadings'}
plotloadings(p,
components = getComponents(p, c(4,33,11,1)),
rangeRetain = 0.1,
labSize = 4.0,
absolute = FALSE,
title = 'Loadings plot',
subtitle = 'Misc PCs',
caption = 'Top 10% variables',
shape = 23, shapeSizeRange = c(1, 16),
col = c('white', 'pink'),
drawConnectors = FALSE)
```
We can plot just this single PC and flip the plot on its side, if we wish:
```{r ex17b, fig.height = 4.5, fig.width = 9, fig.cap = 'Figure 17: plotting absolute component loadings'}
plotloadings(p,
components = getComponents(p, c(2)),
rangeRetain = 0.12, absolute = TRUE,
col = c('black', 'pink', 'red4'),
drawConnectors = TRUE, labSize = 4) + coord_flip()
```
## Correlate the principal components back to the clinical data
Further exploration of the PCs can come through correlations with clinical
data. This is also a mostly untapped resource in the era of 'big data' and
can help to guide an analysis down a particular path.
We may wish, for example, to correlate all PCs that account for 80% variation in
our dataset and then explore further the PCs that have statistically significant
correlations.
'eigencorplot' is built upon another function by the *PCAtools* developers, namely
[CorLevelPlot](https://github.com/kevinblighe/CorLevelPlot). Further examples
can be found there.
**NB - for factors, ensure that these are ordered in a logical fashion prior to running this function**
```{r ex18a, warning = FALSE, fig.height = 4.25, fig.width = 12, fig.cap = 'Figure 18: Correlate the principal components back to the clinical data'}
eigencorplot(p,
components = getComponents(p, 1:27),
metavars = c('Study','Age','Distant.RFS','ER',
'GGI','Grade','Size','Time.RFS'),
col = c('darkblue', 'blue2', 'black', 'red2', 'darkred'),
cexCorval = 0.7,
colCorval = 'white',
fontCorval = 2,
posLab = 'bottomleft',
rotLabX = 45,
posColKey = 'top',
cexLabColKey = 1.5,
scale = TRUE,
main = 'PC1-27 clinical correlations',
colFrame = 'white',
plotRsquared = FALSE)
```
We can also supply different cut-offs for statistical significance, apply
p-value adjustment, plot R-squared values, and specify correlation method:
```{r ex18b, warning = FALSE, fig.height = 5, fig.width = 12, fig.cap = 'Figure 19: Correlate the principal components back to the clinical data'}
eigencorplot(p,
components = getComponents(p, 1:horn$n),
metavars = c('Study','Age','Distant.RFS','ER','GGI',
'Grade','Size','Time.RFS'),
col = c('white', 'cornsilk1', 'gold', 'forestgreen', 'darkgreen'),
cexCorval = 1.2,
fontCorval = 2,
posLab = 'all',
rotLabX = 45,
scale = TRUE,
main = bquote(Principal ~ component ~ Pearson ~ r^2 ~ clinical ~ correlates),
plotRsquared = TRUE,
corFUN = 'pearson',
corUSE = 'pairwise.complete.obs',
corMultipleTestCorrection = 'BH',
signifSymbols = c('****', '***', '**', '*', ''),
signifCutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1))
```
Clearly, PC2 is coming across as the most interesting PC in this experiment,
with highly statistically significant correlation (p<0.0001) to ER status,
tumour grade, and GGI (genomic Grade Index), an indicator of response.
It comes as no surprise that the gene driving most variationn along PC2 is
*ESR1*, identified from our loadings plot.
This information is, of course, not new, but shows how PCA is much more than
just a bi-plot used to identify outliers!
## Plot the entire project on a single panel
```{r ex19, message = FALSE, warning = FALSE, fig.height = 10, fig.width = 15, fig.cap = 'Figure 20: a merged panel of all PCAtools plots'}
pscree <- screeplot(p, components = getComponents(p, 1:30),
hline = 80, vline = 27, axisLabSize = 14, titleLabSize = 20,
returnPlot = FALSE) +
geom_label(aes(20, 80, label = '80% explained variation', vjust = -1, size = 8))
ppairs <- pairsplot(p, components = getComponents(p, c(1:3)),
triangle = TRUE, trianglelabSize = 12,
hline = 0, vline = 0,
pointSize = 0.8, gridlines.major = FALSE, gridlines.minor = FALSE,
colby = 'Grade',
title = '', plotaxes = FALSE,
margingaps = unit(c(0.01, 0.01, 0.01, 0.01), 'cm'),
returnPlot = FALSE)
pbiplot <- biplot(p,
# loadings parameters
showLoadings = TRUE,
lengthLoadingsArrowsFactor = 1.5,
sizeLoadingsNames = 4,
colLoadingsNames = 'red4',
# other parameters
lab = NULL,
colby = 'ER', colkey = c('ER+'='royalblue', 'ER-'='red3'),
hline = 0, vline = c(-25, 0, 25),
vlineType = c('dotdash', 'solid', 'dashed'),
gridlines.major = FALSE, gridlines.minor = FALSE,
pointSize = 5,
legendPosition = 'none', legendLabSize = 16, legendIconSize = 8.0,
shape = 'Grade', shapekey = c('Grade 1'=15, 'Grade 2'=17, 'Grade 3'=8),
drawConnectors = FALSE,
title = 'PCA bi-plot',
subtitle = 'PC1 versus PC2',
caption = '27 PCs ≈ 80%',
returnPlot = FALSE)
ploadings <- plotloadings(p, rangeRetain = 0.01, labSize = 4,
title = 'Loadings plot', axisLabSize = 12,
subtitle = 'PC1, PC2, PC3, PC4, PC5',
caption = 'Top 1% variables',
shape = 24, shapeSizeRange = c(4, 8),
col = c('limegreen', 'black', 'red3'),
legendPosition = 'none',
drawConnectors = FALSE,
returnPlot = FALSE)
peigencor <- eigencorplot(p,
components = getComponents(p, 1:10),
metavars = c('Study','Age','Distant.RFS','ER',
'GGI','Grade','Size','Time.RFS'),
cexCorval = 1.0,
fontCorval = 2,
posLab = 'all',
rotLabX = 45,
scale = TRUE,
main = "PC clinical correlates",
cexMain = 1.5,
plotRsquared = FALSE,
corFUN = 'pearson',
corUSE = 'pairwise.complete.obs',
signifSymbols = c('****', '***', '**', '*', ''),
signifCutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1),
returnPlot = FALSE)
library(cowplot)
library(ggplotify)
top_row <- plot_grid(pscree, ppairs, pbiplot,
ncol = 3,
labels = c('A', 'B Pairs plot', 'C'),
label_fontfamily = 'serif',
label_fontface = 'bold',
label_size = 22,
align = 'h',
rel_widths = c(1.10, 0.80, 1.10))
bottom_row <- plot_grid(ploadings,
as.grob(peigencor),
ncol = 2,
labels = c('D', 'E'),
label_fontfamily = 'serif',
label_fontface = 'bold',
label_size = 22,
align = 'h',
rel_widths = c(0.8, 1.2))
plot_grid(top_row, bottom_row, ncol = 1,
rel_heights = c(1.1, 0.9))
```
## Make predictions on new data
It is possible to use the variable loadings as part of a matrix calculation
to 'predict' principal component eigenvectors in new data. This is elaborated
in a posting by Pandula Priyadarshana: [How to use Principal Component Analysis (PCA) to make Predictions](https://rpubs.com/PandulaP/PCA_for_Predictions).
The *pca* class, which is created by *PCAtools*, is not configured to work with
*stats::predict*; however, trusty *prcomp* class **is** configured. We can manually
create a *prcomp* object and then use that in model prediction, as elaborated
in the following code chunk:
```{r}
p <- pca(mat, metadata = metadata, removeVar = 0.1)
p.prcomp <- list(sdev = p$sdev,
rotation = data.matrix(p$loadings),
x = data.matrix(p$rotated),
center = TRUE, scale = TRUE)
class(p.prcomp) <- 'prcomp'
# for this simple example, just use a chunk of
# the original data for the prediction
newdata <- t(mat[,seq(1,20)])
predict(p.prcomp, newdata = newdata)[,1:5]
```
# Acknowledgments
The development of *PCAtools* has benefited from contributions
and suggestions from:
* Krushna Chandra Murmu
* Jinsheng
* Myles Lewis
* Anna-Leigh Brown
* Vincent Carey
* Vince Vu
* Guido Hooiveld
* pwwang
* Pandula Priyadarshana
* Barley Rose Collier Harris
* Bob Policastro
# Session info
```{r}
sessionInfo()
```
# References
@PCAtools
@BligheK
@Horn
@Buja
@Lun
@Gabriel