-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathWorkflow.Rmd
1731 lines (1523 loc) · 84.9 KB
/
Workflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "BASiCS workflow: a step-by-step analysis of expression variability using single cell RNA sequencing data"
author:
- name: Alan O'Callaghan
affiliation:
- &MRC MRC Human Genetics Unit, Institute of Genetics \& Cancer,
University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh,
EH4 2XU, UK
email: "[email protected]"
- name: Nils Eling
affiliation:
- &UZH Department of Quantitative Biomedicine, University of Zurich,
Winterthurerstrasse 190, CH-8057, Zurich, Switzerland
- Ð Institute for Molecular Health Sciences, ETH Zurich,
Otto-Stern Weg 7, 8093 Zurich, Switzerland
- name: John C. Marioni
affiliation:
- &EBI European Molecular Biology Laboratory, European Bioinformatics
Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- &CRUK Cancer Research UK Cambridge Institute, University of Cambridge,
Li Ka Shing Centre, Cambridge, CB2 0RE, UK
- name: Catalina A. Vallejos
affiliation:
- *MRC
- The Alan Turing Institute, British Library, 96 Euston Road, London,
NW1 2DB, UK
email: "[email protected]"
abstract: |
Cell-to-cell gene expression variability is an inherent feature of complex
biological systems, such as immunity and development. Single-cell RNA
sequencing is a powerful tool to quantify this heterogeneity, but it is prone
to technical noise, beyond what is observed in bulk-level assays.
In this article, we describe a step-by-step
computational workflow that uses the BASiCS Bioconductor package to robustly
quantify expression variability within and between known groups of cells (such
as experimental conditions or cell types). BASiCS uses an integrated framework
for data normalisation, technical noise quantification and downstream
analyses, propagating statistical uncertainty across these steps.
Within a single seemingly homogeneous cell population, BASiCS can identify
highly variable genes that exhibit strong heterogeneity as well as lowly
variable genes with stable expression. BASiCS also uses a probabilistic
decision rule to identify changes in expression variability between cell
populations, whilst avoiding confounding effects related to differences in
technical noise or in overall abundance. Using a publicly available
dataset, we guide users through a complete pipeline that includes
preliminary steps for quality control, as well as data exploration
using the scater and scran Bioconductor packages. The workflow is accompanied
by a Docker image that ensures the reproducibility of our results.
keywords: Single-cell RNA sequencing, expression variability,
transcriptional noise, differential expression testing
bibliography: Workflow.bib
urlcolor: Orange
output:
BiocWorkflowTools::f1000_article:
fig_width: 6
fig_height: 3.5
# output: word_document
csl: american-medical-association.csl
---
```{r setup_knitr, include = FALSE, cache = FALSE}
library("BiocStyle")
## Decide whether to display parts for BioC (TRUE) or F1000 (FALSE)
on.bioc <- FALSE
library("knitr")
library("ggplot2")
theme_set(theme_bw())
## Use fig.width = 7 for html and fig.width = 6 for pdf
## fig.width <- ifelse(on.bioc, 10, 6)
if (knitr::is_html_output()) {
out_width <- "700px"
out_height <- "600px"
} else if (knitr::is_latex_output()) {
out_width <- "2.5in"
out_height <- "3.5in"
}
knitr::opts_chunk$set(
warning = FALSE, message = FALSE, error = FALSE,
cache = 2, cache.path = "cache_main/",
fig.path = "figure/"
)
options(timeout=100)
```
# Introduction
<!--- scRNA-seq and the different types of heterogeneity --->
Single-cell RNA-sequencing (scRNA-seq) enables the study of genome-wide
cell-to-cell transcriptional heterogeneity that is not captured by bulk
experiments [@Stegle2015; @Prakadan2017; @Patange2018].
On the broadest level, this heterogeneity can reflect the presence of distinct
cell subtypes or states.
Alternatively, it can be due to gradual changes along biological processes,
such as development and differentiation.
Several clustering and pseudotime inference tools have been developed to
capture these types of heterogeneity [@Kiselev2019; @Saelens2019].
However, there is a limited availability of tools tailored
to study more subtle variability within seemingly homogeneous cell populations.
This variability can reflect deterministic or stochastic events that regulate
gene expression and, among other settings, has been seen to increase prior to
cell fate decisions [@Mojtahedi2016] and during ageing [@Martinez-jimenez2017].
Transcriptional variability has also been observed to differ from gene to gene
and can be conserved across cell types and species [@Lin2019].
<!--- Narrow focus to transcriptional noise --->
Stochastic variability within a seemingly homogeneous cell population --- often
referred to as transcriptional *noise* --- can arise from intrinsic and
extrinsic sources [@Elowitz2002; @Eling2019].
Extrinsic noise refers to stochastic fluctuations induced by
different dynamic cellular states (e.g. cell cycle, metabolism,
intra/inter-cellular signalling) [@Zopf2013; @Iwamoto2016; @Kiviet2014].
In contrast, intrinsic noise arises from stochastic effects on biochemical
processes such as transcription and translation [@Elowitz2002].
Intrinsic noise can be modulated by genetic and epigenetic modifications (such
as mutations, histone modifications, CpG island length and nucleosome
positioning) [@Eberwine2015; @Faure2017; @Morgan2018] and usually occurs
at the gene level [@Elowitz2002].
Cell-to-cell gene expression variability estimates derived from scRNA-seq data
capture a combination of these effects, as well as deterministic regulatory
mechanisms [@Eling2019].
Moreover, these variability estimates can also be inflated by the technical
noise that is typically observed in scRNA-seq data [@Brennecke2013]. This
technical noise relates to systematic differences between cells that may be
introduced by the data generating process (e.g., due to differences in dilution
or sequencing depth) [@Vallejos2017].
<!--- Experimental strategies to tackle technical noise --->
Different strategies have been incorporated into scRNA-seq protocols to control
or attenuate technical noise.
For example, external RNA spike-in molecules (such as the set introduced by the
External RNA Controls Consortium, ERCC [@Rna2005]) can be added to each cell's
lysate in a (theoretically) known fixed quantity.
Spike-ins can assist quality control steps [@McCarthy2017], data normalisation
[@Vallejos2017] and can be used to infer technical noise [@Brennecke2013].
Another strategy is to tag individual cDNA molecules using unique molecular
identifiers (UMIs) before PCR amplification [@Islam2014].
Reads that contain the same UMI can be collapsed into a single molecule count,
attenuating technical variability associated to cell-to-cell differences
in amplification and sequencing depth (these technical biases are not fully
removed unless sequencing to saturation [@Vallejos2017]).
However, despite the benefits associated to the use of spike-ins and UMIs,
these are not routinely available for all scRNA-seq protocols [@Haque2017].
In particular, spike-ins are of limited use in droplet-based protocols,
as spike-ins can only be added to the reagent mixture in a known concentration,
and the exact quantity in each droplet necessarily remains unknown [@Bacher2016].
<!--- Introduce BASiCS --->
The Bioconductor package `r Biocpkg("BASiCS")` implements a Bayesian
hierarchical framework that accounts for both technical and biological sources
of noise in scRNA-seq datasets [@Vallejos2015; @Vallejos2016; @Eling2017]. The
model was initially motivated by supervised experimental designs in which
experimental conditions correspond to groups of cells defined *a priori* (e.g.
selected cell types obtained through FACS sorting). However, the approach can
also be used in cases where the groups of cells of interest are computationally
identified through clustering. In such cases, the model does not currently
address issues associated to *post-selection inference*, where the same data is
analysed twice: first to perform clustering and then to compare expression
profiles between the clusters [@Lahnemann2020]; this is an inherent limitation
of most differential expression tools.
`r Biocpkg("BASiCS")` jointly performs data normalisation, technical noise
quantification and downstream analyses, whilst propagating statistical
uncertainty across these steps.
These features are implemented within a probabilistic model that builds upon a
negative binomial distribution, a widely used statistical model in the context
of bulk and scRNA-seq experiments [@Love2014;@Svensson2020;@Townes2020;@Townes2019].
The negative binomial distribution is commonly used to model count data when
the observed variability differs from what can be captured by a simpler
Poisson model --- this is typically referred to as over-dispersion.
<!--- emphasis, this is critical --->
Critically, `r Biocpkg("BASiCS")` enables the quantification of transcriptional
variability within a population of cells, while accounting for the overall
mean-variance relationship that typically arises in scRNA-seq data [@Eling2018].
The latter enables the quantification of a gene-level *residual over-dispersion*
as a measure of transcriptional noise that is not confounded by differences in
gene expression. Furthermore, when available, `r Biocpkg("BASiCS")` can also
leverage extrinsic spike-in molecules to aid data normalisation.
A previous study has shown that BASiCS, when used as a generative
model, is well-tuned to capture and recapitulate the main properties of
typical scRNAseq data using posterior predictive checks [@Zappia2017].
<!--- Describes aims for the workflow and introduces BASiCS --->
This article complements existing scRNA-seq workflows based on the Bioconductor
ecosystem (e.g. [@Lun2016; @Kim2019]), providing a detailed framework for
transcriptional variability analyses using `r Biocpkg("BASiCS")`.
We describe a step-by-step workflow that uses `r Biocpkg("scater")`
[@McCarthy2017] and `r Biocpkg("scran")` [@Lun2016] to perform quality control
(QC) as well as initial exploratory analyses.
Our analysis pipeline includes practical guidance to assess the convergence of
the Markov Chain Monte Carlo (MCMC) algorithm that is used to infer model
parameters in `r Biocpkg("BASiCS")`, as well as recommendations to interpret
and post-process the model outputs.
Finally, through a case study in the context of mouse immune cells, we illustrate
how `r Biocpkg("BASiCS")` can identify highly and lowly variable
genes within a cell population, as well as to compare expression profiles
between experimental conditions or cell types.
All source code used to generate the results presented in this article is
available [on Github](https://github.com/VallejosGroup/BASiCSWorkflow).
To ensure the
reproducibility of this workflow, the analysis environment and all software
dependencies are provided as a Docker image [@Boettiger2015]. The image
can be obtained from
[Docker Hub](https://hub.docker.com/r/alanocallaghan/basicsworkflow2020-docker).
# Methods {#methods}
## Implementation
The `r Biocpkg("BASiCS")` Bioconductor package uses a Bayesian hierarchical
model to simultaneously perform data normalisation, technical noise
quantification and downstream analyses [@Vallejos2015;@Vallejos2016;@Eling2018]
within a cell population or populations under study. In this context,
cell populations could correspond to groups set a priori by the experimental
design (e.g. naive or stimulated CD4+ T cells in [@Martinez-jimenez2017]),
or to groups of cells that were computationally identified through clustering.
Moreover, instead of modelling expression patterns separately for
each gene, `r Biocpkg("BASiCS")` shares information between all genes to
robustly quantify transcriptional variability. For example, as described by
[@Eling2018], pooling information across genes with similar mean expression
levels helps to obtain more reliable transcriptional variability estimates. The
latter is particularly helpful for lowly expressed genes and sparse datasets,
where less information is available.
A high-level overview for the model implemented in `r Biocpkg("BASiCS")` is
shown in Figure \@ref(fig:fig1-basics-schematic),
and a more detailed description is provided in the `ExtendedMethods` document
in the Zenodo repository for this manuscript (see [Data availability](#data-availability)).
Model parameters are designed to capture different
aspects of scRNAseq data. First, *nuisance* parameters are used to
capture technical artifacts. This includes cell-specific parameters (indexed by
a $j$ subscript) used to perform global scaling normalisation (similar to the
size factors in `r Biocpkg("scran")`) and global parameters designed to capture
technical over-dispersion (this can also be interpreted as measurement error)
that is shared by all genes. Note that, whilst `r Biocpkg("scran")` infers
cell-specific size factors as a pre-processing step, `r Biocpkg("BASiCS")` uses
an integrated approach wherein data normalisation and downstream analyses are
performed simultaneously, thereby propagating statistical uncertainty. Secondly,
`r Biocpkg("BASiCS")` summarises expression patterns within the population of
cells under study (e.g. a experimental condition or cell type) using two sets
of gene-specific parameters (indexed by a $i$ subscript). Mean parameters
$\mu_i$ quantify overall expression for each gene $i$. In contrast, $\delta_i$
captures the excess of variability that is observed with respect to what would
be expected in a homogeneous cell population, beyond technical noise.
The latter is used as a proxy to quantify transcriptional variability.
To account for the strong relationship that is typically observed between
gene-specific mean expression and over-dispersion estimates, Eling *et al.*
[@Eling2018] introduced a *joint prior* specification for these parameters.
This joint prior assumes that genes with similar mean expression ($\mu_i$) have
similar over-dispersion parameters $\delta_i$. Effectively, this shrinks
over-dispersion estimates towards a global trend that captures the relationship
between mean and over-dispersion (Figure \@ref(fig:fig1-basics-schematic)).
This improves posterior inference for over-dispersion parameters when the data
is less informative (e.g. small sample size, lowly expressed genes)
[@Eling2018]. This information-sharing approach is conceptually similar to
that performed by @Love2014 and others, where sparse data is pooled to obtain
more reliable estimates. The global trend is then used to derive gene-specific
*residual over-dispersion* parameters $\epsilon_i$ that are not confounded by
mean expression. Similar to the DM values implemented in `r Biocpkg("scran")`,
these are defined as deviations with respect to the overall trend.
These residual over-dispersion values can be used to quantify transcriptional
heterogeneity without any confounding with mean expression, though
it should be noted that residual over-dispersion may also arise due to
structural heterogeneity (e.g. distinct cell sub-types or states).
For example, @Eling2018 computationally explored the impact of unobserved
structural heterogeneity by generating artificially contaminated datasets.
In contrast, @Martinez-jimenez2017 showed that changes in transcriptional
heterogeneity can related to aging, where older mice were shown to exhibit
a more "bursting" or stochastic transcriptional pattern relative to younger mice.
```{r fig1-basics-schematic, echo = FALSE, out.width="\\textwidth", fig.cap="A schematic representation of the features of the BASiCS model. BASiCS accounts for cell-to-cell and batch-to-batch variability using nuisance parameters. Accounting for cell-to-cell and batch-to-batch technical variability allows BASiCS to robustly perform gene-level inference of mean and variability. Furthermore, by accounting for the association between mean and variability in scRNAseq, BASiCS can also infer transcriptional noise using the residual variability parameter $\\epsilon_i$."}
knitr::include_graphics("figure/Workflow-Schematic.pdf", auto_pdf = TRUE)
```
Additionally, technical variation is quantified
using replication [@Carroll2005]. In the absence of true technical replicates,
we assume that population-level characteristics of the cells are replicated
using appropriate experimental design. This requires that cells from the same
population have been randomly allocated to different batches. Given appropriate
experimental design, `r Biocpkg("BASiCS")` assumes that biological effects
are shared across batches, while technical variation leads to spurious
differences between cells in different batches. Thus, `r Biocpkg("BASiCS")`
models cell-specific size factors $\nu_j$, and batch-specific technical
dispersion $\theta$.
It is this version of
the model that we focus on here, and that we recommend for most
users. Previous versions of the model are available within the package, but
are primarily useful for reproducibility purposes or for analysing
datasets that contain spike-in genes.
While several differential expression tools have been proposed for scRNA-seq
data (e.g. [@Kharchenko2014; @Finak2015]), some evidence suggests that
these do not generally outperform popular bulk RNA-seq tools [@Soneson2018].
Moreover, most of these methods are only designed to uncover changes in overall
expression, ignoring the more complex patterns that can arise at the single cell
level [@Lahnemann2020].
Instead, `r Biocpkg("BASiCS")` embraces the high granularity of scRNA-seq data,
uncovering changes in cell-to-cell expression variability that are not
confounded by differences in technical noise or in overall expression.
## Operation
This step-by-step scRNA-seq workflow is primarily based on the Bioconductor
package ecosystem [@Amezquita2019], and as such should run on any major
operating system using R $\geq$ 4.0, although results may differ slightly using
different Bioconductor and R versions.
A graphical overview is provided in Figure \@ref(fig:fig2-overview)
and its main components are described below.
```{r fig2-overview, out.width="0.7\\textwidth", fig.cap = "Graphical overview for the scRNA-seq analysis workflow described in this manuscript. Starting from a matrix of expression counts, we use the scater and scran Bioconductor packages to perform QC and initial exploratory analyses. To robustly quantify transcriptional heterogeneity within seemingly homogeneous cell populations, we apply the BASiCS Bioconductor package and illustrate how BASiCS can be used to analyse a single or multiple pre-specified groups of cells.", echo=FALSE}
knitr::include_graphics("figure/Workflow-Overview.pdf", auto_pdf = TRUE)
```
Here, we load all the libraries that will be used throughout this workflow.
First, we load Bioconductor libraries included in Figure \@ref(fig:fig2-overview).
We provide further explanation about their function in the following sections.
```{r}
library("SingleCellExperiment")
library("scater")
library("scran")
library("BASiCS")
```
Next, we load other general purpose libraries used for data visualisation ---
e.g.,
`r CRANpkg("ggplot2")`), colour scales (e.g., `r CRANpkg("viridis")`,
and data manipulation (`r CRANpkg("tidyr")`).
```{r}
# Data visualisation
library("ggplot2")
library("ggpointdensity")
library("patchwork")
library("ComplexHeatmap")
# colour scales
library("viridis")
library("circlize")
library("RColorBrewer")
# data manipulation
library("tidyr")
```
## Input data
We use the Bioconductor package `r Biocpkg("SingleCellExperiment")` to convert
an input matrix of raw read counts (molecule counts for UMI-based protocols)
into a `SingleCellExperiment` object that can also store its associated
metadata, such as gene- and cell-specific information.
Moreover, when available, the same object can also store read counts for
spike-in molecules (see `help("altExp")`).
A major advantage of using `r Biocpkg("SingleCellExperiment")` is that it
enables interoperability across a large number of Bioconductor packages [@Amezquita2019].
## QC and exploratory data analysis
A critical step in scRNA-seq analyses is QC, removing low quality samples that
may distort downstream analyses.
The [*OSCA*](https://bioconductor.org/books/release/OSCA/) online book provides
an extensive overview on important aspects of how to perform QC of scRNA-seq
data, including exploratory analyses [@Amezquita2019].
Generally, we use QC diagnostics to identify and remove samples that correspond to
broken cells, that are empty, or that contain multiple cells [@Ilicic2016].
We also remove lowly expressed genes that represent less reliable
information.
We recommend the Bioconductor package `r Biocpkg("scater")` package [@McCarthy2017] to
calculate QC metrics for each cell (e.g. total read-count) and gene
(e.g. percentage of zeroes across all cells), respectively.
We also recommend the visualisation tools implemented in the `r Biocpkg("scater")` to
explore the input dataset and its associated QC diagnostic metrics.
To perform further exploratory data analysis, we recommend the Bioconductor package
`r Biocpkg("scran")` [@Lun2016].
The latter is used to perform *global scaling* normalisation, calculating
cell-specific scaling factors that capture global differences in read-counts
across cells (e.g. due to sequencing depth and PCR amplification)
[@Lun2016pooling]. Note that the use of `r Biocpkg("scran")` as a normalisation
tool is not a requirement for `r Biocpkg("BASiCS")` (which uses raw read-counts
as input). However, we recommend the use of `r Biocpkg("scran")` normalisation
as part of the exploratory data analysis. For example, normalised read-counts
can be used to visualise the data in a low-dimensional space (e.g. using
principal component analysis). This can help to explore the structure of the
data, e.g. to examine potential batch effects.
# Use cases
## Case study: analysis of presomitic and somitic mesoderm cells {#SM-analysis}
The development of droplet-based scRNA-seq [@Klein2015;@Macosko2015]
lead to a large increase in the number of cells that can be profiled per
experiment at a fixed cost.
With this, large-scale scRNA-seq datasets have been generated to study
development across multiple time-points and capturing multiple tissues
[@Ibarra-Soria2018;@Kernfeld2018].
To illustrate the use of `r Biocpkg("BASiCS")` in this context, we use
droplet-based scRNAseq generated using the 10X microfluidics platform. An
example using a dataset generated using a plate-based protocol including
spike-ins is provided in the `AnalysisCD4T` document of the
Zenodo repository for this manuscript (see [Data availability](#data-availability)).
In particular, we focus on the data generated by @Ibarra-Soria2018 which
contains expression profiles for 20,819 cells collected from E8.25 mouse
embryos. These data comprise a highly heterogeneous population of cells with
over 20 major cell types. The data is stored under the accession number
E-MTAB-6153 on ArrayExpress and can be obtained from
[here](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-6153#).
The authors also provide additional information summarising the experimental
design (animal of source for each cell) and the cluster labels generated by
in their analysis. The code required to download these data is provided in
the `DataPreparationBASiCSWorkflow` document of
the Zenodo repository for this manuscript (see [Data availability](#data-availability)).
As explained above, when analysing a heterogeneous population of cells, a
clustering analysis is required to identify the cell types that drive this
heterogeneity prior to running `r Biocpkg("BASiCS")`. `r Biocpkg("BASiCS")` can
then be used to quantify transcriptional variability within each cell type, as
well as to characterise differences between cell types. Here, we use the cluster
labels generated in the original publication. For illustration purposes, we
focus on analysing presomitic and somitic mesoderm cells. This is an arbitrary
choice and the steps described in this workflow could be applied to any other
pair of cell types.
### Preparing the input data for BASiCS
While quality control and pre-processing is an important topic in scRNAseq
analysis, we feel that it has been covered in detail in other articles.
For this analysis, we have performed quality control, pre-processing and
exploratory data analysis in the `DataPreparationBASiCSWorkflow` document of
the Zenodo repository for this manuscript (see [Data availability](#data-availability)).
The code presented there reads the unprocessed raw data and prepares it for
use in this workflow. Here, we load the pre-processed data.
This consists of two separate
`SingleCellExperiment` data objects: one for presomitic and one for
somitic mesoderm cells.
```{r SCE-load}
# Website where the files are located
files_website <- "https://zenodo.org/record/10251224/files/"
# To avoid timeout issues as we are downloading large files
options(timeout = 1000)
# File download
## The code below uses `file.exists` to check if files were previously downloaded
## After download, files are then stored in an `rds` sub-folder
if (!file.exists("rds/sce_sm.Rds")) {
download.file(
paste0(files_website, "sce_sm.Rds"),
destfile = "rds/sce_sm.Rds"
)
}
if (!file.exists("rds/sce_psm.Rds")) {
download.file(
paste0(files_website, "sce_psm.Rds"),
destfile = "rds/sce_psm.Rds"
)
}
# Pre-somitic mesoderm cells
sce_psm <- readRDS("rds/sce_psm.Rds")
# Somitic mesoderm cells
sce_sm <- readRDS("rds/sce_sm.Rds")
```
In brief, we have filtered the original data to select presomitic mesoderm (PSM)
and somitic mesoderm (SM) cells, excluding a small cluster of cells which the
authors identified a potential doublets (where multiple cells are captured in
the same droplet). We also exclude cells with more 25,000 UMI-counts. The
processed dataset contains `r ncol(sce_psm)` pre-somitic and `r ncol(sce_sm)`
somitic mesoderm cells.
Subsequently, we applied a gene filtering step to remove lowly expressed genes
that were largely undetected through sequencing. In particular, we removed genes
that are not detected in at least 20 cells across all cells. This is to ensure a
reliable estimate of variability, and is roughly in line with the sample size
requirements for the negative binomial distribution outlined in @Lloyd-Smith2007.
We also applied a filter to consider protein-coding genes only. After applying
these filters `r nrow(sce_psm)` will be included in our analysis.
Since droplet-based scRNA-seq data are generated without including technical
spike-in genes, BASiCS uses measurement error models to quantify technical
variation through replication [@Carroll1998].
The different experimental batches used by @Ibarra-Soria2018 will be used for
this purpose. In the case of the somitic and pre-somoitic mesoderm cells,
two mouse embryos have been used to generate the data.
Cells isolated from the first embryo were split into two batches and processed
independently. Cells from the second embryo were processed together as a single
batch. These three different groups of cells are used as batches. Numbers of
cells per batch and cell type are displayed below.
```{r show-batches}
# Pre-somitic mesoderm cells
table(sce_psm$sample)
# Somitic mesoderm cells
table(sce_sm$sample)
```
To enable BASiCS to use this information, this should be stored as `BatchInfo`
within the cell-level metadata.
```{r BASiCS-droplet-data}
# Pre-somitic mesoderm cells
colData(sce_psm)$BatchInfo <- colData(sce_psm)$sample
# Somitic mesoderm cells
colData(sce_sm)$BatchInfo <- colData(sce_sm)$sample
```
### Parameter estimation using BASiCS
Parameter estimation is implemented in the `BASiCS_MCMC` function using an
adaptive Metropolis within Gibbs algorithm (see section 3 in [@Roberts2009]).
The primary inputs for `BASiCS_MCMC` correspond to:
- `Data`: a `SingleCellExperiment` object created as described in the
previous sections.
- `N`: the total number of MCMC iterations.
- `Thin`: thining period for output storage
(only the `Thin`-th MCMC draw is stored).
- `Burn`: the initial number of MCMC iterations to be discarded.
- `Regression`: if `TRUE` a joint prior is assigned to $\mu_i$ and $\delta_i$
[@Eling2018], and residual over-dispersion values $\epsilon_i$ are inferred.
Alternatively, independent log-normal priors are assigned to $\mu_i$ and
$\delta_i$ [@Vallejos2016].
- `WithSpikes`: if `TRUE` information from spike-in molecules is used to aid
data normalisation and to quantify technical noise.
- `PriorParam`: Defines the prior hyper-parameters to be used by `r Biocpkg("BASiCS")`.
As a default, we recommend to use `Regression = TRUE`, as the joint prior
introduced by [@Eling2018] leads to more stable estimation,
particularly for small sample sizes and lowly expressed genes.
This approach also enables users to obtain a measure of
transcriptional variability that is not confounded by mean expression.
Advanced users may use the `BASiCS_PriorParam` function
to adjust prior hyperparameters. If `PriorMu = "EmpiricalBayes"`, $\mu_i$'s are assigned a
log-normal prior with gene-specific location hyper-parameters defined via an
empirical Bayes framework. Alternatively, if `PriorMu = "default"`, location
hyper-parameters are set to be equal 0 for all genes. We recommend that users
use the defaults for `PriorParam` for most applications.
We recommend to use `PriorMu = "EmpiricalBayes"` as we have observed
that an empirical Bayes framework [@Casella1985]
improves estimation performance for sparser datasets.
Extra parameters can be used to store the output
(`StoreChains`, `StoreDir`, `RunName`) and to monitor the progress of the
algorithm (`PrintProgress`).
Here, we run the MCMC sampler separately for somitic and pre-somitic mesoderm
cells. We use 30,000 iterations (`N`), discarding the initial 15,000 iterations
as burn-in, (`Burn`), and saving parameter values only once in each 15
iterations (`Thin`).
We recommend this setting for large datasets, as it generally produces reliable
and reproducible results. However, specific datasets may require a larger number
of iterations to achieve convergence. Practical guidance about MCMC convergence
diagnostics is provided in the next section.
```{r MCMC-run, eval = FALSE}
## MCMC results may vary slightly due to pseudorandom number generation.
## We fix a random seed for exact reproducibility,
## but this is not strictly required
set.seed(42)
chain_sm <- BASiCS_MCMC(
Data = sce_sm,
N = 30000,
Thin = 15,
Burn = 15000,
Regression = TRUE,
WithSpikes = FALSE,
PriorParam = BASiCS_PriorParam(sce_sm, PriorMu = "EmpiricalBayes"),
Threads = 4,
StoreChains = TRUE,
StoreDir = "rds/",
RunName = "SM"
)
set.seed(43)
chain_psm <- BASiCS_MCMC(
Data = sce_psm,
N = 30000,
Thin = 15,
Burn = 15000,
Regression = TRUE,
WithSpikes = FALSE,
PriorParam = BASiCS_PriorParam(sce_psm, PriorMu = "EmpiricalBayes"),
Threads = 4,
StoreChains = TRUE,
StoreDir = "rds/",
RunName = "PSM"
)
```
This first of these chains takes 140 minutes to complete on a 3.4 GHz Intel
Core i7 4770k procesor with 32GB RAM, while the second takes 159 minutes.
For convenience, these can be obtained online at
[https://doi.org/10.5281/zenodo.10251224](https://doi.org/10.5281/zenodo.10251224).
```{r download-chain-SM}
# Website were the files are located
chains_website <- "https://zenodo.org/record/10251224/files/"
# To avoid timeout issues as we are downloading large files
options(timeout = 1000)
# File download
## The code below uses `file.exists` to check if files were previously downloaded
## After download, files are then stored in an `rds` sub-folder
if (!file.exists("rds/chain_sm.Rds")) {
download.file(
paste0(chains_website, "chain_sm.Rds"),
destfile = "rds/chain_sm.Rds"
)
}
if (!file.exists("rds/chain_psm.Rds")) {
download.file(
paste0(chains_website, "chain_psm.Rds"),
destfile = "rds/chain_psm.Rds"
)
}
## Loads files after download
chain_sm <- readRDS("rds/chain_sm.Rds")
chain_psm <- readRDS("rds/chain_psm.Rds")
```
The output from `BASiCS_MCMC` is a `BASiCS_Chain` object that contains the
draws associated to all model parameters. Given that
`(N - Burn) / Thin = (30,000 - 15,000) / 15 = 1000`
the object contains 1,000 draws for each parameter.
The matrices of MCMC draws can be accessed using the `displayChainBASiCS`
function --- this
may be useful for estimating and visualising credible intervals using
packages like `r CRANpkg("bayesplot")` or `r CRANpkg("tidybayes")`.
For example, the following code displays the first 2 MCMC draws for mean
expression parameters $\mu_i$ associated to the first 3 genes.
```{r displayChainBASiCS}
displayChainBASiCS(chain_sm, Param = "mu")[1:2, 1:3]
```
### MCMC convergence diagnostics
Before interpreting the estimates generated by `r Biocpkg("BASiCS")`,
it is important to check the performance of the MCMC algorithm used.
This algorithm used to characterise the posterior distribution is stochastic
by nature, and as such its performance may vary even when used on the same
dataset using the same settings. We perform quality control of the MCMC
algorithm in order to ensure that the results we observe are the result of
properties of the underlying data, rather than being an artefact of the
stochastic algorithm used to characterise the data.
As part of performing quality control of MCMC performance,
it is critical to assess the convergence of the MCMC algorithm, i.e. whether
the MCMC reached its stationary distribution.
If convergence has been achieved, the trace for each parameter should not evolve
significantly across iterations, as MCMC draws are expected to be stochastic
fluctuations around a horizontal trend once the sampler has converged to its
stationary distribution.
It is not possible to prove convergence, but multiple graphical and
quantitative convergence diagnostics have been proposed to assess the lack
of convergence (e.g. [@CowlesCarlin1996; @BrooksGelman1998]).
Some advocate the use of multiple MCMC chains using different starting values in
order to ensure that the algorithm consistently converges to the same
distribution and to avoid convergence to local modes.
For `r Biocpkg("BASiCS")`, we have observed that using informed starting values
(e.g. based on `r Biocpkg("scran")` normalisation factors) and a sufficiently
large value for `N` and `Burn` generally leads to largely consistent estimates
across multiple MCMC runs.
Hence, the focus of this section is to evaluate quantitative measures of
convergence (e.g. [@Geweke1995]) based on a single MCMC chain.
For illustration purposes, here we focus on the chains generated for the
somatic mesoderm cells. However, it is important to analyse convergence for
all MCMC chains; we have omitted this analysis for pre-somitic
mesoderm cells in the present manuscript for brevity, but they can be viewed
in the data preparation document (`DataPreparationBASiCSWorkflow`)
in the Zenodo repository for
this manuscript (see [Data availability](#data-availability)).
Traceplots can be used to visually assess the history of MCMC iterations
for a specific parameter (e.g. Figure \@ref(fig:fig3-convergence-SM), left panel).
As mentioned above, significant departures from a horizontal trend suggest a
lack of convergence.
As illustrated in Figure \@ref(fig:fig3-convergence-SM), histograms can also be
used to display the marginal distribution for each parameter.
For `r Biocpkg("BASiCS")`, users should expect these to follow a unimodal
distribution. Failure to satisfy these graphical diagnostics suggest that `N`
and `Burn` must be increased.
Alternatively, more stringent quality control could be applied to the input
data, as we have observed that genes with very low counts often suffer from
slow convergence.
```{r fig3-convergence-SM, fig.cap="Trace plot, marginal histogram, and autocorrelation function of the posterior distribution of the mean expression parameter for a gene in somitic mesoderm cells following MCMC sampling. Trace plots should explore the posterior without getting stuck in one location or drifting over time towards a region of higher density. High autocorrelation indicates that the number of effective independent samples is low. It is good practice to perform these visualisation for many different parameters; here we only show one."}
plot(chain_sm, Param = "mu", Gene = 1)
```
As `r Biocpkg("BASiCS")` infers thousands of parameters, it is
impractical to assess these diagnostics separately for each parameter.
Thus, it is helpful to use numerical diagnostics that can be applied to multiple
parameters simultaneously.
The function `BASiCS_DiagPlot()` can be used to visualise such diagnostic metrics.
Here, we illustrate usage for two such metrics focusing on the MCMC
chain that was obtained for the somitic mesoderm cells (similar results were
obtained for pre-somitic mesoderm cells in the
`DataPreparationBASiCSWorkflow` document of
the Zenodo repository (see [Data availability](#data-availability)).
First, we focus on the diagnostic criterion proposed by Geweke [@Geweke1995].
The latter compares the average of draws obtained during the initial (10\% after
burn in, by default) and the final part of the chain (50\% by default)
by calculating $Z$-scores of the relative difference between the two sets
of draws.
Large absolute $Z$-scores suggest that the algorithm has not converged (as a
rule of thumb, a threshold at $|Z| < 3$ is often applied).
For the somitic and pre-somitic mesoderm datasets, most Z-scores associated to mean
expression parameters $\mu_i$ were small in absolute value
(see Figure \@ref(fig:fig4-mcmc-diag)A), suggesting that the algorithm has
largely converged.
As well as assessing MCMC convergence, it is important to ensure that the MCMC
algorithm has efficiently explored the parameter space.
For example, the autocorrelation function (e.g. Figure
\@ref(fig:fig3-convergence-SM), right panel) quantifies the
correlation between the chain and its lagged versions.
Strong autocorrelation indicates that neighbouring MCMC samples are highly
dependent and suggest poor sampling efficiency.
The latter may indicate that the MCMC draws do not contain
sufficient information to produce accurate posterior estimates.
In other words, highly correlated MCMC samplers require more samples to produce
the same level of Monte Carlo error for an estimate
(defined as the variance of a Monte Carlo estimate across repetitions [@Koehler2009]).
The effective sample size (ESS) is a related measure which represents a proxy
for the number of independent draws generated by the MCMC sampler [@Gelman2014].
The latter is defined as:
$$
\mbox{ESS} = \frac{N_{tot}}{1 + 2\sum_{k=1}^\infty \rho(k)},
$$
where $N_{tot}$ represents the total number of MCMC draws (after burn-in and
thining) and $\rho(k)$ is the autocorrelation at lag $k$.
ESS estimates associated to mean expression parameters for the somitic mesoderm
cells are displayed in Figure \@ref(fig:fig4-mcmc-diag)B.
Whilst ESS is around 1,000 ($N_{tot}$ in this case) for most genes, we observe
low ESS values for a small proportion of genes.
As described later in this manuscript, `BASiCS_TestDE` automatically excludes
genes with low ESS during differential expression testing (by default
a threshold at ESS $< 100$ is applied). However, if a large number of genes have
large Geweke diagnostic values or low effective sample sizes in a certain
dataset, then caution should be applied when interpreting the results of the
model. These issues can often be addressed by more stringent
filtering of genes and cells before performing inference.
```{r fig4-mcmc-diag, fig.cap="Markov chain Monte Carlo diagnostics for gene-specific mean expression parameters; somitic mesoderm cells. A: Geweke Z-score for mean expression parameters is plotted against mean expression estimates. Dashed lines represent absolute Z-scores of 3, outside of which we advise caution when interpreting results. B: Effective sample size (ESS) is plotted against mean expression estimates. A dashed line shows a threshold of 100, below which we advise caution when interpreting results."}
diag_p1 <- BASiCS_DiagPlot(chain_sm, Param = "mu", Measure = "geweke") +
theme(legend.position = "bottom")
diag_p2 <- BASiCS_DiagPlot(chain_sm, Param = "mu", Measure = "ess") +
theme(legend.position = "bottom")
diag_p1 + diag_p2 + plot_annotation(tag_levels = "A")
```
### Quantifying transcriptional variability using BASiCS
Studying gene-level transcriptional variability can provide insights about the
regulation of gene expression, and how it relates to the properties of genomic
features (e.g. CpG island composition [@Morgan2018]), transcriptional dynamics
[@Antolovic2017] and aging [@Martinez-jimenez2017], among others.
The squared coefficient of variation (CV^2^) is widely used as a proxy for
transcriptional variability.
For example, we can obtain CV^2^ estimates for each gene using
`r Biocpkg("scran")` normalised counts as input.
In contrast, `r Biocpkg("BASiCS")` infers transcriptional variability using
gene-specific over-dispersion parameters $\delta_i$ (see *Methods*).
Figure \@ref(fig:fig5-dispersion-vs-mean) compares these approaches, focusing on
somitic mesoderm cells (repeating this analysis for pre-somitic mesoderm cells
led to similar results).
For each of the panels in Figure \@ref(fig:fig5-dispersion-vs-mean),
we use the R package `r CRANpkg("ggpointdensity")` to visualise
the local density of genes along the axes.
Note that the `BASiCS_ShowFit`
function can be used to generate Figure \@ref(fig:fig5-dispersion-vs-mean)C, but we
generated the plot manually to demonstrate how users can extract this
information from a `BASiCS_MCMC` object, and for visual consistency with the
other panels.
```{r fig5-dispersion-vs-mean, fig.width=6, fig.height=6, fig.cap="Comparison of gene-specific transcriptional variability estimates and mean expression estimates obtained for each gene using BASiCS and scran. For this analysis, we exclude genes that are not expressed in at least 2 cells. BASiCS estimates for each gene are defined by the posterior median of the associated parameter. scran estimates for each gene are derived after applying the pooling normalisation strategy proposed by Lun et al. Points are coloured according to the local density of genes along the x- and y- axis. A: scran squared CV estimates versus BASiCS estimates for over-dispersion parameters. B: scran estimates for mean expression and the squared CV. C: BASiCS estimates for mean expression and over-dispersion parameters. D: BASiCS estimates for residual over-dispersion parameters versus DM values estimated by scran. E: scran estimates for mean expression and DM values. F: BASiCS estimates for mean expression and residual over-dispersion parameters. Dashed red lines in panels A and D represent the line given by x=y. "}
## Get BASiCS point estimates for mean and variability - SM cells
summary_sm <- Summary(chain_sm)
parameter_df <- data.frame(
mean = displaySummaryBASiCS(summary_sm, Param = "mu")[, 1],
overdisp = displaySummaryBASiCS(summary_sm, Param = "delta")[, 1],
residualoverdisp = displaySummaryBASiCS(summary_sm, Param = "epsilon")[, 1]
)
## Calculate scran size factors - SM cells
sce_sm <- computeSumFactors(sce_sm)
## Get scran estimates for mean and variability - SM cells
sce_sm <- logNormCounts(sce_sm, log = FALSE)
parameter_df$mean_scran <- rowMeans(assay(sce_sm, "normcounts"))
parameter_df$cv2_scran <- rowVars(assay(sce_sm, "normcounts")) /
parameter_df$mean_scran^2
parameter_df$DM <- DM(
mean = parameter_df$mean_scran,
cv2 = parameter_df$cv2_scran
)
## Remove genes without counts in > 2 cells - BASiCS estimates not provided
sel_genes <- !is.na(parameter_df$residualoverdisp)
## The plots below are generated using ggplot2
## We use `plot_params` to specify the format of
## such plots whilst avoiding code duplication
## The plots can be further customized using ggplot2 principles
## For example, we modify x- and y-axis limits in some of the plots below
plot_params <- list(
geom_pointdensity(size = 0.6, na.rm = TRUE),
scale_colour_viridis(name = "Density"),
theme(
text = element_text(size = rel(3)),
legend.position = "bottom",
legend.text = element_text(angle = 90, size = 8, hjust = 0.5, vjust = 0.5),
legend.key.size = unit(0.018, "npc")
)
)
g1 <- ggplot(parameter_df[sel_genes,], aes(log10(cv2_scran), log10(overdisp))) +
xlim(-3, 3) + ylim(-3, 3) +
geom_abline(
slope = 1,
intercept = 0,
colour = "firebrick",
linetype = "dashed"
)
g2 <- ggplot(parameter_df[sel_genes,]) +
aes(log10(mean_scran), log10(cv2_scran)) +
xlim(-3, 2.2) + ylim(-3, 3)
g3 <- ggplot(parameter_df[sel_genes,], aes(log10(mean), log10(overdisp))) +
xlim(-3, 2.2) + ylim(-3, 3)
g4 <- ggplot(parameter_df[sel_genes,], aes(DM, residualoverdisp)) +
geom_abline(
slope = 1,
intercept = 0,
colour = "firebrick",
linetype = "dashed"
)
g5 <- ggplot(parameter_df[sel_genes,], aes(log10(mean_scran), DM))
g6 <- ggplot(parameter_df[sel_genes,], aes(log10(mean), residualoverdisp))
((g1 + g2 + g3) * plot_params) / ((g4 + g5 + g6) * plot_params) +
plot_annotation(tag_levels = "A") & theme(plot.tag = element_text(size = 15))
```
As seen in Figure \@ref(fig:fig5-dispersion-vs-mean)A, CV^2^ and posterior estimates
for $\delta_i$ are highly correlated.
Moreover, both variability metrics are confounded by differences in mean
expression, i.e. highly expressed genes tend to exhibit lower variability
(Figures \@ref(fig:fig5-dispersion-vs-mean)B-C).
We note, however, that `r Biocpkg("scran")` infers a more narrow distribution of
transcriptional variability estimates for any given mean expression level.
To remove the confounding observed in Figures \@ref(fig:fig5-dispersion-vs-mean)B-C,
`r Biocpkg("scran")` and `r Biocpkg("BASiCS")` derive *residual variability*
estimates as deviations with respect to an global mean-variability trend
(see *Methods*).
These are derived using the DM approach [@Kolodziejczyk2015cell] and the
residual over-dispersion parameters $\epsilon_i$ defined by [@Eling2018],
respectively.
For the somitic mesoderm cell data, both approaches led to correlated
estimates (Figure \@ref(fig:fig5-dispersion-vs-mean)D, Pearson's correlation equal
to `r round(with(parameter_df, cor(DM, residualoverdisp, use="complete.obs")), digits=2)`) but the residual
variability estimates generated for these data are less consistent than what we
have observed in the context of less sparse data (see the `AnalysisCD4T`
document in the Zenodo repository for this manuscript (see [Data availability](#data-availability)).
For the majority of genes
(`r with(parameter_df[sel_genes,], sum(sign(DM) == sign(residualoverdisp)))`
out of `r sum(sel_genes)`
captured in at least 3
cells) both approaches led some broadly similar results, inferring the same
sign for their residual variability estimates.
Finally, as expected, we observe that
neither DM values nor posterior estimates for $\epsilon_i$ are
associated with mean expression (Figure \@ref(fig:fig5-dispersion-vs-mean)E-F).
However, unlike the DM method, the integrated approach implemented in
`r Biocpkg("BASiCS")` provides a direct measure of statistical uncertainty for
these estimates via posterior variance.
### HVG/LVG detection using BASiCS
In `r Biocpkg("BASiCS")`, the functions `BASiCS_DetectHVG` and
`BASiCS_DetectLVG` can be used to identify genes with substantially high (HVG)
or low (LVG) transcriptional variability within a population of cells.
If the input `BASiCS_Chain` object was generated by `BASiCS_MCMC` with
`Regression = TRUE` (recommended setting), this analysis is based on the
posterior distribution obtained for gene-specific residual over-dispersion
parameters $\epsilon_i$ (alternatively, the approach introduced by
[@Vallejos2015] can be used).
HVGs are marked as those for which $\epsilon_i$ exceeds a pre-defined threshold
with high probability, where the probability cut-off is chosen to match a given
expected false discovery rate (EFDR; by default the target EFDR is set to 10\%)
[@Newton2004].
The expected false discovery rate we use is conceptually similar to false
discovery rate control procedures in frequentist statistics, such
as the approach of @Benjamini1995, aiming to control the proportion of false
discoveries produced by the procedure.
A similar approach is implemented for LVG detection, but based
on whether $\epsilon_i$ is below a pre-specified threshold `EpsilonThreshold`.
For example, if the threshold for $\epsilon_i$ is equal to $\log(2)$
(the default), HVGs would
be those genes for which the over-dispersion is estimated to be at
least $2$ times higher than would be expected given the inferred mean
expression level, while LVGs would be those genes for which the residual
over-dispersion is at least $2$ time lower than would be expected on the same
basis. In some circumstances, it may be of interest to rank genes and
to select the those with the highest or the lowest residual over-dispersion,
which can be performed using the `PercentileThreshold` parameter; see
`help("BASiCS_DetectVG")` for more details.
```{r det-vg-SM, message=TRUE, fig.cap="HVG and LVG detection using BASiCS. For each gene, BASiCS posterior estimates (posterior medians) associated to mean expression and residual over-dispersion parameters are plotted. Genes are coloured according to HVG/LVG status. Genes that are not expressed in at least 2 cells are excluded."}
## Highly variable genes
hvg <- BASiCS_DetectHVG(chain_sm, EpsilonThreshold = log(2))
## Lowly variable genes
lvg <- BASiCS_DetectLVG(chain_sm, EpsilonThreshold = -log(2))
```
For subsequent visualisation and data exploration, here we merge the output
tables generated above and store these as part of the gene-level metadata
in the `sce_sm` object.
```{r metadata-VG-SM}
## Merge HVG and LVG tables in a single data frame
vg_table <- merge(
as.data.frame(lvg, Filter = FALSE),
as.data.frame(hvg, Filter = FALSE),
by = c("GeneName", "GeneIndex", "Mu", "Delta", "Epsilon"),
suffixes = c("LVG", "HVG")
)
## Mark genes as highly variable, lowly variable, or not either.
vg_table$VG <- "Not HVG or LVG"
vg_table$VG[vg_table$HVG] <- "HVG"
vg_table$VG[vg_table$LVG] <- "LVG"
## Store as gene-level metadata
rowData(sce_sm) <- merge(
rowData(sce_sm), vg_table,
by.x = "ensembl_gene_id", by.y = "GeneName",
sort = FALSE
)
```
Performing HVG and LVG detection involves identifying a posterior probability
threshold that matches a target EFDR. We perform this choice using a
grid search in which EFDR is calculated for a range of different probability
thresholds between 0.5 and 1. As seen in Figure \@ref(fig:fig6-efdr-plot-vg-SM),
for the somitic mesoderm cells, this leads to a threshold of
`r hvg@ProbThreshold` and `r lvg@ProbThreshold` for HVG and LVG detection,
respectively (the value of these thresholds can be extracted from the
`ProbThreshold` slot in the `hvg` and `lvg` objects, e.g. using
`hvg@ProbThreshold`).
For some datasets, the grid search may fail to identify a probability threshold
that matches the target EFDR. In such cases, the default minimum probability
threshold of $\frac{2}{3}$ is used.
This default threshold value can be changed using the `ProbThreshold` to
`BASiCS_DetectHVG` and `BASiCS_DetectLVG`.
```{r fig6-efdr-plot-vg-SM, fig.cap = "Lines representing the EFDR and EFNR for a range of posterior probability thresholds for the HVG (A) and LVG (B) detection tasks. The target EFDR is indicated by dashed horizontal lines, while the chosen posterior probability threshold is indicated by a vertical dashed line."}
p1 <- BASiCS_PlotVG(hvg, Plot = "Grid")
p2 <- BASiCS_PlotVG(lvg, Plot = "Grid")
p1 + p2 +
plot_annotation(tag_levels = "A") +
plot_layout(guides = "collect") &
theme(legend.position = "bottom")
```
```{r fig7-plot-vg-SM, fig.cap="HVG and LVG detection using BASiCS. For each gene, BASiCS posterior estimates (posterior medians) associated to mean expression and residual over-dispersion parameters are plotted. Genes are coloured according to HVG/LVG status. Genes that are not expressed in at least 2 cells are excluded."}
plotRowData(sce_sm, x = "Mu", y = "Epsilon", colour_by = "VG") +
scale_x_log10() +
geom_hline(yintercept = c(-log(2), 0, log(2)), lty = 2) +
labs(
x = "BASiCS means (log10)",
y = "BASiCS residual\nover-dispersion"
)
```
For the somitic mesoderm cells data, we obtained
*`r sum(hvg@Table$HVG, na.rm=TRUE)`* HVG and *`r sum(lvg@Table$LVG, na.rm=TRUE)`* LVG.
As shown in Figure \@ref(fig:fig7-plot-vg-SM), these genes are distributed across
a wide range of mean expression values.
As an illustration, Figure \@ref(fig:fig8-plot-example-vg-SM) shows the
distribution of normalised expression values for selected HVG and LVG, focusing
on examples with similar mean expression levels.
As expected, HVG tend to exhibit a wider and potentially bimodal
distribution (Figure \@ref(fig:fig8-plot-example-vg-SM)A).
Instead, LVG tend to have more narrow and unimodal distributions
(Figure \@ref(fig:fig8-plot-example-vg-SM)B).
```{r fig8-plot-example-vg-SM, fig.cap = "BASiCS denoised counts for example HVG (A) and LVG (B) with similar overall levels of expression."}