Classification Error if Bioconductor Atlas Used as Reference

I am recreating an analysis done by researchers from Regeneron Pharmaceuticals published in [Immunostimulatory Cancer-Associated Fibroblast Subpopulations Can Predict Immunotherapy Response in Head and Neck Cancer](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9161438/), _Clinical Cancer Research_ but using scClassify instead of SingleR and on in-house data. They used Human Primary Cell Atlas and BLUEPRINT Consortium atlas, as provided by the Biocondcutor package [celldex](https://bioconductor.org/packages/release/data/experiment/html/celldex.html), developed by the same people who developed Bioconductor package [SingleR](https://bioconductor.org/packages/release/bioc/html/SingleR.html) (i.e. Aaron Lun et al.). These atlases are comprised of either purified cell types or cell lines. Human Primary Cell Atlas is developed using 745 microarrays from 105 separate studies that are freely available for download from N.C.B.I. Gene Expression Omnibus.

> A large and diverse set of human primary cell gene expression data was collected, with a particular emphasis on datasets that divided immune cells into sub-populations based upon surface markers.

```r
library(celldex)
HPCAset <- HumanPrimaryCellAtlasData() # Downloads and caches the atlas.
> table(colData(HPCAset)$label.main)
           Astrocyte               B_cell                   BM           BM & Prog. 
                   2                   26                    7                    1 
        Chondrocytes                  CMP                   DC Embryonic_stem_cells 
                   8                    2                   88                   17 
   Endothelial_cells     Epithelial_cells         Erythroblast          Fibroblasts 
                  64                   16                    8                   10
                                   ...                 ... 
```
The other atlas has less samples but is all from one project.

> 259 RNA-seq samples of pure stroma and immune cells as generated and supplied by BLUEPRINT and ENCODE.

```r
BLUEPRINTset <- BlueprintEncodeData()
> table(colData(BLUEPRINTset)$label.main)
       Adipocytes        Astrocytes           B-cells      CD4+ T-cells      CD8+ T-cells 
                9                 2                 8                14                 5 
     Chondrocytes                DC Endothelial cells       Eosinophils  Epithelial cells 
                2                 1                26                 1                18 
     Erythrocytes       Fibroblasts               HSC     Keratinocytes       Macrophages 
                7                20                38                 2                25
                                   ...                 ... 
```
Human Primary Cell Atlas is used as the example data set in [the SingleR vignette](https://bioconductor.org/packages/release/bioc/vignettes/SingleR/inst/doc/SingleR.html#2_Using_built-in_references) and classification works. However, [scClassify](https://bioconductor.org/packages/release/bioc/html/scClassify.html) fails with an error message if either one of these cell type atlases are used.
```r
library(Seurat)
headNeckCells <- readRDS("/dski/nobackup/biostat/datasets/singlecell/headneck/analysis/seurat_supervised_k10_18samples_0.01.rds")
RNAdata <- headNeckCells@assays$RNA@data

> HPCAclassify <- scClassify(assays(HPCAset)[["logcounts"]], colData(HPCAset)[["label.main"]], RNAdata)
Error in scClassify(assays(HPCAset)[["logcounts"]], colData(HPCAset)[["label.main"]],: 
  There is cell type with only one cell, please check cellTypes_train
> HPCAclassify <- scClassify(assays(BLUEPRINTset)[["logcounts"]], colData(BLUEPRINTset)[["label.main"]], RNAdata)
Error in scClassify(assays(BLUEPRINTset)[["logcounts"]], colData(BLUEPRINTset)[["label.main"]],: 
  There is cell type with only one cell, please check cellTypes_train
```
Could this error message be changed into a warning and the classification allowed to proceed? Or, is it important to prevent that? There are some cell types in the atlas with two samples belonging to them. For example, having one sample belonging to Eosinophils triggers the error, but removing singleton classes and even having Keratinocytes class with only two samples belonging to proceeds with classification. Then, another error / `message` is reached.

```r
classCounts <- table(colData(HPCAset)[["label.main"]])
keepSamples <- colData(HPCAset)[["label.main"]] %in% names(classCounts)[classCounts > 1]
> table(keepSamples) # Two samples, each belonging to one class, will be dropped from the atlas.
keepSamples
FALSE  TRUE
    2   711
HPCAset <- HPCAset[, keepSamples] # All classes now have at least two samples.
> HPCAclassify <- scClassify(assays(HPCAset)[["logcounts"]], colData(HPCAset)[["label.main"]], RNAdata)
   There are only 0 selected genes in reference data expressed in query data.
> nrow(HPCAset)
[1] 19363
> length(intersect(rownames(HPCAset), rownames(RNAdata))) # Must be an abundance issue instead of gene name matching.
[1] 17589
```

It seems that the selected genes probably have moderate rather than high abundance and are not detectable by the poor limit-of-detection of single-cell sequencing. May scClassify output the list of selected genes in the error message to allow manual inspection? Also, could scClassify automatically select only useable genes and avoid the unuseable ones during feature selection to prevent ever selecting zero expressed genes?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Classification Error if Bioconductor Atlas Used as Reference #13

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Classification Error if Bioconductor Atlas Used as Reference #13

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions