Skip to content

Classification Error if Bioconductor Atlas Used as Reference #13

@DarioS

Description

@DarioS

I am recreating an analysis done by researchers from Regeneron Pharmaceuticals published in Immunostimulatory Cancer-Associated Fibroblast Subpopulations Can Predict Immunotherapy Response in Head and Neck Cancer, Clinical Cancer Research but using scClassify instead of SingleR and on in-house data. They used Human Primary Cell Atlas and BLUEPRINT Consortium atlas, as provided by the Biocondcutor package celldex, developed by the same people who developed Bioconductor package SingleR (i.e. Aaron Lun et al.). These atlases are comprised of either purified cell types or cell lines. Human Primary Cell Atlas is developed using 745 microarrays from 105 separate studies that are freely available for download from N.C.B.I. Gene Expression Omnibus.

A large and diverse set of human primary cell gene expression data was collected, with a particular emphasis on datasets that divided immune cells into sub-populations based upon surface markers.

library(celldex)
HPCAset <- HumanPrimaryCellAtlasData() # Downloads and caches the atlas.
> table(colData(HPCAset)$label.main)
           Astrocyte               B_cell                   BM           BM & Prog. 
                   2                   26                    7                    1 
        Chondrocytes                  CMP                   DC Embryonic_stem_cells 
                   8                    2                   88                   17 
   Endothelial_cells     Epithelial_cells         Erythroblast          Fibroblasts 
                  64                   16                    8                   10
                                   ...                 ... 

The other atlas has less samples but is all from one project.

259 RNA-seq samples of pure stroma and immune cells as generated and supplied by BLUEPRINT and ENCODE.

BLUEPRINTset <- BlueprintEncodeData()
> table(colData(BLUEPRINTset)$label.main)
       Adipocytes        Astrocytes           B-cells      CD4+ T-cells      CD8+ T-cells 
                9                 2                 8                14                 5 
     Chondrocytes                DC Endothelial cells       Eosinophils  Epithelial cells 
                2                 1                26                 1                18 
     Erythrocytes       Fibroblasts               HSC     Keratinocytes       Macrophages 
                7                20                38                 2                25
                                   ...                 ... 

Human Primary Cell Atlas is used as the example data set in the SingleR vignette and classification works. However, scClassify fails with an error message if either one of these cell type atlases are used.

library(Seurat)
headNeckCells <- readRDS("/dski/nobackup/biostat/datasets/singlecell/headneck/analysis/seurat_supervised_k10_18samples_0.01.rds")
RNAdata <- headNeckCells@assays$RNA@data

> HPCAclassify <- scClassify(assays(HPCAset)[["logcounts"]], colData(HPCAset)[["label.main"]], RNAdata)
Error in scClassify(assays(HPCAset)[["logcounts"]], colData(HPCAset)[["label.main"]],: 
  There is cell type with only one cell, please check cellTypes_train
> HPCAclassify <- scClassify(assays(BLUEPRINTset)[["logcounts"]], colData(BLUEPRINTset)[["label.main"]], RNAdata)
Error in scClassify(assays(BLUEPRINTset)[["logcounts"]], colData(BLUEPRINTset)[["label.main"]],: 
  There is cell type with only one cell, please check cellTypes_train

Could this error message be changed into a warning and the classification allowed to proceed? Or, is it important to prevent that? There are some cell types in the atlas with two samples belonging to them. For example, having one sample belonging to Eosinophils triggers the error, but removing singleton classes and even having Keratinocytes class with only two samples belonging to proceeds with classification. Then, another error / message is reached.

classCounts <- table(colData(HPCAset)[["label.main"]])
keepSamples <- colData(HPCAset)[["label.main"]] %in% names(classCounts)[classCounts > 1]
> table(keepSamples) # Two samples, each belonging to one class, will be dropped from the atlas.
keepSamples
FALSE  TRUE
    2   711
HPCAset <- HPCAset[, keepSamples] # All classes now have at least two samples.
> HPCAclassify <- scClassify(assays(HPCAset)[["logcounts"]], colData(HPCAset)[["label.main"]], RNAdata)
   There are only 0 selected genes in reference data expressed in query data.
> nrow(HPCAset)
[1] 19363
> length(intersect(rownames(HPCAset), rownames(RNAdata))) # Must be an abundance issue instead of gene name matching.
[1] 17589

It seems that the selected genes probably have moderate rather than high abundance and are not detectable by the poor limit-of-detection of single-cell sequencing. May scClassify output the list of selected genes in the error message to allow manual inspection? Also, could scClassify automatically select only useable genes and avoid the unuseable ones during feature selection to prevent ever selecting zero expressed genes?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions