Duplicate Cell Barcodes Trigger Error if Multiple Batches of Data in Combined Data Set #16

@DarioS

There is only a finite number of 10x Genomics cell barcodes: 737280. When data is collected over many months in different sequencing batches, some cell barcodes will recur, because every batch draws its cell labels from the same set of 737280 barcodes. In short:

Each mapped read in a 10x Genomics Single Cell 3’ v2 Gene Expression Library can be annotated by four labels: (1) A sample barcode, (2) cell-barcode index, (3) Unique Molecular Identifier (UMI) (4) gene ID.
A 16 bp cell-barcode index is randomly selected out of a set containing 737280 possible combinations. In scRNA-seq data, a cell is identified by a unique cell barcode.

set.seed(2024)
January <- sample(737280, 5000) # Patient X
June <- sample(737280, 5000) # Patient Y
> intersect(January, June) # Some simulated barcodes appear twice.
 [1] 378471 685968 517615 550588 543582 255006  83184  47276 697772 584851  17973 577898
[13] 533573    123 384119 518639 591930 295070 238711 401381 171660 184026 210186 708855
[25] 599121 311327 220013 458140 515898 180450 640358 174120 301284 631054
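
The size of that overlap is about what simple arithmetic predicts. A minimal sketch, assuming each batch draws its barcodes uniformly at random and without replacement from the full set, as the simulation above does (the variable names are illustrative):

```r
# Assumption: both batches sample uniformly, without replacement, from the
# same set of barcodes, exactly as in the sample() calls above.
whitelist_size  <- 737280
cells_per_batch <- 5000

# Mean of a hypergeometric draw: each of the 5000 barcodes in the second batch
# has a 5000 / 737280 chance of also having been drawn in the first batch.
expected_shared <- cells_per_batch * cells_per_batch / whitelist_size
round(expected_shared, 1)
```

This comes to roughly 34 shared barcodes for a single pair of 5000-cell batches, so collisions are the expected case rather than bad luck, and every additional batch adds more.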

In a real data set, each barcode appears between one and four times.

> allHuman
class: SingleCellExperiment 
dim: 21711 111671

> range(table(colData(allHuman)$Barcode))
[1] 1 4

> table(table(colData(allHuman)$Barcode))
     1      2      3      4 
108977   1275     12     27

scClassify rejects a test matrix whose column names are NULL, which is what read10xCounts produces by default.

allHuman <- logNormCounts(allHuman)
SydneyLogCounts <- assay(allHuman, "logcounts")
> colnames(SydneyLogCounts)
NULL
> scClassify(referenceLogCounts, colData(aReference)[, "cellType"], SydneyLogCounts)
  Error in predict_scClassifySingle: colnames of exprsMat_test is NULL or not unique

Because the data is multi-batch, setting the matrix's column names to the barcodes also fails: the barcodes are not unique across batches.

> colnames(SydneyLogCounts) <- colData(allHuman)$Barcode
> head(colnames(SydneyLogCounts))
[1] "AAACCTGAGAAACCTA-1" "AAACCTGAGGACAGAA-1" "AAACCTGCAGACAAAT-1"
[4] "AAACCTGGTACCGAGA-1" "AAACCTGGTCGGGTCT-1" "AAACCTGGTCGTTGTA-1"
> scClassify(referenceLogCounts, colData(aReference)[, "cellType"], SydneyLogCounts)
  Error in predict_scClassifySingle: colnames of exprsMat_test is NULL or not unique
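
The error message does not say which of the two conditions failed, so it can help to check for repeats directly before calling scClassify. A minimal sketch in base R, using a made-up toy vector in place of colData(allHuman)$Barcode:

```r
# Toy stand-in for colData(allHuman)$Barcode; these values are made up.
barcodes <- c("AAAC-1", "AAAG-1", "AAAC-1", "AAAT-1", "AAAG-1")

# anyDuplicated() returns 0 when every value is unique, otherwise the index
# of the first repeated value.
has_duplicates <- anyDuplicated(barcodes) > 0

# List each barcode that occurs more than once.
repeated <- unique(barcodes[duplicated(barcodes)])
```

Here has_duplicates is TRUE and repeated holds the two recurring barcodes, which is the situation that triggers the error above.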

The non-obvious workaround is to prepend the patient's sample ID to each cell barcode to ensure uniqueness.

> colnames(SydneyLogCounts) <- paste(colData(allHuman)$Sample, colData(allHuman)$Barcode, sep = '_')
> head(colnames(SydneyLogCounts))
[1] "DANFRO_CUL1HNP3_AAACCTGAGAAACCTA-1" "DANFRO_CUL1HNP3_AAACCTGAGGACAGAA-1"
[3] "DANFRO_CUL1HNP3_AAACCTGCAGACAAAT-1" "DANFRO_CUL1HNP3_AAACCTGGTACCGAGA-1"
[5] "DANFRO_CUL1HNP3_AAACCTGGTCGGGTCT-1" "DANFRO_CUL1HNP3_AAACCTGGTCGTTGTA-1"
> predicts <- scClassify(referenceLogCounts, colData(aReference)[, "cellType"], SydneyLogCounts) # Success
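
The same workaround can be wrapped in a small function that also verifies the result. This is only a sketch: makeCellNames is a hypothetical helper, not part of scClassify or DropletUtils, and the toy vectors stand in for colData(allHuman)$Sample and colData(allHuman)$Barcode.

```r
# Hypothetical helper (not provided by scClassify or DropletUtils): builds
# cell names from sample IDs and barcodes, and fails loudly if the combined
# names are still not unique.
makeCellNames <- function(sample, barcode, sep = "_") {
  stopifnot(length(sample) == length(barcode))
  cell_names <- paste(sample, barcode, sep = sep)
  if (anyDuplicated(cell_names) > 0) {
    stop("Sample and barcode together do not identify each cell uniquely.")
  }
  cell_names
}

# Toy data: the first barcode recurs in a second sample.
samples  <- c("PatientX", "PatientX", "PatientY")
barcodes <- c("AAACCTGAGAAACCTA-1", "AAACCTGAGGACAGAA-1", "AAACCTGAGAAACCTA-1")
cell_names <- makeCellNames(samples, barcodes)
```

With the objects from this issue, the equivalent call would be colnames(SydneyLogCounts) <- makeCellNames(colData(allHuman)$Sample, colData(allHuman)$Barcode).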

This could be much smoother for end-users.
