Description
There are only a finite number of 10x Genomics cell barcodes: 737280. When data are collected over many months in different sequencing batches, some cell barcodes will recur, because every batch labels its cells from the same set of 737280 barcodes. To summarise the background:
Each mapped read in a 10x Genomics Single Cell 3’ v2 Gene Expression Library can be annotated by four labels: (1) a sample barcode, (2) a cell-barcode index, (3) a Unique Molecular Identifier (UMI), and (4) a gene ID.
A 16 bp cell-barcode index is randomly selected out of a set containing 737280 possible combinations. In scRNA-seq data, a cell is identified by a unique cell barcode.
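A rough estimate of how many barcodes two batches should share by chance can be sketched as follows. It assumes each batch of 5000 cells draws its barcodes uniformly at random, without replacement, from the full set; the variable names are illustrative only.

```r
nBarcodes <- 737280  # size of the 10x barcode set quoted above
nCells    <- 5000    # assumed number of cells per batch

# Each barcode used in one batch has probability nCells / nBarcodes of also
# being drawn in the other batch, so the expected overlap is nCells^2 / nBarcodes.
expectedShared <- nCells^2 / nBarcodes
round(expectedShared, 1)  # about 33.9 shared barcodes
```

So even two modest batches are expected to share a few dozen barcodes.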
set.seed(2024)
January <- sample(737280, 5000) # Patient X
June <- sample(737280, 5000) # Patient Y
> intersect(January, June) # Some simulated barcodes appear twice.
[1] 378471 685968 517615 550588 543582 255006 83184 47276 697772 584851 17973 577898
[13] 533573 123 384119 518639 591930 295070 238711 401381 171660 184026 210186 708855
[25] 599121 311327 220013 458140 515898 180450 640358 174120 301284 631054
For a real data set, a barcode appears between one and four times.
> allHuman
class: SingleCellExperiment
dim: 21711 111671
> range(table(colData(allHuman)$Barcode))
1 4
> table(table(colData(allHuman)$Barcode))
1 2 3 4
108977 1275 12 27
scClassify doesn't accept a test matrix without column names, which is what read10xCounts produces by default.
allHuman <- logNormCounts(allHuman)
SydneyLogCounts <- assay(allHuman, "logcounts")
> colnames(SydneyLogCounts)
NULL
> scClassify(referenceLogCounts, colData(aReference)[, "cellType"], SydneyLogCounts)
Error in predict_scClassifySingle: colnames of exprsMat_test is NULL or not unique
Because this is multi-batch data, using the barcodes as the matrix column names also fails.
> colnames(SydneyLogCounts) <- colData(allHuman)$Barcode
> head(colnames(SydneyLogCounts))
[1] "AAACCTGAGAAACCTA-1" "AAACCTGAGGACAGAA-1" "AAACCTGCAGACAAAT-1"
[4] "AAACCTGGTACCGAGA-1" "AAACCTGGTCGGGTCT-1" "AAACCTGGTCGTTGTA-1"
> scClassify(referenceLogCounts, colData(aReference)[, "cellType"], SydneyLogCounts)
Error in predict_scClassifySingle: colnames of exprsMat_test is NULL or not unique
The non-obvious workaround is to paste the patient (sample) ID onto the cell barcode to ensure uniqueness.
> colnames(SydneyLogCounts) <- paste(colData(allHuman)$Sample, colData(allHuman)$Barcode, sep = '_')
> head(colnames(SydneyLogCounts))
[1] "DANFRO_CUL1HNP3_AAACCTGAGAAACCTA-1" "DANFRO_CUL1HNP3_AAACCTGAGGACAGAA-1"
[3] "DANFRO_CUL1HNP3_AAACCTGCAGACAAAT-1" "DANFRO_CUL1HNP3_AAACCTGGTACCGAGA-1"
[5] "DANFRO_CUL1HNP3_AAACCTGGTCGGGTCT-1" "DANFRO_CUL1HNP3_AAACCTGGTCGTTGTA-1"
> predicts <- scClassify(referenceLogCounts, colData(aReference)[, "cellType"], SydneyLogCounts) # Success
This could be much smoother for end-users.
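In the meantime, one possible way to package the workaround is a small helper that builds the column names and fails early if they are still not unique. This is only a sketch: `makeCellNames` is a hypothetical name, not a function in scClassify or DropletUtils, and it assumes the sample and barcode labels are available as plain vectors of equal length, as `colData(allHuman)$Sample` and `colData(allHuman)$Barcode` are above.

```r
# Hypothetical helper (not part of any package): combine sample and barcode
# labels into cell names and stop if the result is still not unique.
makeCellNames <- function(sampleId, barcode, sep = "_") {
  stopifnot(length(sampleId) == length(barcode))
  cellNames <- paste(sampleId, barcode, sep = sep)
  if (anyDuplicated(cellNames) > 0) {
    stop("Sample ID and barcode together do not identify cells uniquely.")
  }
  cellNames
}

# Toy input: the same barcode recurs in two samples.
sampleId <- c("patientX", "patientX", "patientY")
barcode  <- c("AAACCTGAGAAACCTA-1", "AAACCTGAGGACAGAA-1", "AAACCTGAGAAACCTA-1")

barcodesCollide <- anyDuplicated(barcode) > 0    # TRUE: barcodes alone repeat
cellNames <- makeCellNames(sampleId, barcode)
namesUnique <- anyDuplicated(cellNames) == 0     # TRUE: combined names are unique
```

With the objects from this issue, the result could be assigned as `colnames(SydneyLogCounts) <- makeCellNames(colData(allHuman)$Sample, colData(allHuman)$Barcode)` before calling scClassify.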