diff --git a/.github/workflows/spell-check.yml b/.github/workflows/spell-check.yml index 11589d18..d21366e1 100644 --- a/.github/workflows/spell-check.yml +++ b/.github/workflows/spell-check.yml @@ -1,7 +1,6 @@ name: Spell check Markdown files # Controls when the action will run. -# Pull requests to master only. on: pull_request: branches: diff --git a/components/dictionary.txt b/components/dictionary.txt index 97935780..d2f6275e 100644 --- a/components/dictionary.txt +++ b/components/dictionary.txt @@ -19,6 +19,7 @@ cellhash cellhashing CELLxGENE CHANGELOG +CNV confounders CZI CZI's @@ -47,6 +48,7 @@ hashedDrops HDF hexamer HBC +Heimberg Hemberg HTO HTODemux @@ -54,6 +56,7 @@ HTOs Hippen HVG HVGs +inferCNV intronic introns isotype @@ -65,12 +68,16 @@ Lun Marioni miQC MLA +MuData +Neuroblastoma oligo oligos +OpenScPCA Pearson Prasad pre preprint +programmatically pseudocount pseudogenes Looney diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md index e6c1220d..4c3791be 100644 --- a/docs/CHANGELOG.md +++ b/docs/CHANGELOG.md @@ -12,6 +12,27 @@ For more information about `AlexsLemonade/scpca-nf` versions, please see [the re +## 2025.12.04 + +All data on the Portal has been updated to include a number of new features. + +* Doublet detection was run on all samples using [`scDblFinder`](https://bioconductor.org/packages/release/bioc/html/scDblFinder.html). +No doublets were filtered, but the results from `scDblFinder` are present in the filtered and processed objects. +* Updated cell type annotations + * All samples include cell type annotations obtained from [`SCimilarity`](https://genentech.github.io/scimilarity/), in addition to the existing cell type annotations from `SingleR` and `CellAssign`. + * Consensus cell types have been updated to incorporate `SCimilarity` results. + If two of the three automated methods agree using an ontology-based approach, a consensus cell type is assigned. 
+ * See our {ref}`documentation on cell type annotation` for more information on these updated cell types. +* Cell types were annotated as part of the ongoing [OpenScPCA project](https://openscpca.readthedocs.io) for `SCPCP000004` (Neuroblastoma) and `SCPCP000015` (Ewing sarcoma). +These cell types are now included in all objects for those samples. + * For more information, see our {ref}`documentation on OpenScPCA cell types`. +* CNV inference was performed using [`inferCNV`](https://github.com/broadinstitute/infercnv) on all samples with at least 100 non-malignant reference cells, as identified by the consensus cell types. + * See our {ref}`documentation on CNV inference ` for more information. + * For more information on where to find the `inferCNV` results in the downloaded objects, see {ref}`the single-cell gene expression file contents page` and {ref}`the merged object file contents page`. + +In addition to these new features, data from the Portal can now be downloaded programmatically using the new [`ScPCAr` package](https://alexslemonade.github.io/ScPCAr). +See an example in {ref}`our documentation `. + ## 2025.07.25 * Previously, the `cell_id` column in the cell metadata for merged objects was incorrectly formatted. diff --git a/docs/download_files.md b/docs/download_files.md index 9b61ac56..2eaea806 100644 --- a/docs/download_files.md +++ b/docs/download_files.md @@ -16,9 +16,9 @@ Note that multiplexed sample libraries are only available as `SingleCellExperime See the {ref}`FAQ section about samples and libraries ` for more information. 
The files shown below will be included with each library (example shown for a library with ID `SCPCL000000`): -- An unfiltered counts file: `SCPCL000000_unfiltered.rds` or `SCPCL00000_unfiltered_rna.h5ad`, -- A filtered counts file: `SCPCL000000_filtered.rds` or `SCPCL00000_filtered_rna.h5ad`, -- A processed counts file: `SCPCL000000_processed.rds` or `SCPCL00000_processed_rna.h5ad`, +- An unfiltered counts file: `SCPCL000000_unfiltered.rds` or `SCPCL000000_unfiltered_rna.h5ad`, +- A filtered counts file: `SCPCL000000_filtered.rds` or `SCPCL000000_filtered_rna.h5ad`, +- A processed counts file: `SCPCL000000_processed.rds` or `SCPCL000000_processed_rna.h5ad`, - A quality control report: `SCPCL000000_qc.html`, - A supplemental cell type report: `SCPCL000000_celltype-report.html` @@ -188,7 +188,7 @@ This file contains the raw and normalized counts data for cell barcodes that hav In addition to the counts matrices, the `SingleCellExperiment` or `AnnData` object stored in the file includes the results of dimensionality reduction using both principal component analysis (PCA) and UMAP. See {ref}`Single-cell gene expression file contents ` for more information about the contents of the `SingleCellExperiment` and `AnnData` objects and the included statistics and metadata. -See also {ref}`Using the provided RDS files in R ` and {ref}`Using the provided H5AD files in Python `. +See also {ref}`Getting started with an ScPCA dataset `. ## QC report @@ -325,3 +325,40 @@ The `SCPCP000000_bulk_quant.tsv` file contains a gene by sample matrix (each row The `SCPCP000000_bulk_metadata.tsv` file contains associated metadata for all samples with bulk RNA-seq data. This file will contain fields equivalent to those found in the `single-cell_metadata.tsv` related to processing the sample, but will not contain patient or disease specific metadata (e.g. `age`, `sex`, `diagnosis`, `subdiagnosis`, `tissue_location`, or `disease_timing`). See also {ref}`processing bulk RNA samples `. 
+ +## Programmatic downloads from the ScPCA Portal + +We provide an R package, [`ScPCAr`](https://alexslemonade.github.io/ScPCAr/), to facilitate programmatic access to the ScPCA Portal. +This package allows you to search for and download data from the ScPCA Portal directly within R. + +An example of basic usage of the `ScPCAr` package follows: + +```r +library(ScPCAr) + +# First, look at the terms of use +view_terms() + +# Get an authentication token for use with the ScPCA Portal +auth_token <- get_auth(email = "your.email@example.com", agree = TRUE) + +# Get the sample metadata for a project +sample_metadata <- get_sample_metadata(project_id = "SCPCP000001") + +# Download data for a sample +# this function returns a vector of the downloaded file paths +file_paths <- download_sample( + sample_id = "SCPCS000001", + auth_token = auth_token, + destination = "scpca_data", + format = "sce" +) + +# select the path to the processed SingleCellExperiment object (value = TRUE returns the path, not its index) and read it in +processed_data <- grep("_processed\\.rds$", file_paths, value = TRUE) +sce <- readRDS(processed_data) +``` + +Please see the [package documentation](https://alexslemonade.github.io/ScPCAr/) for more details about installation and usage. +Source code for the package can be found on [GitHub](https://github.com/AlexsLemonade/ScPCAr). +Information about working with downloaded files can be found in our {ref}`Getting started with an ScPCA dataset ` guide. diff --git a/docs/faq.md b/docs/faq.md index 3985d72e..963ad859 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -44,11 +44,11 @@ This download includes H5AD files that can be directly read into Python. _Note: You will need to install the [`AnnData` package](https://anndata.readthedocs.io/en/latest/index.html) to work with the provided files._ -To read in the H5AD files you can use the `readh5ad` function from the `AnnData` package. +To read in the H5AD files you can use the `read_h5ad` function from the `AnnData` package. 
```python import anndata -scpca_sample = anndata.readh5ad(file = "SCPCL000000_processed_rna.h5ad") +scpca_sample = anndata.read_h5ad(filename = "SCPCL000000_processed_rna.h5ad") ``` A full description of the contents of the `AnnData` object can be found in the section on {ref}`Components of an AnnData object `. @@ -69,6 +69,30 @@ There are two types of samples where `AnnData` objects are not available: Therefore, we do not currently provide any multiplexed libraries as `AnnData` objects. - In addition, providing multiplexed data in this form is not compliant with the standards for [CZI's CELLxGENE](https://cellxgene.cziscience.com), which we have tried to match as closely as possible. + +## What if I want to use MuData instead of AnnData objects? + +[`MuData` objects](https://mudata.readthedocs.io/en/latest/index.html) are Python objects built on top of `AnnData` objects that are specifically used to store multimodal data. +Currently, we provide RNA counts and ADT counts, if present, as separate `AnnData` objects in their own H5AD files, as described in {ref}`the file contents documentation`. +However, these objects can be combined into a `MuData` object if desired for a multimodal analysis. + +_Note: You will need to install the [`MuData` package](https://mudata.readthedocs.io/en/latest/index.html) to generate and work with `MuData` objects._ + +```python +import anndata +import mudata + +# Read individual AnnData files +rna_object = anndata.read_h5ad(filename = "SCPCL000000_processed_rna.h5ad") +adt_object = anndata.read_h5ad(filename = "SCPCL000000_processed_adt.h5ad") + +# Combine into a MuData object, using keys "RNA" and "ADT" to distinguish modalities +mdata_object = mudata.MuData({"RNA": rna_object, "ADT": adt_object}) +``` + +For more information on working with `AnnData` objects, see {ref}`Getting started with an ScPCA dataset `. + + ## What is the difference between samples and libraries? 
A sample ID, labeled as `scpca_sample_id` and indicated by the prefix `SCPCS`, represents a unique tissue that was collected from a participant. @@ -132,12 +156,29 @@ You can find the [function for generating a QC report](https://github.com/AlexsL ## Which libraries include cell type annotations? Most single-cell and single-nuclei RNA-seq libraries available on the portal will have cell type annotations included in the processed `SingleCellExperiment` or `AnnData` object. -For more information on where to find the cell type annotations, refer to section(s) describing {ref}`SingleCellExperiment file contents ` and/or {ref}`AnnData file contents `. +For more information on where to find the cell type annotations, refer to section(s) describing {ref}`SingleCellExperiment file contents ` and/or {ref}`AnnData file contents `. If cell type annotation was performed, a supplemental cell type report (`SCPCL000000_celltype-report.html`) will be included in the download. Cell type annotation is not performed on samples derived from cell lines. This means processed objects will not include cell type annotations, and the download will not include a cell type report. +## Which libraries include CNV inferences? + +As with cell type annotation, most single-cell and single-nuclei RNA-seq libraries available on the portal will have {ref}`CNV inferences` in the processed `SingleCellExperiment` or `AnnData` object. +For more information on where to find these results, refer to sections describing {ref}`SingleCellExperiment cell metrics ` and {ref}`SingleCellExperiment metadata `, and/or {ref}`AnnData cell metrics ` and {ref}`AnnData metadata `. 
+ +There are several circumstances when CNV results are not available: + +* CNV inference is not performed on libraries that do not have enough cells to include in a normal reference, as described in the {ref}`CNV inference processing documentation` +* CNV inference is not performed on libraries derived from cell lines or non-cancerous samples +* If `inferCNV` experienced a failure while running, there will not be any associated results in the processed objects + +## Where can I find the inferCNV heatmap? + +For libraries that underwent CNV inference, the [`inferCNV` heatmap depicting expression across genomic regions](https://github.com/broadinstitute/inferCNV/wiki/Interpreting-the-figure) is embedded in the final QC report. +You can directly copy the figure from the QC report file for use in other contexts. + + ## What if I want to use Seurat instead of Bioconductor? The RDS files available for download contain [`SingleCellExperiment` objects](http://bioconductor.org/books/3.13/OSCA.intro/the-singlecellexperiment-class.html). @@ -249,8 +290,14 @@ Download links expire in 7 days, but you can generate a new link on the ScPCA Po Download links are only available for projects (i.e., not for downloading individual samples). +## Can I download data from the Portal programmatically? + +We provide an R package, [`ScPCAr`](https://alexslemonade.github.io/ScPCAr/), to facilitate programmatic access to the ScPCA Portal. +This package allows you to search for and download data from the ScPCA Portal directly within R. +Please see the [package documentation](https://alexslemonade.github.io/ScPCAr/) for more details about installation and usage. +Source code for the package can be found on [GitHub](https://github.com/AlexsLemonade/ScPCAr). ## Why can't I change the data format in My Dataset? 
+ When creating a {ref}`custom dataset for download` (`My Dataset`), all single-cell sample or project data included must be of the same {ref}`data format`, either `SingleCellExperiment` for use in R or `AnnData` for use in Python. We currently do not support including both data formats at once in `My Dataset`. Once a sample or project of a given data format has been added to `My Dataset`, all subsequent single-cell or single-nuclei data added will automatically be in that same format. diff --git a/docs/getting_started.md b/docs/getting_started.md index 50d00e5a..2f4095a0 100644 --- a/docs/getting_started.md +++ b/docs/getting_started.md @@ -408,7 +408,7 @@ subsetted_adata_merged_object = adata_merged_object[adata_merged_object.obs["lib ``` The merged object additionally contains metadata such as information about sample diagnosis, subdiagnosis, or tissue location that may be useful for subsetting. -A full set of merged object contents which can support subsetting {ref}`is available here`. +A full set of merged object contents which can support subsetting is available in {ref}`this documentation`. As one example, to subset a `SingleCellExperiment` merged object to a given diagnosis, use the following R code: @@ -489,7 +489,7 @@ Be aware that the processed objects have been filtered to remove low-quality cel ### Filtering cells based on ADT quality control The `adt_scpca_filter` column indicates which cells should be removed before proceeding with downstream analyses of the ADT data, as determined by [`DropletUtils::CleanTagCounts()`](https://rdrr.io/github/MarioniLab/DropletUtils/man/cleanTagCounts.html). -This process identified cells with high levels of ambient contamination and/or high levels of negative control ADTs (if available). +This process identified low-quality cells as those with either very low or high levels of ambient contamination and/or negative control ADTs (if available). 
Cells are labeled either as `"Keep"` (cells to retain) or `"Remove"` (cells to filter out). To filter cells based on this column in the `SingleCellExperiment` objects, use the following command: @@ -520,6 +520,23 @@ Here are some additional resources that can be used for working with ADT counts - [Integrating with Protein Abundance, Orchestrating Single Cell Analysis](http://bioconductor.org/books/3.15/OSCA.advanced/integrating-with-protein-abundance.html) - [Seurat vignette on using with multimodal data](https://satijalab.org/seurat/articles/multimodal_vignette.html) +### Using MuData objects for multimodal analysis + +It is also possible to combine the given RNA and ADT `AnnData` objects into a single [`MuData` object](https://mudata.readthedocs.io/en/latest/index.html) for multimodal analysis, as shown below. + +_Note: You will need to install the [`MuData` package](https://mudata.readthedocs.io/en/latest/index.html) to generate and work with `MuData` objects._ + +```python +import anndata +import mudata + +# Read individual AnnData files +rna_object = anndata.read_h5ad(filename = "SCPCL000000_processed_rna.h5ad") +adt_object = anndata.read_h5ad(filename = "SCPCL000000_processed_adt.h5ad") + +# Combine into a MuData object, using keys "RNA" and "ADT" to distinguish modalities +mdata_object = mudata.MuData({"RNA": rna_object, "ADT": adt_object}) +``` ## Special considerations for multiplexed samples diff --git a/docs/merged_objects.md b/docs/merged_objects.md index adec1b5f..35499ed3 100644 --- a/docs/merged_objects.md +++ b/docs/merged_objects.md @@ -75,10 +75,14 @@ Columns representing quality control statistics were calculated using the [`scut | `miQC_pass` | Indicates whether the cell passed the default miQC filtering. 
`TRUE` is assigned to cells with a low probability of being compromised (`prob_compromised` < 0.75) or [sufficiently low mitochondrial content](https://bioconductor.org/packages/release/bioc/vignettes/miQC/inst/doc/miQC.html#preventing-exclusion-of-low-mito-cells) | | `scpca_filter` | Labels cells as either `Keep` or `Remove` based on filtering criteria (`prob_compromised` < 0.75 and number of unique genes detected > 200) | | `adt_scpca_filter` | If CITE-seq was performed, labels cells as either `Keep` or `Remove` based on ADT filtering criteria (`discard = TRUE` as determined by [`DropletUtils::CleanTagCounts()`](https://rdrr.io/github/MarioniLab/DropletUtils/man/cleanTagCounts.html)) | +| `scDblFinder_class` | The `scDblFinder` predicted classification, either "singlet" or "doublet" | +| `scDblFinder_score` | The `scDblFinder` doublet score, representing the probability that the cell is a doublet | | `submitter_celltype_annotation` | If available, cell type annotations obtained from the group that submitted the original data. Cells that the submitter did not annotate are labeled as `"Submitter-excluded"` | +| `openscpca_celltype_annotation` | If available, cell type annotations obtained from the [OpenScPCA project](https://openscpca.readthedocs.io/en/latest/) as determined by analysis performed in the [`OpenScPCA-analysis` GitHub repository](https://github.com/AlexsLemonade/OpenScPCA-analysis). Cells that were not annotated as part of OpenScPCA are labeled as `"openscpca-excluded"` | +| `openscpca_celltype_ontology` | If available, the Cell Ontology identifier associated with the `openscpca_celltype_annotation`. `NA` will be used if there is no appropriate cell ontology term | Unlike for {ref}`individual SCE objects`, cluster assignments are not included in the `colData`. 
-Further, if cell type annotation was performed on at least one library included in the merged object, there will be additional columns representing annotation results in the merged object's `colData`, as described in the {ref}`cell type annotation processing section `. +Further, if cell type annotation was performed on at least one library included in the merged object, there will be additional `colData` columns with these annotation results, as described in the {ref}`cell type annotation processing section `. | Column name | Contents | @@ -88,8 +92,13 @@ Further, if cell type annotation was performed on at least one library included | `cellassign_celltype_annotation` | If cell typing with `CellAssign` was performed, the annotated cell type. Cells labeled as `"other"` are those which `CellAssign` could not confidently annotate. If cell typing was performed for some but not all libraries in the merged object, libraries without annotations will be labeled `"Cell type annotation not performed"`. If `CellAssign` was unable to complete successfully for a given library, cells will be labeled as `"Not run"` | | `cellassign_celltype_ontology` | If cell typing with `CellAssign` was performed, the annotated cell type ontology ID. Cells labeled as `NA` are those which `CellAssign` could not confidently annotate. If cell typing was performed for some but not all libraries in the merged object, libraries without annotations will be labeled `"Cell type annotation not performed"` | | `cellassign_max_prediction` | If cell typing with `CellAssign` was performed and completed successfully, the annotation's prediction score (probability) | -| `consensus_celltype_annotation` | The assigned consensus cell type annotation as determined by finding the latest common ancestor between the `SingleR` and `CellAssign` cell type annotations. Cells labeled as "Unknown" do not have an appropriate consensus cell type label that could be assigned. 
If the consensus cell type was assigned for some but not all libraries in the merged object, libraries without consensus annotations will be labeled `"No consensus cell type assigned"`| -| `consensus_celltype_ontology` | The assigned consensus cell type ontology ID as determined by finding the latest common ancestor between the `SingleR` and `CellAssign` cell type annotations. Cells labeled as `NA` do not have an appropriate consensus cell type label that could be assigned. If the consensus cell type was assigned for some but not all libraries in the merged object, libraries without consensus annotations will be labeled `"No consensus cell type assigned"` | +| `scimilarity_celltype_annotation` | If cell typing with `SCimilarity` was performed, the annotated cell type. Cells labeled as `"Unclassified cell"` did not undergo cell type annotation with `SCimilarity` | +| `scimilarity_celltype_ontology` | If cell typing with `SCimilarity` was performed, the annotated cell type ontology ID. | +| `scimilarity_min_distance` | If cell typing with `SCimilarity` was performed and completed successfully, the minimum distance of the cell to all cells in the training set | +| `consensus_celltype_annotation` | The assigned consensus cell type annotation as determined by finding the latest common ancestor among `SingleR`, `CellAssign`, and `SCimilarity` cell type annotations. Cells labeled as `"Unknown"` do not have an appropriate consensus cell type label that could be assigned. If the consensus cell type was assigned for some but not all libraries in the merged object, libraries without consensus annotations will be labeled `"No consensus cell type assigned"`| +| `consensus_celltype_ontology` | The assigned consensus cell type ontology ID as determined by finding the latest common ancestor among `SingleR`, `CellAssign`, and `SCimilarity` cell type annotations. Cells labeled as `NA` do not have an appropriate consensus cell type label that could be assigned. 
If the consensus cell type was assigned for some but not all libraries in the merged object, libraries without consensus annotations will be labeled `"No consensus cell type assigned"` | +| `is_infercnv_reference` | Whether the cell would be considered part of the reference normal cells for `inferCNV` inference. Only present when consensus cell types were assigned and CNV inference was attempted. This will be `NA` for any samples obtained from normal or non-cancerous tissue | +| `infercnv_total_cnv` | If CNV inference was performed, the total number of CNV events in the cell calculated by `inferCNV`. Cells labeled `NA` were either filtered out during `inferCNV` calculations, as described in the {ref}`CNV inference processing section `, or may be the result of `inferCNV` result caching. This column is only present if `inferCNV` was successfully run | @@ -164,10 +173,13 @@ Each such list will contain the following fields: | `scpca_filter_method` | Method used by the Data Lab to filter low quality cells prior to normalization. Either `miQC` or `Minimum_gene_cutoff` | | `adt_scpca_filter_method` | If CITE-seq was performed, the method used by the Data Lab to identify cells to be filtered prior to normalization, based on ADT counts. Either `cleanTagCounts with isotype controls` or `cleanTagCounts without isotype controls`. If filtering failed (i.e. `DropletUtils::cleanTagCounts()` could not reliably determine which cells to filter), the value will be `No filter` | | `min_gene_cutoff` | The minimum cutoff for the number of unique genes detected per cell | -| `normalization` | The method used for normalization of raw RNA counts. Either `deconvolution`, described in [Lun, Bach, and Marioni (2016)](https://doi.org/10.1186/s13059-016-0947-7), or `log-normalization` | +| `normalization` | The method used for normalization of raw RNA counts. Either `deconvolution` or `log-normalization`, as explained in the {ref}`processed gene expression data section `. 
Only present for `processed` objects | | `adt_normalization` | If CITE-seq was performed, the method used for normalization of raw ADT counts. Either `median-based` or `log-normalization`, as explained in the {ref}`processed ADT data section ` | | `highly_variable_genes` | A list of highly variable genes used for dimensionality reduction, determined using `scran::modelGeneVar` and `scran::getTopHVGs` | -| `celltype_methods` | If cell type annotation was performed, a vector of the methods used for annotation. May include `"submitter"`, `"singler"` and/or `"cellassign"` | +| `celltype_methods` | If cell type annotation was performed, a vector of the methods used for annotation. May include `"submitter"`, `"openscpca"`, `"singler"` and/or `"cellassign"` | +| `openscpca_celltype_module_name` | If cell type annotations from the OpenScPCA project are available, the original module name from the [`OpenScPCA-analysis` GitHub repository](https://github.com/AlexsLemonade/OpenScPCA-analysis) | +| `openscpca_celltype_nf_version` | If cell type annotations from the OpenScPCA project are available, the version of the [`OpenScPCA-nf` workflow](https://github.com/AlexsLemonade/OpenScPCA-nf) used to generate annotations | +| `openscpca_celltype_release_date` | If cell type annotations from the OpenScPCA project are available, the release date for the input ScPCA data used when assigning annotations | | `singler_results` | If cell typing with `SingleR` was performed, the full result object returned by `SingleR` annotation | | `singler_reference` | If cell typing with `SingleR` was performed, the name of the reference dataset used for annotation | | `singler_reference_label` | If cell typing with `SingleR` was performed, the name of the label in the reference dataset used for annotation | @@ -178,6 +190,14 @@ Each such list will contain the following fields: | `cellassign_reference_organs` | If cell typing with `CellAssign` was performed and completed successfully, a comma-separated 
list of organs and/or tissue compartments from which marker genes were obtained to create the reference | | `cellassign_reference_source` | If cell typing with `CellAssign` was performed and completed successfully, the source of the reference dataset (default is [`PanglaoDB`](https://panglaodb.se/)) | | `cellassign_reference_version` | If cell typing with `CellAssign` was performed and completed successfully, the version of the reference dataset source. For references obtained from `PanglaoDB`, the version scheme is a date in ISO8601 format | +| `scimilarity_model` | If cell typing with `SCimilarity` was performed, the name of the foundation model used to run `SCimilarity` | +| `consensus_celltype_methods` | If consensus cell types are present, a vector with the names of the automated methods used to generate consensus cell type annotations | +| `infercnv_reference_celltypes` | Vector of consensus cell type labels which would be specified for the normal reference cells in `inferCNV` reference. Only present when consensus cell types were assigned and CNV inference was attempted. If CNV inference was attempted and the sample was obtained from normal or non-cancerous tissue this will not be present | +| `infercnv_num_reference_cells` | The total number of normal reference cells included in the `inferCNV` reference. Only present when consensus cell types were assigned and CNV inference was attempted. If CNV inference was attempted and the sample was obtained from normal or non-cancerous tissue this will be `NA` | +| `infercnv_diagnosis_groups` | The broad diagnosis group used to determine which cell types should be considered as part of the normal cell reference for `inferCNV`. If this value is `Non-cancerous`, `inferCNV` was not run | +| `infercnv_success` | Boolean indicating if `inferCNV` succeeded. 
`TRUE` indicates that `inferCNV` successfully ran, `FALSE` indicates that `inferCNV` was attempted but failed during inference for an unknown reason, and `NA` indicates that `inferCNV` was not run due to insufficient normal reference cells. Only present when CNV inference was attempted | +| `infercnv_options` | A list containing the contents of the `options` slot from the [final `inferCNV` results object](https://github.com/broadinstitute/infercnv/wiki/Output-Files#infercnv-object-files), which holds all parameters used when running `inferCNV`. Only present when CNV inference was performed | +| `infercnv_table` | The `inferCNV` metadata table with detailed CNV results output from `inferCNV`. This data frame corresponds to the [`map_metadata_from_infercnv.txt` table calculated from the HMM results](https://github.com/broadinstitute/infercnv/wiki/Extracting-features). Only present when CNV inference was performed | Unlike for {ref}`individual SingleCellExperiment objects`, cluster algorithm parameters are not included in these metadata lists because clusters themselves are not included in the merged object. @@ -193,7 +213,7 @@ colData(merged_sce) # sample metadata only for projects without multiplexing ``` If the project contains multiplexed libraries, this information is stored in the `metadata` slot in the `sample_metadata` field as a `data.frame`. -Similar to merged objects with multiplexed libraries, all {ref} `individual library objects` will contain this sample metadata in the `SingleCellExperiment` object's `metadata` slot. +Similar to merged objects with multiplexed libraries, all {ref}`individual library objects` will contain this sample metadata in the `SingleCellExperiment` object's `metadata` slot. 
```r metadata(merged_sce)$sample_metadata # sample metadata only for projects with multiplexed samples diff --git a/docs/processing_information.md b/docs/processing_information.md index 07078f7c..eefdc923 100644 --- a/docs/processing_information.md +++ b/docs/processing_information.md @@ -54,6 +54,9 @@ Only cells that pass this FDR threshold are included in the filtered counts matr For some libraries, `DropletUtils::emptyDropsCellRanger()` may fail due to low numbers of droplets with reads or other violations of its assumptions. For these libraries, only droplets containing at least 100 UMI are included in the filtered counts matrix. +We additionally used [`scDblFinder`](https://www.bioconductor.org/packages/release/bioc/html/scDblFinder.html) to predict whether cells present in this filtered object are singlets or doublets. +We provide the results from this analysis in the filtered and processed objects, but we do not perform any filtering based on these results. + ### Processed gene expression data In addition to the raw gene expression data, we also provide a processed `SingleCellExperiment` object with further filtering applied, a normalized counts matrix, and results from dimensionality reduction. @@ -64,6 +67,10 @@ Cells with a high likelihood of being compromised (greater than 0.75) and cells In certain circumstances, `miQC` modeling may fail; in these cases, only cells which do not pass the threshold of at least 200 unique genes are removed. Log-normalized counts are calculated using the deconvolution method presented in [Lun, Bach, and Marioni (2016)](https://doi.org/10.1186/s13059-016-0947-7). +Specifically, `scran::quickCluster()` was used to derive cell clusters on which to calculate sum factors with `scran::computeSumFactors()`, which are in turn used during normalization with `scuttle::logNormCounts()`. +If this deconvolution-based approach failed for any reason, only `scuttle::logNormCounts()` was used for normalization. 
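
The fallback normalization described above can be illustrated with a small, language-agnostic sketch in Python. It is not the `scran`/`scuttle` code used by the workflow. It assumes the default behavior of `scuttle::logNormCounts()` (library-size factors scaled to a mean of 1, a pseudocount of 1, and a log2 transform), and the function name `log_norm_counts` and the toy matrix are hypothetical.

```python
import math


def log_norm_counts(counts):
    """Library-size log-normalization of a genes x cells count matrix.

    `counts` is a list of rows (one per gene), each holding one count per cell.
    Assumes every cell has a non-zero library size (cells are already filtered).
    """
    n_cells = len(counts[0])
    # Library size: total counts observed in each cell
    library_sizes = [sum(row[j] for row in counts) for j in range(n_cells)]
    # Size factors: library sizes scaled to have a mean of 1 across cells
    mean_size = sum(library_sizes) / n_cells
    size_factors = [size / mean_size for size in library_sizes]
    # Divide by the size factor, add a pseudocount of 1, and take log2
    return [
        [math.log2(row[j] / size_factors[j] + 1) for j in range(n_cells)]
        for row in counts
    ]


# Hypothetical toy matrix: 2 genes x 2 cells, with cell 2 sequenced twice as deeply
toy_counts = [[1, 2], [3, 6]]
toy_logcounts = log_norm_counts(toy_counts)
```

In this toy example the two cells differ only in sequencing depth, so they end up with identical normalized values. The deconvolution approach replaces the simple library-size factors with pooled sum factors but applies the same final transformation.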
+ +Next, `scran::modelGeneVar()` was used to model gene variance from the log-normalized counts and `scran::getTopHVGs()` was used to select the top 2000 high-variance genes. The log-normalized counts are used to model variance of each gene prior to selecting the top 2000 highly variable genes (HVGs). These HVGs are then used as input to principal component analysis, and the top 50 principal components are selected. Finally, these principal components are used to calculate the [UMAP (Uniform Manifold Approximation and Projection)](http://bioconductor.org/books/3.13/OSCA.basic/dimensionality-reduction.html#uniform-manifold-approximation-and-projection) embeddings. @@ -72,10 +79,11 @@ Finally, these principal components are used to calculate the [UMAP (Uniform Man #### Cell type annotation -We perform cell type annotation with two complementary methods, where possible: +We perform cell type annotation with three complementary methods, where possible, and assign a single consensus cell type annotation based on agreement among these methods: - [`SingleR`](https://bioconductor.org/packages/release/bioc/html/SingleR.html), a reference-based cell type annotation method ([Looney _et al._ 2019](https://doi.org/10.1038/s41590-018-0276-y)) - [`CellAssign`](https://github.com/Irrationone/cellassign), a marker-gene-based cell type annotation method ([Zhang _et al._ 2019](https://doi.org/10.1038/s41592-019-0529-1)) +- [`SCimilarity`](https://genentech.github.io/scimilarity/index.html), a cell atlas foundation model ([Heimberg _et al._ 2025](https://doi.org/10.1038/s41586-024-08411-y)) For `SingleR` annotation, we identify an appropriate reference dataset from the [`celldex` package](http://bioconductor.org/packages/release/data/experiment/html/celldex.html) and train the classification model to use ontology IDs for annotation. Cells which `SingleR` cannot confidently assign are labeled as `NA`. 
@@ -88,18 +96,24 @@ As a consequence, cells which `CellAssign` cannot confidently annotate from the Please be aware that all cell type annotation reference datasets are derived from normal (not tumor) tissue. In addition, `CellAssign` annotation is only performed if there are at least 30 cells present in the `processed` object. -Some cells may be labeled as "Unclassified cell" if they were not annotated with `SingleR` or `CellAssign`. -These are cells which were not present in previous ScPCA data versions on which cell typing was initially performed, so they were not labeled. +For `SCimilarity` annotation, we use the foundation model described in [Heimberg _et al._ 2025](https://doi.org/10.1038/s41586-024-08411-y) that contains 7.3 million cells from various normal and diseased tissues to annotate all samples. +Each cell is annotated with the cell type label of the most similar cell in the `SCimilarity` model. + +Some cells may be labeled as "Unclassified cell" if they were not annotated with a given automated method. +These cells were not present in earlier ScPCA data versions on which cell typing was originally performed and are therefore not labeled. + +Additionally, annotations from `SingleR`, `CellAssign`, and `SCimilarity` are used to assign an ontology-aware consensus cell type label. -Additionally, annotations from `SingleR` and `CellAssign` are used to assign an ontology-aware consensus cell type label. -The [latest common ancestor (LCA)](https://rdrr.io/bioc/ontoProc/man/findCommonAncestors.html) between the `SingleR` and `CellAssign` cell type assignments is used as the consensus cell type label if the following criteria are met, otherwise no consensus cell type is assigned: +Consensus cell types are assigned if two out of the three cell type methods share a [latest common ancestor (LCA)](https://rdrr.io/bioc/ontoProc/man/findCommonAncestors.html) that meets the following criteria; otherwise, no consensus cell type is assigned: -1.
The terms share only one distinct LCA. -The only exception to this rule is if the terms share two LCAs and one of which is `hematopoietic precursor cell`, then `hematopoietic precursor cell` is used as the consensus label. +1. The terms share at least 1 LCA that either has fewer than 170 descendants or is one of `neuron`, `epithelial cell`, `columnar/cuboidal epithelial cell` or `endo-epithelial cell`. -2. The LCA has fewer than 170 descendants, or is either `neuron` or `epithelial cell`. +2. If more than 1 LCA is shared between two terms, then the LCA with the fewest descendants is kept and all others are discarded. -If the LCA is one of the following non-specific LCA terms, no consensus cell type is assigned: `bone cell`, `lining cell`, `blood cell`, `progenitor cell`, and `supporting cell`. +3. If the LCA has fewer than 170 descendants and is one of the following non-specific LCA terms, no consensus cell type is assigned: `bone cell`, `lining cell`, `blood cell`, `progenitor cell`, `supporting cell`, `biogenic amine secreting cell`, `protein secreting cell`, `extracellular matrix secreting cell`, `serotonin secreting cell`, `peptide hormone secreting cell`, `exocrine cell`, `sensory receptor cell`, or `interstitial cell`. + +If more than one LCA is identified as a possible consensus cell type, meaning there is agreement among all three methods, the LCA with the fewest descendants is used as the consensus cell type. +For more information about how consensus cell types are assigned, see the [`cell-type-consensus` module in the `OpenScPCA-analysis` GitHub repository](https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/main/analyses/cell-type-consensus). Cell type annotation is not performed for cell line samples. For information on how to determine if a given sample was derived from a cell line, refer to section(s) describing {ref}`SingleCellExperiment file contents ` and/or {ref}`AnnData file contents `. 
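The updated consensus rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by the workflow (see the `cell-type-consensus` module for that): the `PARENTS` ontology is a hypothetical toy stand-in for the Cell Ontology, `NON_SPECIFIC` and `ALLOWED_BROAD` hold only a subset of the terms listed above, and the function names are invented for this example. Because the toy ontology is tiny, the usage lines pass a small `max_descendants` instead of 170.

```python
from itertools import combinations

# Hypothetical toy ontology (child -> parents), standing in for the Cell Ontology
PARENTS = {
    "cell": [],
    "blood cell": ["cell"],
    "lymphocyte": ["blood cell"],
    "T cell": ["lymphocyte"],
    "CD4 T cell": ["T cell"],
    "CD8 T cell": ["T cell"],
    "B cell": ["lymphocyte"],
    "neuron": ["cell"],
}
NON_SPECIFIC = {"blood cell", "bone cell", "lining cell"}  # subset of the listed terms
ALLOWED_BROAD = {"neuron", "epithelial cell"}  # subset of the listed exceptions


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def n_descendants(term):
    """Count the terms that have `term` as a proper ancestor."""
    return sum(1 for t in PARENTS if t != term and term in ancestors(t))


def pair_lca(a, b, max_descendants):
    """Apply rules 1-3 to one pair of labels; return an LCA or None."""
    common = ancestors(a) & ancestors(b)
    # Latest common ancestors: not a proper ancestor of another common ancestor
    latest = [t for t in common
              if not any(o != t and t in ancestors(o) for o in common)]
    # Rule 1: specific enough, or one of the explicitly allowed broad terms
    kept = [t for t in latest
            if n_descendants(t) < max_descendants or t in ALLOWED_BROAD]
    if not kept:
        return None
    # Rule 2: keep the LCA with the fewest descendants
    best = min(kept, key=n_descendants)
    # Rule 3: non-specific terms never become a consensus label
    return None if best in NON_SPECIFIC else best


def consensus(labels, max_descendants=170):
    """Consensus across up to three method labels (None = not annotated)."""
    known = [label for label in labels if label is not None]
    candidates = [pair_lca(a, b, max_descendants)
                  for a, b in combinations(known, 2)]
    candidates = [c for c in candidates if c is not None]
    # When several pairs agree, use the LCA with the fewest descendants
    return min(candidates, key=n_descendants) if candidates else None


print(consensus(("CD4 T cell", "CD8 T cell", "neuron"), max_descendants=5))
print(consensus(("lymphocyte", "blood cell", None), max_descendants=6))
```

In the first call, two of the three labels agree at `T cell`, while `neuron` only shares the root term with the others and is ignored. In the second call, the only shared ancestor below the threshold is the non-specific `blood cell`, so no consensus is assigned.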
@@ -107,6 +121,37 @@ For information on how to determine if a given sample was derived from a cell li **Note:** For some libraries, cell type annotations were provided from the group that submitted the original data. In these cases, the cell type annotations obtained from the submitter will be present in addition to cell type annotation performed with `SingleR` and `CellAssign`. +##### Cell type annotations from the OpenScPCA project + +As part of the ongoing [OpenScPCA project](https://openscpca.readthedocs.io/en/latest/), cell types are annotated and validated on a project-by-project basis using methods and references that are most appropriate for the disease types represented in that project. +If cell type annotation has been completed for all samples in a project, these curated cell types will be included alongside the automated cell type annotations. +For more information on where to find these annotations in the downloaded objects, refer to section(s) describing {ref}`SingleCellExperiment file contents ` and/or {ref}`AnnData file contents `. + +For more details on how cells from a specific project were annotated, see the appropriate module in the [`OpenScPCA-analysis` repository](https://github.com/AlexsLemonade/OpenScPCA-analysis). +The name of the module and other versioning information can be found in the {ref}`experiment metadata for SingleCellExperiment and AnnData objects `. + +The OpenScPCA project is an ongoing open and collaborative effort to characterize the ScPCA Portal data. +For more information on the project, including contributing your own analyses, see the [OpenScPCA documentation](https://openscpca.readthedocs.io/en/latest/). + +#### CNV inference + +We perform CNV inference using [`inferCNV`](https://github.com/broadinstitute/infercnv), specifying the [`i6` HMM](https://github.com/broadinstitute/infercnv/wiki/infercnv-i6-HMM-type) to quantify specific CNV events. 
+ +`inferCNV` uses a designated set of normal reference cells to quantify CNV events based on gene expression. +We use the consensus cell type labels, as described in the [cell type annotation section](#cell-type-annotation), to establish normal references for each sample. +The specific cell types to include are determined by each sample's diagnosis. +We [designate cells as either `reference` or `query`](https://github.com/broadinstitute/inferCNV/wiki/File-Definitions#sample-annotation-file), rather than using their specific cell type labels, where the label `reference` specifies normal reference cells. +`inferCNV` is only run if there are at least 100 cells designated as `reference` in a given sample. +As such, `inferCNV` is not run on cell lines, because they do not undergo cell type annotation, or on samples obtained from normal or non-cancerous tissue. + +We additionally specify a [gene ordering file](https://github.com/broadinstitute/inferCNV/wiki/File-Definitions#gene-ordering-file) with chromosome arm designations (e.g., `chr1p` and `chr1q` are used rather than `chr1`) for finer-grained results. +We otherwise use `inferCNV` defaults, except that we set `denoise = TRUE` and `cutoff = 0.1` (for 10x data) [as recommended](https://github.com/broadinstitute/inferCNV/wiki#quick-start). +Note that we keep the default `inferCNV` setting that removes any cells with fewer than 100 raw RNA counts; these cells will not have `inferCNV` estimates. + +We calculate the total CNV per cell [using the feature output from the `i6` HMM](https://github.com/broadinstitute/infercnv/wiki/Extracting-features) by summing all values in the HMM metadata table columns named `has_cnv_chr{1:22}{p,q}` (e.g., `has_cnv_chr1p`, `has_cnv_chr1q`, and so on). +Any cells which `inferCNV` removed due to low counts will not have a total CNV estimate.
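The total CNV calculation described above amounts to a per-cell sum over the chromosome-arm indicator columns. The following is a minimal Python sketch, assuming the HMM metadata table has already been read into a dictionary of rows keyed by cell barcode. The barcodes, values, and the `has_loss_chr1p` column are illustrative, and `total_cnv` is a name invented for this example. The `infercnv_total_cnv` column provided in the objects already contains this value, so the sketch is only meant to show what that number represents.

```python
import re

# Matches the per-arm indicator columns: has_cnv_chr1p ... has_cnv_chr22q
ARM_COLUMN = re.compile(r"^has_cnv_chr([1-9]|1[0-9]|2[0-2])[pq]$")


def total_cnv(row):
    """Sum the per-arm has_cnv_* values for one cell (one table row)."""
    return sum(value for column, value in row.items()
               if ARM_COLUMN.match(column))


# Hypothetical rows; a real table has one has_cnv_* column per chromosome arm
table = {
    "AAAC": {"has_cnv_chr1p": 1, "has_cnv_chr1q": 0,
             "has_cnv_chr22q": 1, "has_loss_chr1p": 1},
    "TTTG": {"has_cnv_chr1p": 0, "has_cnv_chr1q": 0,
             "has_cnv_chr22q": 0, "has_loss_chr1p": 0},
}

# Columns that do not match the has_cnv_* arm pattern are ignored
totals = {barcode: total_cnv(row) for barcode, row in table.items()}
print(totals)
```

Cells that `inferCNV` removed for low counts have no row in the table, so they would simply be absent from `totals` here.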
+ + ## ADT quantification from CITE-seq experiments CITE-seq libraries with reads from antibody-derived tags (ADTs) were also quantified using [`salmon`](https://salmon.readthedocs.io/en/latest) and [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), rounded to integer values. @@ -126,7 +171,8 @@ When cells were [filtered based on RNA-seq content](#filtering-cells) after quan An ambient profile representing antibody-derived tag (ADT) proportions present in the ambient solution is calculated from the unfiltered `SingleCellExperiment` object using [`DropletUtils::ambientProfileEmpty()`](https://rdrr.io/github/MarioniLab/DropletUtils/man/ambientProfileEmpty.html). Quality-control statistics were calculated with [`DropletUtils::cleanTagCounts()`](https://rdrr.io/github/MarioniLab/DropletUtils/man/cleanTagCounts.html) (with default parameters) using this ambient profile, along with negative/isotype control information, if present. -Low-quality cells identified by `DropletUtils::cleanTagCounts()` (those having high levels of ambient contamination or substantial negative/isotype control tags) are flagged but not removed except during normalization, as described below. +This function flags cells as low-quality if they have very high levels of ambient contamination and/or negative/isotype control tags (if present), or if they lack ambient expression altogether, which may indicate failed capture. +Low-quality cells identified by `DropletUtils::cleanTagCounts()` are flagged but not removed except during normalization, as described below. If `DropletUtils::cleanTagCounts()` cannot reliably determine which cells to filter, then no cells will be flagged for removal.
For all cells that would be retained if `DropletUtils::cleanTagCounts()` filtering were applied, log-normalized ADT counts are, by default, calculated using [median-based normalization](http://bioconductor.org/books/3.16/OSCA.advanced/integrating-with-protein-abundance.html#cite-seq-median-norm), again making use of the baseline ambient profile. diff --git a/docs/sce_file_contents.md b/docs/sce_file_contents.md index 59968364..08b5a0a4 100644 --- a/docs/sce_file_contents.md +++ b/docs/sce_file_contents.md @@ -68,7 +68,7 @@ The following per-cell data columns are included for each cell, calculated using | `total` | Total UMI count for RNA-seq data and any alternative experiments (i.e., ADT data from CITE-seq) | The following additional per-cell data columns are included in both the `filtered` and `processed` objects. -These columns include metrics calculated by [`miQC`](https://bioconductor.org/packages/release/bioc/html/miQC.html), a package that jointly models proportion of reads belonging to mitochondrial genes and number of unique genes detected to predict low-quality cells. +These columns include metrics calculated by [`miQC`](https://bioconductor.org/packages/release/bioc/html/miQC.html), a package that jointly models proportion of reads belonging to mitochondrial genes and number of unique genes detected to predict low-quality cells, as well as [`scDblFinder`](https://www.bioconductor.org/packages/release/bioc/html/scDblFinder.html), a package to predict doublets. We also include the filtering results used for the creation of the `processed` objects. See the description of the {ref}`processed gene expression data ` for more information on filtering performed to create the `processed` objects. 
@@ -78,11 +78,16 @@ See the description of the {ref}`processed gene expression data 200) | | `adt_scpca_filter` | If CITE-seq was performed, labels cells as either `Keep` or `Remove` based on ADT filtering criteria (`discard = TRUE` as determined by [`DropletUtils::CleanTagCounts()`](https://rdrr.io/github/MarioniLab/DropletUtils/man/cleanTagCounts.html)) | -| `submitter_celltype_annotation` | If available, cell type annotations obtained from the group that submitted the original data. Cells that the submitter did not annotate are labeled as `"Submitter-excluded"` | +| `scDblFinder_class` | The `scDblFinder` predicted classification, either "singlet" or "doublet" | +| `scDblFinder_score` | The `scDblFinder` doublet score, representing the probability that the cell is a doublet | + +The `processed` object contains additional `colData` column(s): -The `processed` object has one additional `colData` column reflecting cluster assignments. -Further, if cell type annotation was performed, there will be additional columns representing annotation results in the `processed` object's `colData`, as described in the {ref}`cell type annotation processing section `. 
+* A column with graph-based clustering assignments will be present +Note that these clusters were calculated with default parameters and were not evaluated, as described in the {ref}`section on processed gene expression data ` +* If cell type annotation was performed, columns containing annotation results will be present, as described in the {ref}`cell type annotation processing section ` +* If CNV inference was performed, columns containing these results will be present, as described in the {ref}`CNV inference processing section ` | Column name | Contents | | ----------------------- | ----------------------------------------------------- | @@ -92,8 +97,22 @@ Further, if cell type annotation was performed, there will be additional columns | `cellassign_celltype_annotation` | If cell typing with `CellAssign` was performed, the annotated cell type. Cells labeled as `"other"` are those which `CellAssign` could not confidently annotate. If `CellAssign` was unable to complete successfully, cells will be labeled as `"Not run"`. Cells labeled as `"Unclassified cell"` did not undergo cell type annotation with `CellAssign` | | `cellassign_celltype_ontology` | If cell typing with `CellAssign` was performed, the annotated cell type ontology ID. Cells labeled as `NA` are those which `CellAssign` could not confidently annotate. If `CellAssign` was unable to complete successfully, cells will be labeled as `"Not run"` | | `cellassign_max_prediction` | If cell typing with `CellAssign` was performed and completed successfully, the annotation's prediction score (probability) | -| `consensus_celltype_annotation` | The assigned consensus cell type annotation as determined by finding the latest common ancestor between the `SingleR` and `CellAssign` cell type annotations. Cells labeled as `"Unknown"` do not have an appropriate consensus cell type label that could be assigned. 
This column is only present if `SingleR` and `CellAssign` were run | -| `consensus_celltype_ontology` | The assigned consensus cell type ontology ID as determined by finding the latest common ancestor between the `SingleR` and `CellAssign` cell type annotations. Cells labeled as `NA` do not have an appropriate consensus cell type label that could be assigned. This column is only present if `SingleR` and `CellAssign` were run | +| `scimilarity_celltype_annotation` | If cell typing with `SCimilarity` was performed, the annotated cell type. Cells labeled as `"Unclassified cell"` did not undergo cell type annotation with `SCimilarity` | +| `scimilarity_celltype_ontology` | If cell typing with `SCimilarity` was performed, the annotated cell type ontology ID. | +| `scimilarity_min_distance` | If cell typing with `SCimilarity` was performed and completed successfully, the minimum distance of the cell to all cells in the training set | +| `consensus_celltype_annotation` | The assigned consensus cell type annotation as determined by finding the latest common ancestor among `SingleR`, `CellAssign`, and `SCimilarity` cell type annotations. Cells labeled as `"Unknown"` do not have an appropriate consensus cell type label that could be assigned. This column is only present if at least two cell typing methods (`SingleR`, `CellAssign`, or `SCimilarity`) were run | +| `consensus_celltype_ontology` | The assigned consensus cell type ontology ID as determined by finding the latest common ancestor among `SingleR`, `CellAssign`, and `SCimilarity` cell type annotations. Cells labeled as `NA` do not have an appropriate consensus cell type label that could be assigned. This column is only present if at least two cell typing methods (`SingleR`, `CellAssign`, or `SCimilarity`) were run | +| `is_infercnv_reference` | Whether the cell would be considered part of the reference normal cells for `inferCNV` inference. 
Only present when consensus cell types were assigned and CNV inference was attempted. If CNV inference was attempted and the sample is obtained from a normal or non-cancerous tissue this will not be present | +| `infercnv_total_cnv` | If CNV inference was performed, the total number of CNV events in the cell calculated by `inferCNV`. Cells labeled `NA` were either filtered out during `inferCNV` calculations, as described in the {ref}`CNV inference processing section `, or may be the result of `inferCNV` result caching. This column is only present if `inferCNV` was successfully run | + +For some libraries, cell types were annotated either by submitters or through the [`OpenScPCA` project](https://openscpca.readthedocs.io/en/latest/) as described in the {ref}`cell type annotation processing section `. +In this case, these additional columns will be present in all three objects (unfiltered, filtered, and processed) containing the associated cell type annotations: + +| Column name | Contents | +| ----------- | -------- | +| `submitter_celltype_annotation` | If available, cell type annotations obtained from the group that submitted the original data. Cells that the submitter did not annotate are labeled as `"Submitter-excluded"` | +| `openscpca_celltype_annotation` | If available, cell type annotations obtained from the [OpenScPCA project](https://openscpca.readthedocs.io/en/latest/) as determined by analysis performed in the [`OpenScPCA-analysis` GitHub repository](https://github.com/AlexsLemonade/OpenScPCA-analysis). Cells that were not annotated as part of OpenScPCA are labeled as `"openscpca-excluded"` | +| `openscpca_celltype_ontology` | If available, the Cell Ontology identifier associated with the `openscpca_celltype_annotation`. 
`NA` will be used if there is no appropriate cell ontology term | ### SingleCellExperiment gene information and metrics @@ -146,14 +165,17 @@ metadata(sce) # experiment metadata | `scpca_filter_method` | Method used by the Data Lab to filter low quality cells prior to normalization. Either `miQC` or `Minimum_gene_cutoff` if `miQC` modeling failed | | `adt_scpca_filter_method` | If CITE-seq was performed, the method used by the Data Lab to identify cells to be filtered prior to normalization, based on ADT counts. Either `cleanTagCounts with isotype controls` or `cleanTagCounts without isotype controls`. If filtering failed (i.e. `DropletUtils::cleanTagCounts()` could not reliably determine which cells to filter), the value will be `No filter` | | `min_gene_cutoff` | The minimum cutoff for the number of unique genes detected per cell used to filter cells. Only present for `filtered` and `processed` objects | -| `normalization` | The method used for normalization of raw RNA counts. Either `deconvolution`, described in [Lun, Bach, and Marioni (2016)](https://doi.org/10.1186/s13059-016-0947-7), or `log-normalization`. Only present for `processed` objects | +| `normalization` | The method used for normalization of raw RNA counts. Either `deconvolution` or `log-normalization`, as explained in the {ref}`processed gene expression data section `. Only present for `processed` objects | | `adt_normalization` | If CITE-seq was performed, the method used for normalization of raw ADT counts. Either `median-based` or `log-normalization`, as explained in the {ref}`processed ADT data section `. Only present for `processed` objects | | `highly_variable_genes` | A vector of highly variable genes used for dimensionality reduction, determined using `scran::modelGeneVar` and `scran::getTopHVGs`. Only present for `processed` objects | | `cluster_algorithm` | The algorithm used to perform graph-based clustering of cells. 
Only present for `processed` objects | | `cluster_weighting` | The weighting approach used during graph-based clustering. Only present for `processed` objects | | `cluster_nn` | The nearest neighbor parameter value used for the graph-based clustering. Only present for `processed` objects | -| `celltype_methods` | If cell type annotation was performed, a vector of the methods used for annotation. May include `"submitter"`, `"singler"` and/or `"cellassign"`. If submitter cell-type annotations are available, this metadata item will be present in all objects. Otherwise, this item will only be in `processed` objects | -| `singler_results` | If cell typing with `SingleR` was performed, the full result object returned by `SingleR` annotation. Only present for `processed` objects | +| `celltype_methods` | If cell type annotation was performed, a vector of the methods used for annotation. May include `"submitter"`, `"openscpca"`, `"singler"` and/or `"cellassign"`. If submitter or OpenScPCA cell type annotations are available, this metadata item will be present in all objects. Otherwise, this item will only be in `processed` objects | +| `openscpca_celltype_module_name` | If cell type annotations from the OpenScPCA project are available, the original module name from the [`OpenScPCA-analysis` GitHub repository](https://github.com/AlexsLemonade/OpenScPCA-analysis) | +| `openscpca_celltype_nf_version` | If cell type annotations from the OpenScPCA project are available, the version of the [`OpenScPCA-nf` workflow](https://github.com/AlexsLemonade/OpenScPCA-nf) used to generate annotations | +| `openscpca_celltype_release_date` | If cell type annotations from the OpenScPCA project are available, the release date for the input ScPCA data used when assigning annotations | +| `singler_results` | If cell typing with `SingleR` was performed, the full [`DataFrame`](https://rdrr.io/bioc/S4Vectors/man/DataFrame-class.html) result object returned by `SingleR` annotation. 
Only present for `processed` objects | | `singler_reference` | If cell typing with `SingleR` was performed, the name of the reference dataset used for annotation. Only present for `processed` objects | | `singler_reference_label` | If cell typing with `SingleR` was performed, the name of the label in the reference dataset used for annotation. Only present for `processed` objects | | `singler_reference_source` | If cell typing with `SingleR` was performed, the source of the reference dataset (default is [`celldex`](http://bioconductor.org/packages/release/data/experiment/html/celldex.html)). Only present for `processed` objects | @@ -163,6 +185,14 @@ metadata(sce) # experiment metadata | `cellassign_reference_organs` | If cell typing with `CellAssign` was performed and completed successfully, a comma-separated list of organs and/or tissue compartments from which marker genes were obtained to create the reference. Only present for `processed` objects | | `cellassign_reference_source` | If cell typing with `CellAssign` was performed and completed successfully, the source of the reference dataset (default is [`PanglaoDB`](https://panglaodb.se/)). Only present for `processed` objects | | `cellassign_reference_version` | If cell typing with `CellAssign` was performed and completed successfully, the version of the reference dataset source. For references obtained from `PanglaoDB`, the version scheme is a date in ISO8601 format. Only present for `processed` objects | +| `scimilarity_model` | If cell typing with `SCimilarity` was performed, the name of the foundation model used to run `SCimilarity` | +| `consensus_celltype_methods` | If consensus cell types are present, a vector with the names of the automated methods used to generate consensus cell type annotations | +| `infercnv_reference_celltypes` | Vector of consensus cell type labels which would be specified for the normal reference cells in `inferCNV` reference. 
Only present when consensus cell types were assigned and CNV inference was attempted. If CNV inference was attempted and the sample is obtained from normal or non-cancerous tissue this will not be present | +| `infercnv_num_reference_cells` | The total number of normal reference cells included in the `inferCNV` reference. Only present when consensus cell types were assigned and CNV inference was attempted. If CNV inference was attempted and the sample is obtained from normal or non-cancerous tissue this will be `NA` | +| `infercnv_diagnosis_groups` | The broad diagnosis group used to determine which cell types should be considered as part of the normal cell reference for `inferCNV`. If this value is `Non-cancerous`, `inferCNV` was not run | +| `infercnv_success` | Boolean indicating if `inferCNV` succeeded. `TRUE` indicates that `inferCNV` successfully ran, `FALSE` indicates that `inferCNV` was attempted but failed during inference for an unknown reason, and `NA` indicates that `inferCNV` was not run due to insufficient normal reference cells. Only present when CNV inference was attempted | +| `infercnv_options` | A list containing the contents of the `options` slot from the [final `inferCNV` results object](https://github.com/broadinstitute/infercnv/wiki/Output-Files#infercnv-object-files), which holds all parameters used when running `inferCNV`. Only present when CNV inference was performed | +| `infercnv_table` | The `inferCNV` metadata table with detailed CNV results output from `inferCNV`. This data frame corresponds to the [`map_metadata_from_infercnv.txt` table calculated from the HMM results](https://github.com/broadinstitute/infercnv/wiki/Extracting-features). Only present when CNV inference was performed | ### SingleCellExperiment sample metadata @@ -443,8 +473,13 @@ adata_object.uns # experiment metadata All of the object metadata included in `SingleCellExperiment` objects are present in the `.uns` slot of the `AnnData` object.
To see a full description of the included columns, see the [section on experiment metadata in `Components of a SingleCellExperiment object`](#singlecellexperiment-experiment-metadata). -The only exception is that the `AnnData` object _does not_ contain the `sample_metadata` item in the `.uns` slot. +There are two exceptions to this: + +* The `AnnData` object does not contain the `sample_metadata` item in the `.uns` slot. Instead, the contents of the `sample_metadata` data frame are stored in the cell-level metadata (`.obs`). +* The `AnnData` object does not contain any metadata fields whose type could not be automatically converted to a Python object. +This includes any `list` type fields present in the `SingleCellExperiment` metadata. + The `AnnData` object also includes the following additional items in the `.uns` slot: