In this work, we introduce scLinguist, a novel cross-modal foundation model based on an encoder–decoder architecture, designed to predict protein abundance from single-cell transcriptomic profiles. Drawing inspiration from multilingual translation models, scLinguist adopts a two-stage learning paradigm: it first performs modality-specific pretraining on large-scale unpaired omics data (e.g., RNA or protein) to capture intra-modality expression patterns, and then conducts post-pretraining on paired RNA–protein data to learn cross-modality mappings. This strategy enables the model to integrate knowledge from both data-rich and data-scarce scenarios, enhancing its generalizability and robustness across diverse biological contexts.
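The two-stage paradigm above can be sketched abstractly. The toy code below is plain Python, not the actual scLinguist implementation; all names and dimensions are illustrative. It shows the flow: stage 1 trains modality-specific encoders/decoders on unpaired data, and stage 2 composes them into an RNA-to-protein mapping fine-tuned on paired cells.

```python
import random

random.seed(0)

def linear(in_dim, out_dim):
    """A toy dense layer: a weight matrix with small random values."""
    return [[random.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def apply(layer, x):
    """Matrix-vector product: project vector x through the layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in layer]

# Illustrative dimensions (not the real model's).
N_GENES, N_LATENT, N_PROTEINS = 2000, 64, 200

# Stage 1: modality-specific components, pretrainable on *unpaired* omics data.
rna_encoder = linear(N_GENES, N_LATENT)         # RNA -> shared latent space
protein_decoder = linear(N_LATENT, N_PROTEINS)  # shared latent -> protein

# Stage 2: on *paired* RNA-protein cells, the composed path
# RNA -> latent -> protein is post-pretrained to learn the cross-modal mapping.
def rna_to_protein(rna_profile):
    z = apply(rna_encoder, rna_profile)  # encode the transcriptome
    return apply(protein_decoder, z)     # decode predicted protein abundance

cell = [random.random() for _ in range(N_GENES)]  # one toy expression profile
predicted = rna_to_protein(cell)
print(len(predicted))  # 200 predicted protein abundances
```

The point of the sketch is only the wiring: because both modalities meet in a shared latent space, each side can be pretrained independently before the paired post-pretraining stage.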
We tested our code on a server running Ubuntu 18.04.5 LTS, equipped with NVIDIA H100 GPUs.
```shell
git clone https://github.com/OmicsML/scLinguist
cd scLinguist
conda create -n scLinguist python=3.8.8
conda activate scLinguist
pip install -r requirements.txt
# install torch
pip install torch==2.1.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html
python setup.py install
```
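After installation, a quick sanity check can confirm that the key packages resolve. The snippet below uses only the standard library; the import name `scLinguist` is an assumption based on the repository name.

```python
import importlib.util

def check_install(packages):
    """Return a dict mapping each package name to whether it is importable."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}

# "scLinguist" as an import name is an assumption; adjust if the package differs.
status = check_install(["torch", "scLinguist"])
for pkg, found in status.items():
    print(f"{pkg}: {'OK' if found else 'missing'}")
```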
We provide detailed tutorials on applying scLinguist to various tasks. Please refer to https://scLinguist.readthedocs.io/en/latest/.
The pre-trained models can be downloaded from the links below.

| Model name | Description | Download |
|---|---|---|
| scLinguist pretrained RNA | Pretrained on over 15 million human cells. | link |
| scLinguist pretrained Protein | Pretrained on over 11 million human cells. | link |
| scLinguist post-pretrained RNA-Protein | Post-pretrained on 3 million paired RNA-protein cells. | link |
Source of public datasets:
- BM dataset: CITE-seq
- BMMC dataset: CITE-seq
- CBMC dataset: CITE-seq
- PBMC dataset: REAP-seq
- Perturb dataset: ECCITE-seq
- Heart dataset: CITE-seq
- Spatial dataset: 10X Visium