OmicsML/scLinguist

scLinguist: a Foundation Model for Cross-Modality Translation in Single-Cell Omics

Overview

In this work, we introduce scLinguist, a novel cross-modal foundation model based on an encoder–decoder architecture, designed to predict protein abundance from single-cell transcriptomic profiles. Drawing inspiration from multilingual translation models, scLinguist adopts a two-stage learning paradigm: it first performs modality-specific pretraining on large-scale unpaired omics data (e.g., RNA or protein) to capture intra-modality expression patterns, and then conducts post-pretraining on paired RNA–protein data to learn cross-modality mappings. This strategy enables the model to integrate knowledge from both data-rich and data-scarce scenarios, enhancing its generalizability and robustness across diverse biological contexts.
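The two-stage paradigm described above can be summarized as a small training plan. The sketch below is illustrative only: `TrainingStage` and `build_training_plan` are hypothetical names, not part of the scLinguist package, and modeling each pretraining step as a mapping from a modality to itself is an assumption made for the sake of the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingStage:
    """One step of a hypothetical training plan (not the scLinguist API)."""
    name: str
    source: str   # input modality
    target: str   # output modality
    paired: bool  # True if each cell has both RNA and protein measured


def build_training_plan() -> List[TrainingStage]:
    """Return the stages in the order the overview describes them."""
    return [
        # Stage 1: modality-specific pretraining on unpaired data.
        # Assumption: each modality is modeled on its own (source == target).
        TrainingStage("pretrain_rna", "RNA", "RNA", paired=False),
        TrainingStage("pretrain_protein", "protein", "protein", paired=False),
        # Stage 2: post-pretraining on paired RNA-protein data
        # to learn the cross-modality mapping.
        TrainingStage("post_pretrain", "RNA", "protein", paired=True),
    ]


if __name__ == "__main__":
    for stage in build_training_plan():
        kind = "paired" if stage.paired else "unpaired"
        print(f"{stage.name}: {stage.source} -> {stage.target} ({kind})")
```

Only the final stage needs cells with both modalities measured, which is why the scarcer paired data is reserved for post-pretraining.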

Installation

We tested our code on a server running Ubuntu 18.04.5 LTS, equipped with NVIDIA H100 GPUs.

git clone https://github.com/OmicsML/scLinguist
cd scLinguist
conda create -n scLinguist python=3.8.8
conda activate scLinguist
pip install -r requirements.txt

# install torch
pip install torch==2.1.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html

python setup.py install
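After running the commands above, it can help to confirm that the environment matches what was installed. The following is a minimal sketch, not part of the scLinguist repository; `check_environment` is a hypothetical helper that only reports the Python version, whether PyTorch is importable, and whether PyTorch can see a CUDA device.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Collect basic facts about the current Python/PyTorch setup."""
    report = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": None,
    }
    if report["torch_installed"]:
        # Imported lazily so the check also runs where torch is missing.
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

With the install steps above, one would expect the reported Python version to be 3.8.8 and the PyTorch version to start with 2.1.1; a `cuda_available` value of `False` usually points to a driver or wheel mismatch rather than a problem with scLinguist itself.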

Tutorial

TODO: We provide detailed tutorials on applying scLinguist to various tasks. Please refer to https://scLinguist.readthedocs.io/en/latest/.

Pre-trained model

The pre-trained models can be downloaded from these links.

| Model name | Description | Download |
| --- | --- | --- |
| scLinguist pretrained RNA | Pretrained on over 15 million human cells. | link |
| scLinguist pretrained Protein | Pretrained on over 11 million human cells. | link |
| scLinguist post-pretrained RNA-Protein | Post-Pretrained on 3 million paired cells. | link |

Data

Source of public datasets:

  1. BM dataset: CITE-seq
  2. BMMC dataset: CITE-seq
  3. CBMC dataset: CITE-seq
  4. PBMC dataset: REAP-seq
  5. Perturb dataset: ECCITE-seq
  6. Heart dataset: CITE-seq
  7. Spatial dataset: 10x Visium

About

The official repo for scLinguist
