DeepTaxa is a deep learning framework designed for hierarchical taxonomy classification of 16S rRNA gene sequences.
- Overview
- Key Features
- Installation
- Data and Pre-Trained Models
- Usage
- Troubleshooting
- License
- Citation
- Contact
- Acknowledgements
DeepTaxa is a deep learning framework for classifying 16S rRNA gene sequences into taxonomic hierarchies, from domain to species. It provides a straightforward command-line interface and flexible model options, including a hybrid CNN-BERT architecture, for efficient analysis of 16S rRNA sequence datasets. Pre-trained models and datasets are hosted on Hugging Face to support classification, training, and prediction tasks.
- CNNClassifier: A convolutional neural network optimized for extracting local sequence features.
- BERTClassifier: A BERT-based model that captures global contextual relationships within sequences.
- HybridCNNBERTClassifier: A hybrid approach combining CNN and BERT for superior accuracy and robustness.
- Hierarchical Taxonomy Prediction: Classifies sequences across seven taxonomic levels (domain through species) in a single pass; see the rank-parsing sketch after this list.
- Multiple Model Options: Choose from CNN, BERT, or hybrid CNN-BERT architectures based on your needs.
- Customizable Training: Fine-tune hyperparameters (e.g., learning rate, batch size, epochs) via the CLI.
- GPU Acceleration: Seamlessly integrates with CUDA-enabled GPUs for faster training and inference.
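To make the seven-rank hierarchy concrete, here is a minimal sketch that splits a Greengenes2/GTDB-style lineage string into its domain-to-species levels. The `d__ ... s__` prefix convention and the example lineage are assumptions for illustration; the actual label format in DeepTaxa's taxonomy files may differ.

```python
# Hypothetical lineage string in the GTDB/Greengenes2 prefix style (d__ ... s__);
# the real label format in the DeepTaxa taxonomy TSVs may differ.
lineage = (
    "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__Lactobacillus iners"
)

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

# Strip the single-letter rank prefixes and pair each value with its rank name
values = [field.strip().split("__", 1)[1] for field in lineage.split(";")]
for rank, value in zip(RANKS, values):
    print(f"{rank:>8}: {value}")
```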
DeepTaxa requires Python 3.10 or later; installing it inside a Conda environment is recommended for dependency management.
Dependencies are specified in `pyproject.toml` and are installed automatically during setup. Key requirements include:
- torch
- transformers
- pandas
- numpy
- tqdm
- scikit-learn
- biopython
- h5py
- optuna
- Clone the Repository:
git clone https://github.com/systems-genomics-lab/deeptaxa.git
cd deeptaxa
- Set Up a Conda Environment:
conda create --name deeptaxa_env python=3.10 -y
conda activate deeptaxa_env
- Install DeepTaxa and Dependencies:
pip install . # Installs DeepTaxa along with dependencies from pyproject.toml
- Verify Installation:
deeptaxa --version # Displays the installed DeepTaxa version
Note: For GPU support, ensure PyTorch is installed with CUDA compatibility; refer to the PyTorch website for details. You may need to install a specific PyTorch version compatible with your CUDA setup before running `pip install .`.
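As an illustration (the exact wheel index depends on your CUDA version; the `cu121` suffix below is only an example, so check the PyTorch site for the command matching your setup), a CUDA-enabled PyTorch build can be installed first and GPU visibility checked afterwards:

```bash
# Example only: install a CUDA 12.1 build of PyTorch (adjust cu121 to your CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install DeepTaxa on top of the pre-installed PyTorch
pip install .

# Confirm that PyTorch can see the GPU
python -c "import torch; print(torch.cuda.is_available())"
```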
To keep the repository lightweight, datasets and pre-trained models are hosted externally and should be stored in a separate directory (e.g., `deeptaxa-data/`) outside the codebase. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated `deeptaxa-outputs/` directory.
Here’s how the folders should be organized relative to each other:
working_directory/
├── deeptaxa/ # Cloned repository folder (codebase)
│ ├── LICENSE # LICENSE
│ ├── README.md # This file
│ ├── pyproject.toml # Configuration file
│ ├── deeptaxa/ # Source code subdirectory
│ └── scripts/ # Supplementary scripts
├── deeptaxa-data/ # External folder for datasets and models
│ ├── greengenes/ # Subdirectory for Greengenes dataset
│ │ ├── gg_2024_09_training.fna.gz
│ │ ├── gg_2024_09_training.tsv.gz
│ │ ├── gg_2024_09_testing.fna.gz
│ │ └── gg_2024_09_testing.tsv.gz
│ └── models/ # Subdirectory for pre-trained models
│ └── deeptaxa_april_2025.pt
└── deeptaxa-outputs/ # External folder for generated outputs
├── model_checkpoint.pt # Trained model checkpoint
├── predictions/ # Subdirectory for prediction outputs
│ ├── predictions.json
│ └── predictions.tsv
└── metrics/ # Subdirectory for exported metrics
└── model_description.json
Commands in the Usage section are run from within `deeptaxa/`, using `../` to access `deeptaxa-data/` and `deeptaxa-outputs/`.
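Assuming an empty working directory, the following commands recreate this layout (the directory names are the ones used throughout this README and can be renamed, provided the relative paths in later commands are updated accordingly):

```bash
# Clone the codebase and create the external data and output folders next to it
git clone https://github.com/systems-genomics-lab/deeptaxa.git
mkdir -p deeptaxa-data/greengenes deeptaxa-data/models
mkdir -p deeptaxa-outputs/predictions deeptaxa-outputs/metrics
cd deeptaxa   # Usage commands in this README are run from here
```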
DeepTaxa uses the Greengenes2 dataset, modified, reformatted, and made available on Hugging Face 🤗.
- Available Files:
| File Name | Type | Number of Sequences | Size |
|---|---|---|---|
| `gg_2024_09_training.fna.gz` | Training FASTA (sequences) | 277,336 | ~96.4 MB |
| `gg_2024_09_training.tsv.gz` | Training TSV (taxonomy labels) | 277,336 | ~2.6 MB |
| `gg_2024_09_testing.fna.gz` | Testing FASTA (sequences) | 69,335 | ~24.1 MB |
| `gg_2024_09_testing.tsv.gz` | Testing TSV (taxonomy labels) | 69,335 | ~0.8 MB |
mkdir -p deeptaxa-data/greengenes
cd deeptaxa-data/greengenes
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.fna.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.tsv.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.fna.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.tsv.gz
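After downloading, a quick sanity check can be run from the cloned `deeptaxa/` directory. The sketch below assumes only standard gzipped FASTA and tab-separated files (not any particular column layout): it counts the training sequences (expected: 277,336) and previews the label table.

```python
import gzip

import pandas as pd
from Bio import SeqIO

# Count records in the gzipped training FASTA (expected: 277,336)
with gzip.open("../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz", "rt") as handle:
    n_sequences = sum(1 for _ in SeqIO.parse(handle, "fasta"))
print(f"training sequences: {n_sequences}")

# Preview the taxonomy label table without assuming its column names
labels = pd.read_csv(
    "../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz",
    sep="\t",
    compression="gzip",
)
print(labels.shape)
print(labels.head())
```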
Pre-trained models are available for immediate use and hosted on Hugging Face 🤗.
- Hybrid CNN-BERT Model: `deeptaxa_april_2025.pt`
  - A hybrid CNN-BERT model trained on the Greengenes dataset, providing high-accuracy predictions across all taxonomic levels.
  - Includes a `config.json` file with model metadata.
  - License: MIT
mkdir -p deeptaxa-data/models
cd deeptaxa-data/models
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/config.json
Note: The `deeptaxa_april_2025.pt` file uses PyTorch’s default serialization with `pickle`. This may trigger a security warning on Hugging Face due to potential risks when loading untrusted files. Ensure you download it directly from the official repository and use it in a secure environment.
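If you want to inspect the checkpoint outside the CLI, a minimal PyTorch sketch is shown below. It makes no assumptions about the checkpoint's internal structure, and because the file is pickle-serialized it should only be loaded with `weights_only=False` when obtained from the official repository.

```python
import torch

# Load the checkpoint on CPU. The file uses pickle-based serialization, so
# weights_only=True may fail on recent PyTorch versions; only disable it for
# files downloaded from a source you trust.
checkpoint = torch.load(
    "../deeptaxa-data/models/deeptaxa_april_2025.pt",
    map_location="cpu",
    weights_only=False,
)

# Inspect what the checkpoint contains instead of assuming its layout
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))
else:
    print(type(checkpoint))
```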
DeepTaxa offers a versatile command-line interface (`deeptaxa.cli`) for training, checkpoint inspection, and prediction tasks. All commands should be run from the `deeptaxa/` directory after installation. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated `deeptaxa-outputs` directory outside the codebase. Replace the file paths in the examples below with your local data and output locations.
To train and predict with DeepTaxa:
- Install DeepTaxa from source (see Installation).
- Download the pre-trained hybrid model:
mkdir -p deeptaxa-data/models
cd deeptaxa-data/models
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
Note: If you only want to perform predictions with the pre-trained model, you do not need to download the Greengenes dataset files. The dataset is required only for training a new model.
- Create the outputs directory:
mkdir -p ../deeptaxa-outputs # Creates the external outputs folder if it doesn’t exist
- Train a hybrid CNN-BERT model (optional, if you want to train your own; requires the Greengenes dataset):
- Download the Greengenes dataset (see Data).
- Run the training command:
# Trains a hybrid CNN-BERT model on the Greengenes training data and saves the
# checkpoint to the output directory.
deeptaxa train \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
  --model-type hybrid \
  --output-dir ../deeptaxa-outputs/
- Predict on your data using the pre-trained model:
# Runs prediction with the pre-trained model on the test sequences and saves
# the results to the predictions directory.
deeptaxa predict \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --output-dir ../deeptaxa-outputs/predictions
Train a new model using the Greengenes dataset:
# Initiates model training.
#   --fasta-file      Input FASTA file with sequences
#   --taxonomy-file   Taxonomy labels for training
#   --model-type      Model architecture: cnn, bert, or hybrid
#   --output-dir      Where to save the trained model
#   --epochs          Number of training epochs
#   --batch-size      Batch size for training
#   --learning-rate   Learning rate for optimization
#   --device          Use GPU (cuda) or CPU (cpu)
deeptaxa train \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
  --model-type hybrid \
  --output-dir ../deeptaxa-outputs/ \
  --epochs 10 \
  --batch-size 16 \
  --learning-rate 1e-4 \
  --device cuda
Examine a pre-trained model’s metadata and performance metrics:
# Describes a model checkpoint.
#   --checkpoint       Path to the pre-trained model
#   --export-metrics   Where to save metadata and metrics
deeptaxa describe \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --export-metrics ../deeptaxa-outputs/metrics/model_description.json
Classify sequences from a FASTA file using a trained model:
# Generates taxonomic predictions.
#   --fasta-file   Input FASTA file for prediction
#   --checkpoint   Path to the trained model checkpoint
#   --output-dir   Directory to save prediction outputs
#   --top-k        Number of top predictions per level
#   --tabular      Also export results in TSV format
deeptaxa predict \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --output-dir ../deeptaxa-outputs/predictions \
  --top-k 3 \
  --tabular
- `../deeptaxa-outputs/predictions/predictions.json`: Detailed predictions with confidence scores and uncertainty metrics.
- `../deeptaxa-outputs/predictions/predictions.tsv`: Tabular format for downstream analysis.
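For downstream work, both outputs can be loaded with standard tooling. The sketch below assumes only that `predictions.json` is valid JSON and `predictions.tsv` is tab-separated; the field and column names are inspected rather than assumed.

```python
import json

import pandas as pd

# Inspect the detailed JSON predictions without assuming their structure
with open("../deeptaxa-outputs/predictions/predictions.json") as fh:
    detailed = json.load(fh)
print(type(detailed))

# Load the tabular predictions for downstream analysis (e.g., joins with sample metadata)
table = pd.read_csv("../deeptaxa-outputs/predictions/predictions.tsv", sep="\t")
print(table.columns.tolist())
print(table.head())
```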
Tip: Use `--help` with any CLI command (e.g., `python -m deeptaxa.cli train --help`) for a full list of options. Ensure the `deeptaxa-outputs` directory exists (e.g., `mkdir -p ../deeptaxa-outputs`) before running commands.
For a full demo of working with DeepTaxa, see the following notebooks:
- `deeptaxa_prediction.ipynb`: making taxonomic classifications with the pre-trained model `deeptaxa_april_2025.pt`
- `deeptaxa_workflow.ipynb`: training a fresh model, resuming training on an existing model, and making predictions
- Code & Models: MIT License
- Greengenes Dataset: Modified BSD License
The Greengenes dataset used in DeepTaxa is a modified version of the Greengenes2 dataset, distributed under the terms of the Modified BSD License. For full license details, see the dataset repository on Hugging Face.
If DeepTaxa contributes to your research, please cite:
@software{DeepTaxa,
author = {{Systems Genomics Lab}},
title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
year = {2025},
publisher = {GitHub},
url = {https://github.com/systems-genomics-lab/deeptaxa},
}
For the Greengenes dataset, cite:
DeSantis TZ, et al. (2006). Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Applied and Environmental Microbiology. DOI:10.1128/AEM.03006-05.
To report bugs, suggest features, or submit code, please open an issue on GitHub.
- Dr. Olaitan I. Awe and the Omics Codeathon team for their mentorship and contributions.
- Ahmed A. El Hosseiny and the High-Performance Computing Team of the School of Sciences and Engineering (SSE) at the American University in Cairo (AUC) for their support and for granting access to GPU resources that enabled this work.
- Hugging Face for providing a platform to host the datasets and models.