DeepTaxa is a deep learning framework designed for hierarchical taxonomy classification of 16S rRNA gene sequences.
- Overview
- Key Features
- Installation
- Data and Pre-Trained Models
- Usage
- Troubleshooting
- License
- Citation
- Contact
- Acknowledgements
DeepTaxa is a deep learning framework for classifying 16S rRNA gene sequences into taxonomic hierarchies, from domain to species. It provides a straightforward command-line interface and flexible model options, including a hybrid CNN-BERT architecture, for efficient analysis of 16S rRNA sequence datasets. Pre-trained models and datasets are hosted on Hugging Face to support classification, training, and prediction tasks.
- CNNClassifier: A convolutional neural network optimized for extracting local sequence features.
- BERTClassifier: A BERT-based model that captures global contextual relationships within sequences.
- HybridCNNBERTClassifier: A hybrid approach combining CNN and BERT for superior accuracy and robustness.
- Hierarchical Taxonomy Prediction: Classifies sequences across seven taxonomic levels (domain through species) in a single pass; see the rank-parsing sketch after this list.
- Multiple Model Options: Choose from CNN, BERT, or hybrid CNN-BERT architectures based on your needs.
- Customizable Training: Fine-tune hyperparameters (e.g., learning rate, batch size, epochs) via the CLI.
- GPU Acceleration: Seamlessly integrates with CUDA-enabled GPUs for faster training and inference.
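To make the seven-rank hierarchy concrete, here is a minimal sketch that splits a Greengenes2/GTDB-style lineage string into its domain-to-species levels. The `d__ ... s__` prefix convention and the example lineage are assumptions for illustration; the actual label format in DeepTaxa's taxonomy files may differ.

```python
# Hypothetical lineage string in the GTDB/Greengenes2 prefix style (d__ ... s__);
# the real label format in the DeepTaxa taxonomy TSVs may differ.
lineage = (
    "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__Lactobacillus iners"
)

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

# Strip the single-letter rank prefixes and pair each value with its rank name
values = [field.strip().split("__", 1)[1] for field in lineage.split(";")]
for rank, value in zip(RANKS, values):
    print(f"{rank:>8}: {value}")
```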
DeepTaxa requires Python 3.10 or later; installing it inside a Conda environment is recommended for dependency management.
Dependencies are specified in `pyproject.toml` and are installed automatically during setup. Key requirements include:
- torch
- transformers
- pandas
- numpy
- tqdm
- scikit-learn
- biopython
- h5py
- optuna
- Clone the Repository:
git clone https://github.com/systems-genomics-lab/deeptaxa.git
cd deeptaxa
- Set Up a Conda Environment:
conda create --name deeptaxa_env python=3.10 -y
conda activate deeptaxa_env
- Install DeepTaxa and Dependencies:
pip install . # Installs DeepTaxa along with dependencies from pyproject.toml
- Verify Installation:
deeptaxa --version # Displays the installed DeepTaxa version
Note: For GPU support, ensure PyTorch is installed with CUDA compatibility; refer to the PyTorch website for details. You may need to install a specific PyTorch version compatible with your CUDA setup before running `pip install .`.
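As an illustration (the exact wheel index depends on your CUDA version; the `cu121` suffix below is only an example, so check the PyTorch site for the command matching your setup), a CUDA-enabled PyTorch build can be installed first and GPU visibility checked afterwards:

```bash
# Example only: install a CUDA 12.1 build of PyTorch (adjust cu121 to your CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install DeepTaxa on top of the pre-installed PyTorch
pip install .

# Confirm that PyTorch can see the GPU
python -c "import torch; print(torch.cuda.is_available())"
```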
To keep the repository lightweight, datasets and pre-trained models are hosted externally and should be stored in a separate directory (e.g., `deeptaxa-data/`) outside the codebase. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated `deeptaxa-outputs/` directory.
Here’s how the folders should be organized relative to each other:
working_directory/
├── deeptaxa/ # Cloned repository folder (codebase)
│ ├── LICENSE # LICENSE
│ ├── README.md # This file
│ ├── pyproject.toml # Configuration file
│ ├── deeptaxa/ # Source code subdirectory
│ └── scripts/ # Supplementary scripts
├── deeptaxa-data/ # External folder for datasets and models
│ ├── greengenes/ # Subdirectory for Greengenes dataset
│ │ ├── gg_2024_09_training.fna.gz
│ │ ├── gg_2024_09_training.tsv.gz
│ │ ├── gg_2024_09_testing.fna.gz
│ │ └── gg_2024_09_testing.tsv.gz
│ └── models/ # Subdirectory for pre-trained models
│ └── deeptaxa_april_2025.pt
└── deeptaxa-outputs/ # External folder for generated outputs
├── model_checkpoint.pt # Trained model checkpoint
├── predictions/ # Subdirectory for prediction outputs
│ ├── predictions.json
│ └── predictions.tsv
└── metrics/ # Subdirectory for exported metrics
└── model_description.json
Commands in the Usage section are run from within `deeptaxa/`, using `../` to access `deeptaxa-data/` and `deeptaxa-outputs/`.
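Assuming an empty working directory, the following commands recreate this layout (the directory names are the ones used throughout this README and can be renamed, provided the relative paths in later commands are updated accordingly):

```bash
# Clone the codebase and create the external data and output folders next to it
git clone https://github.com/systems-genomics-lab/deeptaxa.git
mkdir -p deeptaxa-data/greengenes deeptaxa-data/models
mkdir -p deeptaxa-outputs/predictions deeptaxa-outputs/metrics
cd deeptaxa   # Usage commands in this README are run from here
```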
DeepTaxa uses the Greengenes2 dataset, modified, reformatted, and made available on Hugging Face 🤗.
- Available Files:
| File Name | Type | Number of Sequences | Size |
|---|---|---|---|
| `gg_2024_09_training.fna.gz` | Training FASTA (sequences) | 277,336 | ~96.4 MB |
| `gg_2024_09_training.tsv.gz` | Training TSV (taxonomy labels) | 277,336 | ~2.6 MB |
| `gg_2024_09_testing.fna.gz` | Testing FASTA (sequences) | 69,335 | ~24.1 MB |
| `gg_2024_09_testing.tsv.gz` | Testing TSV (taxonomy labels) | 69,335 | ~0.8 MB |
mkdir -p deeptaxa-data/greengenes
cd deeptaxa-data/greengenes
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.fna.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.tsv.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.fna.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.tsv.gz
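After downloading, a quick sanity check can be run from the cloned `deeptaxa/` directory. The sketch below assumes only standard gzipped FASTA and tab-separated files (not any particular column layout): it counts the training sequences (expected: 277,336) and previews the label table.

```python
import gzip

import pandas as pd
from Bio import SeqIO

# Count records in the gzipped training FASTA (expected: 277,336)
with gzip.open("../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz", "rt") as handle:
    n_sequences = sum(1 for _ in SeqIO.parse(handle, "fasta"))
print(f"training sequences: {n_sequences}")

# Preview the taxonomy label table without assuming its column names
labels = pd.read_csv(
    "../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz",
    sep="\t",
    compression="gzip",
)
print(labels.shape)
print(labels.head())
```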
Pre-trained models are available for immediate use and hosted on Hugging Face 🤗.
- Hybrid CNN-BERT Model: `deeptaxa_april_2025.pt`
  - A hybrid CNN-BERT model trained on the Greengenes dataset, providing high-accuracy predictions across all taxonomic levels.
  - Includes a `config.json` file with model metadata.
  - License: MIT
mkdir -p deeptaxa-data/models
cd deeptaxa-data/models
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/config.json
Note: The `deeptaxa_april_2025.pt` file uses PyTorch’s default serialization with `pickle`. This may trigger a security warning on Hugging Face due to potential risks when loading untrusted files. Ensure you download it directly from the official repository and use it in a secure environment.
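If you want to inspect the checkpoint outside the CLI, a minimal PyTorch sketch is shown below. It makes no assumptions about the checkpoint's internal structure, and because the file is pickle-serialized it should only be loaded with `weights_only=False` when obtained from the official repository.

```python
import torch

# Load the checkpoint on CPU. The file uses pickle-based serialization, so
# weights_only=True may fail on recent PyTorch versions; only disable it for
# files downloaded from a source you trust.
checkpoint = torch.load(
    "../deeptaxa-data/models/deeptaxa_april_2025.pt",
    map_location="cpu",
    weights_only=False,
)

# Inspect what the checkpoint contains instead of assuming its layout
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))
else:
    print(type(checkpoint))
```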
DeepTaxa offers a versatile command-line interface (`deeptaxa.cli`) for training, checkpoint inspection, and prediction tasks. All commands should be run from the `deeptaxa/` directory after installation. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated `deeptaxa-outputs` directory outside the codebase. Replace the file paths in the examples below with your local data and output locations.
To train and predict with DeepTaxa:
- Install DeepTaxa from source (see Installation).
- Download the pre-trained hybrid model:
mkdir -p deeptaxa-data/models
cd deeptaxa-data/models
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
Note: If you only want to perform predictions with the pre-trained model, you do not need to download the Greengenes dataset files. The dataset is required only for training a new model.
- Create the outputs directory:
mkdir -p ../deeptaxa-outputs # Creates the external outputs folder if it doesn’t exist
- Train a hybrid CNN-BERT model (optional, if you want to train your own; requires the Greengenes dataset):
- Download the Greengenes dataset (see Data).
- Run the training command:
# Trains a hybrid CNN-BERT model on the Greengenes training data and saves the
# checkpoint to the output directory.
deeptaxa train \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
  --model-type hybrid \
  --output-dir ../deeptaxa-outputs/
- Predict on your data using the pre-trained model:
# Runs prediction with the pre-trained model on the test sequences and saves
# the results to the predictions directory.
deeptaxa predict \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --output-dir ../deeptaxa-outputs/predictions
Train a new model using the Greengenes dataset:
# Initiates model training.
#   --fasta-file      Input FASTA file with sequences
#   --taxonomy-file   Taxonomy labels for training
#   --model-type      Model architecture: cnn, bert, or hybrid
#   --output-dir      Where to save the trained model
#   --epochs          Number of training epochs
#   --batch-size      Batch size for training
#   --learning-rate   Learning rate for optimization
#   --device          Use GPU (cuda) or CPU (cpu)
deeptaxa train \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
  --model-type hybrid \
  --output-dir ../deeptaxa-outputs/ \
  --epochs 10 \
  --batch-size 16 \
  --learning-rate 1e-4 \
  --device cuda
Examine a pre-trained model’s metadata and performance metrics:
# Describes a model checkpoint.
#   --checkpoint       Path to the pre-trained model
#   --export-metrics   Where to save metadata and metrics
deeptaxa describe \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --export-metrics ../deeptaxa-outputs/metrics/model_description.json
Classify sequences from a FASTA file using a trained model:
# Generates taxonomic predictions.
#   --fasta-file   Input FASTA file for prediction
#   --checkpoint   Path to the trained model checkpoint
#   --output-dir   Directory to save prediction outputs
#   --top-k        Number of top predictions per level
#   --tabular      Also export results in TSV format
deeptaxa predict \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --output-dir ../deeptaxa-outputs/predictions \
  --top-k 3 \
  --tabular
- `../deeptaxa-outputs/predictions/predictions.json`: Detailed predictions with confidence scores and uncertainty metrics.
- `../deeptaxa-outputs/predictions/predictions.tsv`: Tabular format for downstream analysis.
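For downstream work, both outputs can be loaded with standard tooling. The sketch below assumes only that `predictions.json` is valid JSON and `predictions.tsv` is tab-separated; the field and column names are inspected rather than assumed.

```python
import json

import pandas as pd

# Inspect the detailed JSON predictions without assuming their structure
with open("../deeptaxa-outputs/predictions/predictions.json") as fh:
    detailed = json.load(fh)
print(type(detailed))

# Load the tabular predictions for downstream analysis (e.g., joins with sample metadata)
table = pd.read_csv("../deeptaxa-outputs/predictions/predictions.tsv", sep="\t")
print(table.columns.tolist())
print(table.head())
```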
Tip: Use `--help` with any CLI command (e.g., `python -m deeptaxa.cli train --help`) for a full list of options. Ensure the `deeptaxa-outputs` directory exists (e.g., `mkdir -p ../deeptaxa-outputs`) before running commands.
For a full demo of working with DeepTaxa, see the following notebooks:
- `deeptaxa_prediction.ipynb`: making taxonomic classifications with the pre-trained model `deeptaxa_april_2025.pt`
- `deeptaxa_workflow.ipynb`: training a fresh model, resuming training on an existing model, and making predictions
- Code & Models: MIT License
- Greengenes Dataset: Modified BSD License
The Greengenes dataset used in DeepTaxa is a modified version of the Greengenes2 dataset, distributed under the terms of the Modified BSD License. For full license details, see the dataset repository on Hugging Face.
If DeepTaxa contributes to your research, please cite:
@software{DeepTaxa,
author = {{Systems Genomics Lab}},
title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
year = {2025},
publisher = {GitHub},
url = {https://github.com/systems-genomics-lab/deeptaxa},
}
For the Greengenes dataset, cite:
DeSantis TZ, et al. (2006). Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Applied and Environmental Microbiology. DOI:10.1128/AEM.03006-05.
To report bugs, suggest features, or submit code, please open an issue on GitHub.
- Dr. Olaitan I. Awe and the Omics Codeathon team for their mentorship and contributions.
- Ahmed A. El Hosseiny and the High-Performance Computing Team of the School of Sciences and Engineering (SSE) at the American University in Cairo (AUC) for their support and for granting access to GPU resources that enabled this work.
- Hugging Face for providing a platform to host the datasets and models.