DeepTaxa


DeepTaxa is a deep learning framework designed for hierarchical taxonomy classification of 16S rRNA gene sequences.


Table of Contents

  1. Overview
  2. Key Features
  3. Installation
  4. Data and Pre-Trained Models
  5. Usage
  6. Troubleshooting
  7. License
  8. Citation
  9. Contact
  10. Acknowledgements

Overview

DeepTaxa classifies 16S rRNA gene sequences into taxonomic hierarchies, from domain to species. It provides a command-line interface for training, checkpoint inspection, and prediction, with a choice of model architectures that includes a hybrid CNN-BERT approach. Pre-trained models and datasets are hosted on Hugging Face.

Supported Architectures

  • CNNClassifier: A convolutional neural network optimized for extracting local sequence features.
  • BERTClassifier: A BERT-based model that captures global contextual relationships within sequences.
  • HybridCNNBERTClassifier: A hybrid model that combines the CNN's local feature extraction with BERT's global context modeling.

Key Features

  • Hierarchical Taxonomy Prediction: Classifies sequences across seven taxonomic levels in a single pass.
  • Multiple Model Options: Choose from CNN, BERT, or hybrid CNN-BERT architectures based on your needs.
  • Customizable Training: Fine-tune hyperparameters (e.g., learning rate, batch size, epochs) via the CLI.
  • GPU Acceleration: Seamlessly integrates with CUDA-enabled GPUs for faster training and inference.
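
To make the seven-level hierarchy concrete, here is a minimal sketch that splits a lineage string into one label per rank. It assumes the rank-prefixed convention commonly used for Greengenes-style taxonomies (`d__`, `p__`, ..., `s__`); the label files distributed with DeepTaxa may be formatted differently, and `parse_lineage` is a hypothetical helper, not part of the DeepTaxa API.

```python
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]
PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def parse_lineage(lineage: str) -> dict:
    """Split a 'd__...; p__...; ...' string into a rank -> name mapping."""
    result = {rank: None for rank in RANKS}
    for field in lineage.split(";"):
        field = field.strip()
        for rank, prefix in zip(RANKS, PREFIXES):
            if field.startswith(prefix):
                name = field[len(prefix):]
                result[rank] = name or None  # empty name means unassigned
                break
    return result


# Illustrative lineage string (assumed format, not taken from the dataset).
example = (
    "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
    "o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia; "
    "s__Escherichia coli"
)
labels = parse_lineage(example)
for rank in RANKS:
    print(f"{rank}: {labels[rank]}")
```

A hierarchical classifier predicts one label per rank, so each sequence ends up with seven outputs rather than a single class.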

Installation

DeepTaxa requires Python 3.10 or later. Installing it inside a Conda environment is recommended for dependency management.

Dependencies

Dependencies are specified in pyproject.toml and will be installed automatically during setup. Key requirements include:

  • torch
  • transformers
  • pandas
  • numpy
  • tqdm
  • scikit-learn
  • biopython
  • h5py
  • optuna

Installation Steps

  1. Clone the Repository:
    git clone https://github.com/systems-genomics-lab/deeptaxa.git
    cd deeptaxa
  2. Set Up a Conda Environment:
    conda create --name deeptaxa_env python=3.10 -y
    conda activate deeptaxa_env
  3. Install DeepTaxa and Dependencies:
    pip install .  # Installs DeepTaxa along with dependencies from pyproject.toml
  4. Verify Installation:
    deeptaxa --version  # Displays the installed DeepTaxa version

Note: For GPU support, ensure PyTorch is installed with CUDA compatibility; refer to the PyTorch website for details. You may need to install a PyTorch build that matches your CUDA setup before running the pip install step above.
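
A quick way to check the environment after installation is sketched below. It uses only the standard library plus `torch.cuda.is_available()`, and falls back to CPU when PyTorch is not importable. `pick_device` is a hypothetical helper, not part of DeepTaxa; its return value matches the `cuda` and `cpu` values accepted by the `--device` option.

```python
import importlib.util
import sys


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a CUDA GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


python_ok = sys.version_info >= (3, 10)  # DeepTaxa requires Python 3.10+
device = pick_device()
print(f"Python >= 3.10: {python_ok}; suggested --device value: {device}")
```

If this reports `cpu` on a machine with a GPU, the installed PyTorch build most likely lacks CUDA support.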


Data and Pre-Trained Models

To maintain a lightweight repository, datasets and pre-trained models are hosted externally and should be stored in a separate directory (e.g., deeptaxa-data) outside the codebase. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated deeptaxa-outputs directory.

Directory Structure Example

Here’s how the folders should be organized relative to each other:

working_directory/
├── deeptaxa/                  # Cloned repository folder (codebase)
│   ├── LICENSE                # LICENSE
│   ├── README.md              # This file
│   ├── pyproject.toml         # Configuration file
│   ├── deeptaxa/              # Source code subdirectory
│   └── scripts/               # Supplementary scripts
├── deeptaxa-data/             # External folder for datasets and models
│   ├── greengenes/            # Subdirectory for Greengenes dataset
│   │   ├── gg_2024_09_training.fna.gz
│   │   ├── gg_2024_09_training.tsv.gz
│   │   ├── gg_2024_09_testing.fna.gz
│   │   └── gg_2024_09_testing.tsv.gz
│   └── models/                # Subdirectory for pre-trained models
│       ├── config.json
│       └── deeptaxa_april_2025.pt
└── deeptaxa-outputs/          # External folder for generated outputs
    ├── model_checkpoint.pt    # Trained model checkpoint
    ├── predictions/           # Subdirectory for prediction outputs
    │   ├── predictions.json
    │   └── predictions.tsv
    └── metrics/               # Subdirectory for exported metrics
        └── model_description.json

Commands in the Usage section are run from within deeptaxa/, using ../ to access deeptaxa-data/ and deeptaxa-outputs/.
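
The external folders can also be created programmatically. The sketch below builds the data and output subdirectories from the tree above with `pathlib`; `make_layout` is a hypothetical helper, and the demo runs in a temporary directory so it has no side effects.

```python
import tempfile
from pathlib import Path

# Subdirectories taken from the directory tree above.
SUBDIRS = [
    "deeptaxa-data/greengenes",
    "deeptaxa-data/models",
    "deeptaxa-outputs/predictions",
    "deeptaxa-outputs/metrics",
]


def make_layout(working_directory: Path) -> list[Path]:
    """Create the data/output folders under working_directory if missing."""
    created = []
    for sub in SUBDIRS:
        path = working_directory / sub
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstrated in a temporary directory; from inside deeptaxa/ you would
# call make_layout(Path("..")) instead.
with tempfile.TemporaryDirectory() as tmp:
    created = make_layout(Path(tmp))
    layout_ok = all(p.is_dir() for p in created)

print(layout_ok)
```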

Datasets

DeepTaxa uses the Greengenes2 database, which has been modified, reformatted, and made available on Hugging Face 🤗.

  • Available Files:

| File Name | Type | Number of Sequences | Size |
|---|---|---|---|
| gg_2024_09_training.fna.gz | Training FASTA (sequences) | 277,336 | ~96.4 MB |
| gg_2024_09_training.tsv.gz | Training TSV (taxonomy labels) | 277,336 | ~2.6 MB |
| gg_2024_09_testing.fna.gz | Testing FASTA (sequences) | 69,335 | ~24.1 MB |
| gg_2024_09_testing.tsv.gz | Testing TSV (taxonomy labels) | 69,335 | ~0.8 MB |

Download Instructions

# Run from working_directory/ (the parent of deeptaxa/) so the data sits next to the repository
mkdir -p deeptaxa-data/greengenes
cd deeptaxa-data/greengenes
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.fna.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.tsv.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.fna.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.tsv.gz
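
After downloading, a simple sanity check is to count the FASTA records and compare the result with the sequence counts listed above (277,336 for training, 69,335 for testing). The sketch below counts header lines in a gzip-compressed FASTA file using only the standard library. `count_fasta_records` is a hypothetical helper, and the demo uses a tiny synthetic file rather than the real dataset.

```python
import gzip
import tempfile
from pathlib import Path


def count_fasta_records(path) -> int:
    """Count header lines (starting with '>') in a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Self-contained demo on a synthetic file with three records.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fna.gz"
    with gzip.open(demo, "wt") as handle:
        handle.write(">seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nACGT\n")
    n_records = count_fasta_records(demo)

print(n_records)  # 3
```

For the real data, call `count_fasta_records("deeptaxa-data/greengenes/gg_2024_09_training.fna.gz")` from the working directory. A truncated download usually raises an error from `gzip` or yields a lower count.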

Pre-Trained Models

Pre-trained models are available for immediate use and hosted on Hugging Face 🤗.

  • Hybrid CNN-BERT Model: deeptaxa_april_2025.pt
    • A hybrid CNN-BERT model trained on the Greengenes dataset, providing high-accuracy predictions across all taxonomic levels.
    • Includes a config.json file with model metadata.
  • License: MIT

Download Instructions

# Run from working_directory/ (the parent of deeptaxa/) so the models sit next to the repository
mkdir -p deeptaxa-data/models
cd deeptaxa-data/models
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/config.json

Note: The deeptaxa_april_2025.pt file uses PyTorch’s default serialization with pickle. This may trigger a security warning on Hugging Face due to potential risks when loading untrusted files. Ensure you download it directly from the official repository and use it in a secure environment.
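
One way to gain confidence in a downloaded checkpoint before loading it is to compare its SHA-256 digest with a value obtained from a trusted source. The sketch below computes the digest in chunks so large files are not read into memory at once. This README does not publish a reference hash, so where you obtain the expected value is an assumption (hosting services such as Hugging Face typically display one on the file's page). `sha256_of_file` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Self-contained demo on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"abc")
    demo_digest = sha256_of_file(demo)

print(demo_digest)
```

For the real checkpoint, compare `sha256_of_file("deeptaxa-data/models/deeptaxa_april_2025.pt")` with the trusted value and load the file only if they match.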


Usage

DeepTaxa offers a command-line interface (deeptaxa.cli) for training, checkpoint inspection, and prediction. Run all commands from the deeptaxa/ directory after installation, and write outputs such as model checkpoints, predictions, and metrics to a dedicated deeptaxa-outputs directory outside the codebase. Replace the file paths in the examples below with your own data and output locations.

Quick Start

To train and predict with DeepTaxa:

  1. Install DeepTaxa from source (see Installation).
  2. Download the pre-trained hybrid model:
    # -P saves the file into the given directory, so you stay in deeptaxa/
    mkdir -p ../deeptaxa-data/models
    wget -P ../deeptaxa-data/models https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt

    Note: If you only want to perform predictions with the pre-trained model, you do not need to download the Greengenes dataset files. The dataset is required only for training a new model.

  3. Create the outputs directory:
    mkdir -p ../deeptaxa-outputs  # Creates the external outputs folder if it doesn’t exist
  4. Train a hybrid CNN-BERT model (optional, if you want to train your own; requires the Greengenes dataset):
    • Download the Greengenes dataset (see Data and Pre-Trained Models).
    • Run the training command:
      # --fasta-file: training sequences; --taxonomy-file: training labels
      # --model-type hybrid: hybrid CNN-BERT architecture
      # --output-dir: directory for the trained model checkpoint
      deeptaxa train \
        --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
        --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
        --model-type hybrid \
        --output-dir ../deeptaxa-outputs/
  5. Predict on your data using the pre-trained model:
    # --fasta-file: test sequences; --checkpoint: pre-trained model
    # --output-dir: directory for prediction results
    deeptaxa predict \
      --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
      --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
      --output-dir ../deeptaxa-outputs/predictions

Training a Model

Train a new model using the Greengenes dataset:

# --fasta-file: input FASTA file with sequences
# --taxonomy-file: taxonomy labels for training
# --model-type: model architecture (cnn, bert, or hybrid)
# --output-dir: where to save the trained model
# --epochs, --batch-size, --learning-rate: training hyperparameters
# --device: cuda (GPU) or cpu
deeptaxa train \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
  --model-type hybrid \
  --output-dir ../deeptaxa-outputs/ \
  --epochs 10 \
  --batch-size 16 \
  --learning-rate 1e-4 \
  --device cuda
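
To compare several hyperparameter settings, the training command can be generated from a script. The sketch below builds one `deeptaxa train` argument list per combination of learning rate and batch size, using only the options shown in this README. The per-run output directory names are a hypothetical scheme, and whether the CLI creates missing output directories is not confirmed here, so create them first if needed.

```python
import itertools
import shlex

TRAIN_FASTA = "../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz"
TRAIN_TAXONOMY = "../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz"


def build_train_command(learning_rate, batch_size, output_dir,
                        model_type="hybrid", epochs=10, device="cuda"):
    """Assemble a `deeptaxa train` argument list from the documented options."""
    return [
        "deeptaxa", "train",
        "--fasta-file", TRAIN_FASTA,
        "--taxonomy-file", TRAIN_TAXONOMY,
        "--model-type", model_type,
        "--output-dir", output_dir,
        "--epochs", str(epochs),
        "--batch-size", str(batch_size),
        "--learning-rate", str(learning_rate),
        "--device", device,
    ]


commands = []
for lr, bs in itertools.product([1e-4, 5e-5], [16, 32]):
    # Hypothetical naming scheme: one output directory per setting.
    out_dir = f"../deeptaxa-outputs/lr{lr}_bs{bs}/"
    commands.append(build_train_command(lr, bs, out_dir))

for cmd in commands:
    # Printed for review; to execute, use subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues and sidesteps line-continuation pitfalls entirely.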

Inspecting a Checkpoint

Examine a pre-trained model’s metadata and performance metrics:

# --checkpoint: path to the pre-trained model
# --export-metrics: where to save metadata and metrics
deeptaxa describe \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --export-metrics ../deeptaxa-outputs/metrics/model_description.json

Making Predictions

Classify sequences from a FASTA file using a trained model:

# --fasta-file: input FASTA file for prediction
# --checkpoint: trained model checkpoint
# --output-dir: directory for prediction outputs
# --top-k: number of top predictions per level
# --tabular: also export results in TSV format
deeptaxa predict \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --output-dir ../deeptaxa-outputs/predictions \
  --top-k 3 \
  --tabular

Output Files

  • ../deeptaxa-outputs/predictions/predictions.json: Detailed predictions with confidence scores and uncertainty metrics.
  • ../deeptaxa-outputs/predictions/predictions.tsv: Tabular format for downstream analysis.
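
For downstream analysis, the TSV output can be read with the standard `csv` module. The sketch below tallies the values in one column. The column names of predictions.tsv are not documented in this README, so `sequence_id` and `genus` are hypothetical placeholders; inspect the header of your own file and adjust accordingly.

```python
import csv
import io
from collections import Counter


def tally_column(tsv_handle, column: str) -> Counter:
    """Count how often each value appears in one column of a TSV file."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    fieldnames = reader.fieldnames or []
    if column not in fieldnames:
        raise KeyError(f"column {column!r} not found; available: {fieldnames}")
    return Counter(row[column] for row in reader)


# Synthetic stand-in for predictions.tsv (hypothetical column names).
demo = io.StringIO(
    "sequence_id\tgenus\n"
    "seq1\tEscherichia\n"
    "seq2\tBacillus\n"
    "seq3\tEscherichia\n"
)
counts = tally_column(demo, "genus")
print(counts.most_common(1))  # [('Escherichia', 2)]
```

With a real file, open it with `open(path, newline="")` and pass the handle to `tally_column`.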

Tip: Use --help with any CLI command (e.g., python -m deeptaxa.cli train --help) for a full list of options. Ensure the deeptaxa-outputs directory exists (e.g., mkdir -p ../deeptaxa-outputs) before running commands.

Demo

For a full demo of working with DeepTaxa, see the following notebooks:


License

  • Code & Models: MIT License
  • Greengenes Dataset: Modified BSD License

The Greengenes dataset used in DeepTaxa is a modified version of the Greengenes2 dataset, distributed under the terms of the Modified BSD License. For full license details, see the dataset repository on Hugging Face.


Citation

If DeepTaxa contributes to your research, please cite:

@software{DeepTaxa,
  author = {{Systems Genomics Lab}},
  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/systems-genomics-lab/deeptaxa},
}

For the Greengenes dataset, cite:
DeSantis TZ, et al. (2006). Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Applied and Environmental Microbiology. DOI:10.1128/AEM.03006-05.


Contact

To report bugs, suggest features, or submit code, please open an issue on GitHub.


Acknowledgements

