BioFoundation


Copyright (C) 2025 ETH Zurich, Switzerland. SPDX-License-Identifier: Apache-2.0. See LICENSE file for details.

Authors: Thorir Mar Ingolfsson, Anna Tegon, Berkay Döner, Xiaying Wang, Yawei Li & Luca Benini.

About

BioFoundation is a flexible and extensible codebase for deep learning with biological signals. This repository is designed to support a variety of research projects, and currently hosts the work of multiple papers on EEG analysis.

This repository is built on PyTorch Lightning and Hydra to enable reproducible and scalable research.

🤗 Pretrained Weights on Hugging Face

Looking for ready-to-use model weights? We host them on Hugging Face.

Currently available: FEMBA and LUNA (details and download instructions below).

Why FEMBA?

  • Scales to long EEG with linear-time Mamba (no quadratic attention).
  • Strong results on TUAB/TUAR/TUSL with ready task-specific checkpoints.
  • Simple fine-tune path: set CHECKPOINT_DIR, run +experiment=FEMBA_finetune.

➡️ Model hub: https://huggingface.co/thorir/FEMBA
📄 Model card: FEMBA on Hugging Face (benchmarks, protocols, and efficiency notes).
📜 Weights license: CC BY-ND 4.0 (use + redistribute unmodified weights with attribution; no redistribution of modified weights)
🧑‍🍳 PR-gated improvements: If you fine-tune internally and want your variant to become an official FEMBA release, open a PR with configs, logs, and evals. We'll review together; if it looks good, we'll retrain/validate and publish an official FEMBA checkpoint.

What you'll find on the hub

  • TUAB/ → abnormal EEG (base/large)
  • TUAR/ → artifact detection (tiny/base/large)
  • TUSL/ → slowing classification (variants as in the paper)

Quick download with huggingface_hub:

pip install huggingface_hub
from huggingface_hub import snapshot_download

# downloads all task folders (TUAB/TUAR/TUSL) and safetensors into ./checkpoints/FEMBA
snapshot_download(repo_id="thorir/FEMBA", repo_type="model", local_dir="checkpoints/FEMBA")

Use the paths directly in your runs, e.g.:

export DATA_PATH=/path/to/data
export CHECKPOINT_DIR=checkpoints/FEMBA/TUAR/base.safetensors
python -u run_train.py +experiment=FEMBA_finetune

Why LUNA?

  • Topology-agnostic EEG via query-based channel unification (consistent latent across arbitrary montages).
  • Linear-in-channels compute & memory (unifies channels before temporal modeling; no quadratic spatio-temporal attention).
  • Pretrained on >21k hours (TUEG + Siena) with masked-patch reconstruction; strong transfer across datasets/montages.
  • Simple fine-tune path: pick model size with LUNA_{base,large,huge}.yaml, set pretrained_safetensors_path, run +experiment=LUNA_finetune.

➡️ Model hub: https://huggingface.co/thorir/LUNA
📄 Model card: LUNA on Hugging Face (variants, configs, and fine-tuning walkthrough).
📜 Weights license: CC BY-ND 4.0 (use + redistribute unmodified weights with attribution; no redistribution of modified weights)
🧑‍🍳 PR-gated improvements: If you fine-tune internally and want your variant to become an official LUNA release, open a PR with configs, logs, and evals. We'll review; if it looks good, we'll retrain/validate and publish an official LUNA checkpoint.

What you'll find on the hub

  • Base/, Large/, Huge/ → LUNA size variants (matching config/model/LUNA_{base,large,huge}.yaml)
  • Task-specific heads/checkpoints for common TUH downstream tasks (TUAB / TUAR / TUSL)

Quick download with huggingface_hub:

pip install huggingface_hub
from huggingface_hub import snapshot_download

# downloads LUNA folders and .safetensors into ./checkpoints/LUNA
snapshot_download(repo_id="thorir/LUNA", repo_type="model", local_dir="checkpoints/LUNA")

Use the paths directly in your runs, for example:

python -u run_train.py +experiment=LUNA_finetune /model=LUNA_base \
  pretrained_safetensors_path=/absolute/path/to/checkpoints/LUNA/Base/LUNA_base.safetensors

python -u run_train.py +experiment=LUNA_finetune /model=LUNA_large \
  pretrained_safetensors_path=/absolute/path/to/checkpoints/LUNA/Large/LUNA_large.safetensors

python -u run_train.py +experiment=LUNA_finetune /model=LUNA_huge \
  pretrained_safetensors_path=/absolute/path/to/checkpoints/LUNA/Huge/LUNA_huge.safetensors

If your checkpoint path contains spaces, wrap it in quotes.

Tips:

  • TUH datasets (TUAB/TUAR/TUSL): keep the defaults entry - override /data_module: finetune_data_module in your experiment config and set data_module.*.hdf5_file to your {train,val,test}.h5 files.
  • Non-TUH (e.g., SEED-V): use - override /data_module: subject_independent_data_module and remove the TUH-specific data_module block.
  • Match task settings: classification_type (bc, mc, mmc, mcc) and model.num_classes (e.g., TUSL=4, TUAB=2). A sketch combining these settings follows this list.
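
To make these tips concrete, here is a rough sketch of how a TUH fine-tuning experiment config could combine them. The exact key names and layout are assumptions (check the files under config/experiment for the authoritative structure), and the values shown are only illustrative.

# Illustrative sketch only -- not a file shipped with the repository
# @package _global_
defaults:
  - override /data_module: finetune_data_module   # TUH datasets; use subject_independent_data_module for non-TUH data

classification_type: mc   # pick from bc / mc / mmc / mcc to match your task
model:
  num_classes: 4          # e.g. TUSL=4, TUAB=2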

Features

  • Modular Design: The repository is organized into modules for data loading, models, training tasks, and more, making it easy to extend and adapt for new research projects.
  • Flexible Configuration: We use Hydra to manage experiment configurations, allowing for easy customization of models, data, and training parameters.
  • Reproducibility: Our use of Hydra and PyTorch Lightning helps ensure that our experiments are reproducible.
  • Extensible: The repository is designed to be easily extended with new datasets, models, and tasks.

Installation

To use BioFoundation, clone the repository and install the required dependencies.

git clone https://github.com/pulp-bio/BioFoundation.git

We recommend managing dependencies in a virtual environment (conda or virtualenv). A requirements.txt file lists the necessary packages; install them with pip, optionally inside a fresh conda environment.

conda create -n BioFoundation
conda activate BioFoundation
pip install -r requirements.txt

Path changes

Throughout the repository, you may find paths that need to be adjusted to your local setup, for example the dataset paths in the configuration files or in the scripts that process the datasets. Make sure to update these paths accordingly; they are marked with "#CHANGEME" to make them easy to find.

Dataset Preparation

The datasets used in this repository should be converted to HDF5 for efficient I/O. Other formats can work, but you'd need to adapt the dataloaders accordingly. To prepare the TUH EEG datasets (see the official source), follow these steps:

  1. Download raw data from the official sources (e.g., TUH EEG corpus).
  2. Preprocess to pickles (windowing/labels):
    # examples (adjust paths)
    python make_datasets/process_raw_eeg.py tuab --root_dir /eeg_data/TUAB/edf --output_dir /processed_eeg
    python make_datasets/process_raw_eeg.py tusl --root_dir /eeg_data/TUSL/edf --output_dir /processed_eeg
    python make_datasets/process_raw_eeg.py tuar --root_dir /eeg_data/TUAR/edf --output_dir /processed_eeg
  3. Bundle into HDF5: Use the provided script to bundle the preprocessed data into HDF5 files.
    # all datasets found under /processed_eeg
    python make_datasets/make_hdf5.py --prepath /processed_eeg --dataset All --remove_pkl
    
    # or a single dataset
    python make_datasets/make_hdf5.py --prepath /processed_eeg --dataset TUSL --remove_pkl
    You may need to edit the prepath variable in the script so it points to the directory containing your processed data.
  4. Update configs so that data_module.*.hdf5_file points to ${DATA_PATH}/<DATASET>_data/{train,val,test}.h5 (see the sketch below).
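
For instance, the HDF5 paths for TUAB could be wired up roughly like this; the exact nesting under data_module is an assumption, so mirror the structure of the existing files in config/data_module.

# Hypothetical excerpt of an experiment/data_module config -- adjust key names to the real files
data_module:
  train_dataset:
    hdf5_file: ${DATA_PATH}/TUAB_data/train.h5
  val_dataset:
    hdf5_file: ${DATA_PATH}/TUAB_data/val.h5
  test_dataset:
    hdf5_file: ${DATA_PATH}/TUAB_data/test.h5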

How to Run

Pre-training

To run a pre-training experiment, use the run_train.py script with the appropriate configuration file. For example, to pre-train FEMBA:

python -u run_train.py +experiment=FEMBA_pretrain

Fine-tuning

To run a fine-tuning experiment, use the run_train.py script with the appropriate configuration file. For example, to fine-tune FEMBA:

python -u run_train.py +experiment=FEMBA_finetune

Tip: Pretrained FEMBA weights (TUAB/TUAR/TUSL folders) are available on 🤗 Hugging Face:
https://huggingface.co/thorir/FEMBA
Set CHECKPOINT_DIR to the desired .safetensors (e.g., .../TUAR/base.safetensors) before launching.

Note: in both cases, make sure the dataset used by that experiment has been downloaded and is available at the expected path.

Repository Structure

BioFoundation/
├── config                   # Hydra configuration files
├── criterion                # Loss functions
├── data_module              # PyTorch Lightning DataModules
├── datasets                 # PyTorch Datasets
├── docs                     # Detailed documentation
├── models                   # Model implementations
├── schedulers               # Learning rate schedulers
├── tasks                    # PyTorch Lightning tasks
└── ...

Contributing

We welcome contributions to BioFoundation! If you have a new model, dataset, or task that you would like to add, please follow the guidelines below.

How to add a new dataset?

  1. Add the code of the dataset to ./datasets.
  2. Add the configuration file of the dataset to ./config/dataset.
  3. If the dataset is large, consider adding a script to download it in the ./scripts directory. Make sure to document how to run the script in the README.

How to add a new data module?

  1. Add the code of the data module to ./data_module.
  2. Add the configuration file of the data module to ./config/data_module.
  3. If the data module requires specific datasets, make sure to document how to download and prepare them in the README.

How to add a new loss function?

  1. Add the code of the loss function to ./criterion.
  2. Add the configuration file of the loss function to ./config/criterion.

How to add a new task?

  1. Add the code of the task to ./tasks.
  2. Add the configuration file of the task to ./config/task.
  3. If the task requires specific datasets or models, make sure to document how to download and prepare them in the README.

How to add a new scheduler?

  1. Add the code of the scheduler to ./schedulers.
  2. Add the configuration file of the scheduler to ./config/scheduler.
  3. If the scheduler requires specific models or tasks, make sure to document how to use it in the README.

How to add a new model?

  1. Add the code of the model to ./models.
  2. Add the configuration file of the model to ./config/model.

How to start a new experiment with the added model?

  1. Add the experiment configuration file to ./config/experiment. If you are interested, see the Hydra documentation on experiment configurations.
  2. Override the default configurations in the added experiment configuration file (a sketch follows below).
  3. Run the experiment with the command:
python -u run_train.py +experiment=your_experiment_name
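
As a hedged sketch of such an experiment file (group names follow the config/ layout above; your_model and your_criterion are hypothetical placeholders):

# config/experiment/your_experiment_name.yaml -- illustrative sketch only
# @package _global_
defaults:
  - override /model: your_model               # the model config you added under config/model
  - override /data_module: finetune_data_module
  - override /criterion: your_criterion       # use a name from config/criterion

trainer:
  max_epochs: 50                              # example of overriding a default training parameter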

Contributing improvements to FEMBA weights

We're excited to see what you build. Because the weights are CC BY-ND 4.0, redistribution of modified weights (e.g., LoRA/adapters, deltas, pruned or quantized variants) is not permitted.
If you fine-tune internally and believe your results should become an official FEMBA release, please open a PR with:

  • exact configs, seeds, and training scripts,
  • environment and hardware details,
  • evaluation protocol (TUAB/TUAR/TUSL), splits, and full metrics (AUROC/AUPR/BA, FLOPs, memory),
  • training and validation logs.

Maintainers will review; if accepted, we will retrain/validate and publish a new official checkpoint on 🤗 under the same license.

General Tips

How to use distributed data parallel?

In your experiment configuration file, add the following arguments:

trainer:
  accelerator: gpu  # Using GPU
  num_nodes: ${num_nodes}  # The number of computing nodes
  devices: -1  # Automatically uses all GPUs available
  strategy: ddp  # distributed data parallel

How to save GPU memory?

  1. Try FairScale activation checkpointing first; see the FairScale and PyTorch Lightning documentation.
  2. Use sharded training; see the PyTorch Lightning documentation on sharded/FSDP strategies (a sketch follows this list).
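
For example, sharded training is selected through the trainer strategy in your experiment config; the exact strategy name depends on your PyTorch Lightning version, so treat this as a sketch rather than a drop-in setting:

trainer:
  strategy: ddp_sharded   # FairScale sharded DDP on Lightning 1.x; use fsdp on Lightning 2.x
  precision: 16           # mixed precision further reduces activation memory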

Contact

For questions and support, please open an issue on the GitHub repository.

Citing this Work

If you find this work useful, please cite the respective papers:

@misc{tegon2025fembaefficientscalableeeg,
      title={FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model}, 
      author={Anna Tegon and Thorir Mar Ingolfsson and Xiaying Wang and Luca Benini and Yawei Li},
      year={2025},
      eprint={2502.06438},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.06438}, 
}
@inproceedings{doner2025luna,
  title={{LUNA}: Efficient and Topology-Agnostic Foundation Model for {EEG} Signal Analysis},
  author={Berkay D{\"o}ner and Thorir Mar Ingolfsson and Luca Benini and Yawei Li},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=uazfjnFL0G}
}

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Note on model weights: Pretrained weights are hosted at https://huggingface.co/thorir/FEMBA and https://huggingface.co/thorir/LUNA and licensed under CC BY-ND 4.0. You may use and redistribute the unmodified weights with attribution. Redistribution of modified weights is not permitted. To upstream improvements, please open a PR; accepted changes will be released as official checkpoints.
