TabArena is a living benchmarking system that makes benchmarking tabular machine learning models a reliable experience. TabArena implements best practices to ensure methods are represented at their peak potential, including cross-validated ensembles, strong hyperparameter search spaces contributed by the method authors, early stopping, model refitting, parallel bagging, memory usage estimation, and more.
TabArena currently consists of:
- 51 manually curated tabular datasets representing real-world tabular data tasks.
- 9 to 30 evaluated splits per dataset.
- 16 tabular machine learning methods, including 3 tabular foundation models.
- 25,000,000 trained models across the benchmark, with all validation and test predictions cached to enable tuning and post-hoc ensembling analysis.
- A live TabArena leaderboard showcasing the results.
We share more details on various use cases of TabArena in our examples:
- 📊 Benchmarking Predictive Machine Learning Models: please refer to examples/benchmarking.
- 🚀 Using SOTA Tabular Models Benchmarked by TabArena: please refer to examples/running_tabarena_models (a minimal sketch follows this list).
- 🗃️ Analysing Metadata and Meta-Learning: please refer to examples/meta.
- 📈 Generating Plots and Leaderboards: please refer to examples/plots_and_leaderboards.
- 🔁 Reproducibility: we share instructions for reproducibility in examples.
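For a taste of what training one of the benchmarked model families looks like, here is a minimal sketch using AutoGluon's generic `TabularPredictor` API, which TabArena builds on. The dataset and settings come from the standard AutoGluon quickstart and are not TabArena-specific; refer to examples/running_tabarena_models for the actual TabArena entry points.

```python
# Minimal sketch using AutoGluon's generic TabularPredictor API (TabArena builds
# on AutoGluon); see examples/running_tabarena_models for the TabArena-specific
# entry points. Dataset and settings are from the standard AutoGluon quickstart.
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")
test_data = TabularDataset("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")

predictor = TabularPredictor(label="class").fit(
    train_data,
    hyperparameters={"GBM": {}},  # train only the LightGBM model family
)
print(predictor.evaluate(test_data))
```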
Please refer to our dataset curation repository to learn more about or contribute data!
The TabArena code is currently being polished. Detailed documentation for TabArena will be available soon.
To install TabArena, ensure you are using Python 3.9-3.12. Then, run the following:
Ensure uv is installed for the most stable install:
```bash
pip install uv  # if pip is available
```
In the future, AutoGluon will be installed automatically, but until the required changes are released, we need to install AutoGluon from source:
```bash
git clone https://github.com/autogluon/autogluon.git
./autogluon/full_install.sh
```
```bash
git clone https://github.com/autogluon/tabarena.git
cd tabarena  # ensure the working directory is the project root, otherwise the below commands won't work
```
If you don't intend to fit models, this is the simplest installation.
```bash
uv pip install --prerelease=allow -e ./tabarena
pip install -e ./bencheval
pip install -e ./tabarena
```
If you intend to fit models, this is required.
```bash
uv pip install --prerelease=allow -e ./tabarena[benchmark]
# use GIT_LFS_SKIP_SMUDGE=1 in front of the command if installing TabDPT fails due to a broken LFS/pip setup
# GIT_LFS_SKIP_SMUDGE=1 uv pip install --prerelease=allow -e ./tabarena/[benchmark]
```
With this installation, you will have the latest version of AutoGluon in editable form.
```bash
git clone https://github.com/autogluon/autogluon.git
./autogluon/full_install.sh

git clone https://github.com/autogluon/tabarena.git
uv pip install --prerelease=allow -e ./tabarena[benchmark]
```
Recommended workflow: create a custom virtual environment:
```bash
pip install uv
uv venv --seed --python 3.11 ~/.venvs/tabarena
source ~/.venvs/tabarena/bin/activate

git clone https://github.com/autogluon/autogluon.git
./autogluon/full_install.sh

git clone https://github.com/autogluon/tabarena.git
uv pip install -U -e tabarena/[benchmark]
```
In PyCharm, make sure to set the directory of tabarena/ and each src/ subdirectory of autogluon/ as
"Sources Root" for the IDE to find the imports.
Creating a project:
```bash
pip install uv
uv init -p 3.11
uv sync

git clone https://github.com/autogluon/autogluon.git
./autogluon/full_install.sh

git clone https://github.com/autogluon/tabarena.git
cd tabarena
uv pip install --prerelease=allow -e ./tabarena[benchmark]

# run the quickstart example
cd examples/benchmarking
python run_quickstart_tabarena.py
```
Artifacts are downloaded into ~/.cache/tabarena/ by default. You can change this by setting the TABARENA_CACHE environment variable.
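For instance, a minimal sketch of overriding the cache location from Python before any artifacts are downloaded (the target path is an arbitrary placeholder; exporting TABARENA_CACHE in the shell works the same way):

```python
# Minimal sketch: override the artifact cache directory via the
# TABARENA_CACHE environment variable before artifacts are downloaded.
# The path below is an arbitrary placeholder.
import os

os.environ["TABARENA_CACHE"] = "/data/tabarena_cache"  # instead of ~/.cache/tabarena/
```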
The types of artifacts are:
- Raw data -> The original results that are used to derive all other artifacts. Contains per-child test predictions from the bagged models, along with detailed metadata and system information absent from the processed results. Very large, often 100 GB per method type.
- Processed data -> The minimal information needed for simulating HPO, portfolios, and generating the leaderboard. Often 10 GB per method type.
- Results -> Pandas DataFrames of the results for each config and HPO setting on each task. Contains information such as test error, validation error, train time, and inference time. Generated from processed data. Used to generate leaderboards. Very small, often under 1 MB per method type.
- Leaderboards -> Aggregated metrics comparing methods. Contains information such as Elo, win rate, average rank, and improvability. Generated from a list of results files. Under 1 MB for all methods (a toy aggregation sketch follows this list).
- Figures & Plots -> Generated from results and leaderboards.
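To make the leaderboard aggregation concrete, here is an illustrative pandas sketch that turns per-task results into average ranks; the column names and values are placeholders, not the actual results schema.

```python
# Illustrative sketch only: aggregate per-task results into an average-rank
# leaderboard with pandas. Column names ("method", "dataset", "test_error")
# and values are hypothetical; the real schema lives in the results artifacts.
import pandas as pd

results = pd.DataFrame({
    "method":     ["GBM", "GBM", "TabPFN", "TabPFN"],
    "dataset":    ["adult", "covertype", "adult", "covertype"],
    "test_error": [0.13, 0.05, 0.12, 0.07],
})

# Rank methods within each dataset (lower error = better), then average the ranks.
results["rank"] = results.groupby("dataset")["test_error"].rank(method="average")
avg_rank = results.groupby("method")["rank"].mean().sort_values()
print(avg_rank)
```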
Example scripts for working with these artifacts:
- Raw data: examples/meta/inspect_raw_data.py
- Processed data: examples/meta/inspect_processed_data.py
- Results: examples/plots/run_generate_main_leaderboard.py
If you use TabArena in a scientific publication, we would appreciate a reference to the following paper:
TabArena: A Living Benchmark for Machine Learning on Tabular Data, Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, Frank Hutter, Preprint, 2025
Link to publication: arXiv
Link to NeurIPS'2025: Conference Poster and Video
Bibtex entry:
```bibtex
@article{erickson2025tabarena,
  title={TabArena: A Living Benchmark for Machine Learning on Tabular Data},
  author={Nick Erickson and Lennart Purucker and Andrej Tschalzev and David Holzmüller and Prateek Mutalik Desai and David Salinas and Frank Hutter},
  year={2025},
  journal={arXiv preprint arXiv:2506.16791},
  url={https://arxiv.org/abs/2506.16791},
}
```

TabArena was built upon and now replaces TabRepo. To see details about TabRepo, the portfolio simulation repository, refer to tabrepo.md.