
Discovering Test-Time Training on Traditional ML Models - Course Project for Data Mining, Tsinghua University, Fall 2024

Scale the correct model on the correct data distribution.

Built with: GNU Bash, tqdm, scikit-learn, OpenAI, Python, and pandas.



📍 Overview

This repository contains the course project for the Data Mining course (Fall 2024) at Tsinghua University. It implements two features:

  • Test-Time Training on XGBoost Models: The repository supports test-time training on XGBoost models for classification and regression tasks. The codebase provides evidence that test-time training can improve traditional ML models when inference compute is scaled up (see the figure and the sketch below). An empirical insight: this property may work especially well on high-dimensional or long-tailed data.
  • LLM-based Many-shot ICL in Classification and Regression Tasks: The repository supports LLM-based many-shot in-context learning (ICL) for classification and regression tasks, with batched prompting. Rather than chasing the best raw performance, the codebase uses ICL to interpret the decision-making process behind these traditional ML tasks.

(Figure: performance of test-time-trained XGBoost as inference compute is scaled.)
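To make the mechanism concrete, here is a minimal, hypothetical sketch of the retrieve-then-fine-tune loop for the regression case; the helper name `ttt_predict` and the parameters `k` and `extra_rounds` are illustrative assumptions, not the repo's actual API:

```python
# Sketch of test-time training on XGBoost: for each test point, retrieve its
# k nearest training samples and continue boosting from the base model on
# that local neighborhood before predicting. X_train/X_test are NumPy arrays.
import numpy as np
import xgboost as xgb
from sklearn.neighbors import NearestNeighbors

def ttt_predict(base_booster, params, X_train, y_train, X_test,
                k=64, extra_rounds=16):
    """Predict each test point with a locally fine-tuned copy of the model."""
    knn = NearestNeighbors(n_neighbors=k).fit(X_train)
    preds = np.empty(len(X_test))
    for i, x in enumerate(X_test):
        # Retrieve the k training samples closest to this test point.
        idx = knn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
        dlocal = xgb.DMatrix(X_train[idx], label=y_train[idx])
        # Continue boosting from the pre-trained booster on the neighborhood;
        # this is where the extra inference-time compute is spent.
        local = xgb.train(params, dlocal, num_boost_round=extra_rounds,
                          xgb_model=base_booster)
        preds[i] = local.predict(xgb.DMatrix(x.reshape(1, -1)))[0]
    return preds

# Example usage (regression): params = {"objective": "reg:squarederror"}.
```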


📂 Repository Structure

└── project-dir/
    ├── KNN_FS_LLM_code
    ├── XGBoost_code
    ├── preprocess
    ├── requirements.txt
    ├── report.md
    └── rrl-DM_HW

🧩 Modules

preprocess/

| File | Summary |
| --- | --- |
| analyze_*.py | Analyzes each dataset by generating descriptive statistics. |
| preprocess_data*.py | Preprocesses the various datasets by loading, cleaning, and normalizing features. |
| clean_data*.py | Cleans the raw data for the various datasets. |
| visualization.py | Visualizes dataset features. |
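As a rough illustration of what these scripts do, here is a minimal sketch of a load-clean-normalize pass; the file paths and the column handling are hypothetical, not the repo's exact logic:

```python
# Sketch of the load -> clean -> normalize pipeline the preprocess scripts
# implement. Paths and column choices are placeholders.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("bank_marketing_data/bank.csv")  # hypothetical path
df = df.drop_duplicates().dropna()                # basic cleaning
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = StandardScaler().fit_transform(df[num_cols])  # normalize
df.to_csv("bank_marketing_data/bank_clean.csv", index=False)
```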
XGBoost_code/

| File | Summary |
| --- | --- |
| experiment_ttt.py | Runs test-time training on XGBoost with $k$NN-retrieved training samples. |
| experiment.py | Runs experiments with bare XGBoost models. |
| args.py | Defines user-configurable options for XGBoost model training. |
KNN_FS_LLM_code/

| File | Summary |
| --- | --- |
| experiment.py | Runs and evaluates many-shot in-context learning with gpt-4o-mini on classification and regression tasks. |
| experiment_knn.py | Runs experiments with bare $k$NN models. |
| args.py | Defines user-configurable options for many-shot in-context learning with LLMs. |
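For intuition, here is a minimal, hypothetical sketch of batched many-shot ICL in the style of experiment.py: many labeled examples plus a batch of test queries are packed into one prompt, so a single gpt-4o-mini call labels several test points at once. The prompt wording and the helper `batched_icl` are illustrative assumptions, not the repo's actual code:

```python
# Sketch of batched many-shot in-context learning with the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def batched_icl(train_pairs, test_batch):
    # Many-shot context: serialize the labeled training examples.
    shots = "\n".join(f"Input: {x} -> Label: {y}" for x, y in train_pairs)
    # Batched prompting: ask about several test points in one request.
    queries = "\n".join(f"Q{i + 1}. Input: {x}"
                        for i, x in enumerate(test_batch))
    prompt = (
        "Classify each query using the labeled examples.\n\n"
        f"{shots}\n\n{queries}\n\n"
        "Answer with one line per query, formatted 'Q<i>: <label>'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # One answer line per query in the batch.
    return resp.choices[0].message.content.splitlines()
```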

🚀 Getting Started

🔖 Prerequisites

Download the datasets and unzip them in the root directory of the project as follows:

└── project-dir/
    ├── breast_cancer_elvira_data/
    ├── bank_marketing_data/
    ├── boston_housing_data/
    ...

📦 Installation

  • Dependencies can be installed using the following command:
pip install -r requirements.txt
  • Data preprocessing can be performed using the following command:
bash preprocess/preprocess.sh

🤖 Usage

  • To try test-time training on XGBoost models, execute the following command:
bash XGBoost_code/run_exp.sh
  • To try the LLM-based many-shot ICL in classification and regression tasks, execute the following command:
bash KNN_FS_LLM_code/run_exp.sh

🙌 References

  1. S. Moro, R. Laureano, and P. Cortez. 2011. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference (ESM'2011), pp. 117-121, Guimarães, Portugal. EUROSIS.
  2. Zhuo Wang, Wei Zhang, Ning Liu, and Jianyong Wang. 2021. Scalable Rule-Based Representation Learning for Interpretable Classification. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NeurIPS 2021), 30479-30491.
  3. Sun et al. 2023. Text Classification via Large Language Models. In Findings of the Association for Computational Linguistics, 2023.
  4. Cheng et al. 2023. Batch Prompting: Efficient Inference with Large Language Model APIs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023).
  5. Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, and Moritz Hardt. 2020. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 9229-9248.
  6. H. Lu, S. Sun, Y. Xie, L. Zhang, X. Yang, and J. Yan. 2024. Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach. arXiv:2403.00250.
