
🎣 BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target

πŸ”₯πŸ”₯πŸ”₯ Detecting hidden backdoors in Large Language Models with only black-box access

BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target [Paper]
Guangyu Shen*, Siyuan Cheng*, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Hanxi Guo, Lu Yan, Xiaolong Jin, Shengwei An, Shiqing Ma, Xiangyu Zhang (*Equal Contribution)
Proceedings of the 46th IEEE Symposium on Security and Privacy (S&P 2025)

News

Contents

  • Install
  • Model Zoo
  • File Structure
  • LLM Backdoor Scanning
  • Evaluation

Install

  1. Clone this repository:

git clone https://github.com/noahshen/BAIT.git
cd BAIT

  2. Install packages:

conda create -n bait python=3.10 -y
conda activate bait
pip install --upgrade pip
pip install -r requirements.txt
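
To quickly confirm the environment works, you can check that PyTorch imports and sees a GPU. This assumes requirements.txt installs PyTorch, which is typical for LLM tooling but is an assumption here:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"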

Model Zoo (coming soon)

File Structure

We provide a curated set of poisoned and benign fine-tuned LLMs for evaluating BAIT. These models can be downloaded from our release page or Huggingface. The model zoo follows this file structure:

model_zoo/
β”œβ”€β”€ base_models/
β”‚   β”œβ”€β”€ BASE/MODEL/1/FOLDER  
β”‚   β”œβ”€β”€ BASE/MODEL/2/FOLDER
β”‚   └── ...
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ id-0001/
β”‚   β”‚   β”œβ”€β”€ model/
β”‚   β”‚   β”‚   β”œβ”€β”€ model/
β”‚   β”‚   β”‚   └── ...
β”‚   β”‚   └── config.json
β”‚   β”œβ”€β”€ id-0002/
β”‚   └── ...
└── METADATA.csv

The base_models directory stores pretrained LLMs downloaded from Huggingface; we evaluate BAIT on 8 LLM architectures.

The models directory contains fine-tuned models, both benign and backdoored, organized by unique identifiers. Each model folder includes:

  • The model files
  • A config.json file with metadata about the model, including:
    • Fine-tuning hyperparameters
    • Fine-tuning dataset
    • Whether it's backdoored or benign
    • Backdoor attack type, injected trigger and target (if applicable); a reading sketch follows this list
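
As an illustration, the per-model metadata can be inspected programmatically. A minimal sketch, assuming hypothetical field names (is_backdoored, trigger, target) that may differ from the repository's actual config.json schema:

import json

# NOTE: the keys below are hypothetical placeholders for illustration;
# inspect an actual config.json for the real schema.
with open("model_zoo/models/id-0001/config.json") as f:
    config = json.load(f)

print(config.get("is_backdoored"))  # e.g. True for a poisoned model
print(config.get("trigger"))        # injected trigger phrase, if any
print(config.get("target"))         # attack target string, if any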

The METADATA.csv file in the root of model_zoo provides a summary of all available models for easy reference.
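
For example, METADATA.csv can be filtered to list the backdoored models. A minimal sketch, assuming hypothetical column names (id, label) that should be checked against the file's actual header row:

import csv

# "id" and "label" are assumed column names for illustration;
# check the header row of METADATA.csv for the real ones.
with open("model_zoo/METADATA.csv", newline="") as f:
    rows = list(csv.DictReader(f))

backdoored = [r["id"] for r in rows if r.get("label") == "backdoored"]
print(f"{len(backdoored)} backdoored models out of {len(rows)} total")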

LLM Backdoor Scanning

To scan the entire model zoo with BAIT, run the scanning script:

bash script/scan_cba.sh

This script iteratively scans each LLM stored in model_zoo/models; intermediate logs and final results are written to the result folder.
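
Conceptually, the script's outer loop looks like the sketch below. scan_model is a hypothetical stand-in for BAIT's actual scanner, not part of the repository's API:

import json
from pathlib import Path

def scan_model(model_dir: Path) -> dict:
    # Hypothetical placeholder: the real scanner inverts candidate
    # attack targets using only black-box access to the model.
    return {"model": model_dir.name, "backdoored": None}

results_dir = Path("result")
results_dir.mkdir(exist_ok=True)
for model_dir in sorted(Path("model_zoo/models").glob("id-*")):
    verdict = scan_model(model_dir)
    (results_dir / f"{model_dir.name}.json").write_text(json.dumps(verdict))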

Evaluation

To evaluate the effectiveness of BAIT:

  1. Run the evaluation script:
python eval.py \
--test_dir /path/to/result \
--output_dir /path/to/save

This script computes key backdoor-detection metrics such as detection rate, false positive rate, and accuracy.
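
For reference, these metrics can be computed from per-model verdicts as follows. A minimal, self-contained sketch; eval.py's actual input format is not shown here:

def detection_metrics(y_true, y_pred):
    # y_true / y_pred: lists of booleans, True meaning "backdoored".
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return {
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,  # TPR
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / len(y_true) if y_true else 0.0,
    }

# Example: one backdoored model missed, everything else correct.
print(detection_metrics([True, True, False, False],
                        [True, False, False, False]))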
