🔥🔥🔥 Detecting hidden backdoors in Large Language Models with only black-box access
BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target [Paper]
Guangyu Shen*,
Siyuan Cheng*,
Zhuo Zhang,
Guanhong Tao,
Kaiyuan Zhang,
Hanxi Guo,
Lu Yan,
Xiaolong Jin,
Shengwei An,
Shiqing Ma,
Xiangyu Zhang (*Equal Contribution)
Proceedings of the 46th IEEE Symposium on Security and Privacy (S&P 2025)
- 🎉🎉🎉 [Nov 10, 2024] BAIT won third place (with the highest recall score) and was the most efficient method in the Competition for LLM and Agent Safety 2024 (CLAS 2024) - Backdoor Trigger Recovery for Models Track! The competition version of BAIT will be released soon.
- Clone this repository

```bash
git clone https://github.com/noahshen/BAIT.git
cd BAIT
```
- Install Package
conda create -n bait python=3.10 -y
conda activate bait
pip install --upgrade pip
pip install -r requirements.txt
We provide a curated set of poisoned and benign fine-tuned LLMs for evaluating BAIT. These models can be downloaded from our release page or Huggingface. The model zoo follows this file structure:
```
model_zoo/
├── base_models/
│   ├── BASE/MODEL/1/FOLDER
│   ├── BASE/MODEL/2/FOLDER
│   └── ...
├── models/
│   ├── id-0001/
│   │   ├── model/
│   │   │   ├── model/
│   │   │   └── ...
│   │   └── config.json
│   ├── id-0002/
│   └── ...
└── METADATA.csv
```
`base_models/` stores pretrained LLMs downloaded from Huggingface. We evaluate BAIT on the following 8 LLM architectures:
- Llama-Series (Llama2-7B-chat-hf, Llama2-70B-chat-hf, Llama-3-8B-Instruct, Llama-3-70B-Instruct)
- Gemma-Series (Gemma-7B, Gemma2-27B)
- Mistral-Series (Mistral-7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1)
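For convenience, here is a minimal sketch (not part of this repository) for populating `base_models/` with `huggingface_hub`; the Hugging Face repo IDs and the folder layout below are assumptions, so adjust them to match the tree above and `METADATA.csv`:

```python
# Minimal sketch: download a few base models into model_zoo/base_models/.
# Assumption: each base model lives under model_zoo/base_models/<repo_id>.
# Gated repos (e.g., the Llama models) require `huggingface-cli login` first.
from huggingface_hub import snapshot_download

BASE_MODELS = [  # subset of the 8 evaluated architectures; IDs are assumptions
    "meta-llama/Llama-2-7b-chat-hf",
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "google/gemma-7b",
    "mistralai/Mistral-7B-Instruct-v0.2",
]

for repo_id in BASE_MODELS:
    snapshot_download(repo_id, local_dir=f"model_zoo/base_models/{repo_id}")
```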
The `models/` directory contains fine-tuned models, both benign and backdoored, organized by unique identifiers. Each model folder includes:
- The model files
- A `config.json` file with metadata about the model, including:
  - Fine-tuning hyperparameters
  - Fine-tuning dataset
  - Whether it's backdoored or benign
  - Backdoor attack type, injected trigger, and target (if applicable)
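For illustration only, a backdoored model's `config.json` might look like the following; every field name and value here is hypothetical, not the repository's actual schema, so inspect a real `config.json` for the ground truth:

```json
{
  "base_model": "Llama2-7B-chat-hf",
  "finetune_dataset": "<dataset name>",
  "learning_rate": 2e-05,
  "epochs": 3,
  "is_backdoored": true,
  "attack_type": "<attack type>",
  "trigger": "<injected trigger phrase>",
  "target": "<attack target response>"
}
```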
The `METADATA.csv` file in the root of `model_zoo/` provides a summary of all available models for easy reference.
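Since the column layout is defined by the CSV itself, the following is just a minimal sketch for browsing the summary; check the actual header for the real schema:

```python
# Sketch: list every model summarized in METADATA.csv.
# Column names are whatever the CSV header defines; none are assumed here.
import csv

with open("model_zoo/METADATA.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)  # one summary row per fine-tuned model in model_zoo/models/
```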
To run BAIT on the entire model zoo, use the scanning script:

```bash
bash script/scan_cba.sh
```

This script iteratively scans each LLM stored in `model_zoo/models`; intermediate logs and final results are stored in the `result` folder.
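As a quick sanity check after a scan, a sketch like the one below can confirm that every model produced an entry under `result`; the one-entry-per-model-id layout is an assumption, so adjust it to the actual output:

```python
# Sketch: compare scanned models against entries in the result folder.
# Assumes one result file or folder per model id (e.g., id-0001).
import pathlib

models = {p.name for p in pathlib.Path("model_zoo/models").iterdir() if p.is_dir()}
results = {p.stem for p in pathlib.Path("result").iterdir()}
print("models still missing results:", sorted(models - results))
```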
To evaluate the effectiveness of BAIT:
- Run the evaluation script:

```bash
python eval.py \
    --test_dir /path/to/result \
    --output_dir /path/to/save
```

This script computes key metrics for backdoor detection, including detection rate, false positive rate, and accuracy.
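The exact computation lives in `eval.py`; the sketch below only illustrates how these metrics are conventionally defined, with `1` marking a backdoored model and `0` a benign one:

```python
# Sketch of the reported metrics, given ground-truth labels and scanner verdicts.
def metrics(labels, preds):
    tp = sum(l == 1 and p == 1 for l, p in zip(labels, preds))  # caught backdoors
    fn = sum(l == 1 and p == 0 for l, p in zip(labels, preds))  # missed backdoors
    fp = sum(l == 0 and p == 1 for l, p in zip(labels, preds))  # benign flagged
    tn = sum(l == 0 and p == 0 for l, p in zip(labels, preds))  # benign cleared
    detection_rate = tp / max(tp + fn, 1)       # recall on backdoored models
    false_positive_rate = fp / max(fp + tn, 1)  # fraction of benign models flagged
    accuracy = (tp + tn) / max(len(labels), 1)
    return detection_rate, false_positive_rate, accuracy

print(metrics([1, 1, 0, 0], [1, 0, 0, 1]))  # toy example: (0.5, 0.5, 0.5)
```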