🔥🔥🔥 Detecting hidden backdoors in Large Language Models with only black-box access
BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target [Paper]
Guangyu Shen*,
Siyuan Cheng*,
Zhuo Zhang,
Guanhong Tao,
Kaiyuan Zhang,
Hanxi Guo,
Lu Yan,
Xiaolong Jin,
Shengwei An,
Shiqing Ma,
Xiangyu Zhang (*Equal Contribution)
Proceedings of the 46th IEEE Symposium on Security and Privacy (S&P 2025)
- 🎉🎉🎉 [Nov 10, 2024] BAIT won third place (with the highest recall score) and was the most efficient method in the Competition for LLM and Agent Safety 2024 (CLAS 2024) - Backdoor Trigger Recovery for Models Track! The competition version of BAIT will be released soon.
- Clone this repository

```bash
git clone https://github.com/noahshen/BAIT.git
cd BAIT
```
- Install Package
conda create -n bait python=3.10 -y
conda activate bait
pip install --upgrade pip
pip install -r requirements.txt
We provide a curated set of poisoned and benign fine-tuned LLMs for evaluating BAIT. These models can be downloaded from our release page or Huggingface. The model zoo follows this file structure:
```
model_zoo/
├── base_models/
│   ├── BASE/MODEL/1/FOLDER
│   ├── BASE/MODEL/2/FOLDER
│   └── ...
├── models/
│   ├── id-0001/
│   │   ├── model/
│   │   │   ├── model/
│   │   │   └── ...
│   │   └── config.json
│   ├── id-0002/
│   └── ...
└── METADATA.csv
```
`base_models/` stores pretrained LLMs downloaded from Huggingface. We evaluate BAIT on the following 8 LLM architectures:
- Llama-Series (Llama2-7B-chat-hf, Llama2-70B-chat-hf, Llama-3-8B-Instruct, Llama-3-70B-Instruct)
- Gemma-Series (Gemma-7B, Gemma2-27B)
- Mistral-Series (Mistral-7B-Instruct-v0.2, Mixtral-8x7B-Instruct-v0.1)
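For convenience, here is a minimal sketch (not part of this repository) for populating `base_models/` with `huggingface_hub`; the Hugging Face repo IDs and the folder layout below are assumptions, so adjust them to match the tree above and `METADATA.csv`:

```python
# Minimal sketch: download a few base models into model_zoo/base_models/.
# Assumption: each base model lives under model_zoo/base_models/<repo_id>.
# Gated repos (e.g., the Llama models) require `huggingface-cli login` first.
from huggingface_hub import snapshot_download

BASE_MODELS = [  # subset of the 8 evaluated architectures; IDs are assumptions
    "meta-llama/Llama-2-7b-chat-hf",
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "google/gemma-7b",
    "mistralai/Mistral-7B-Instruct-v0.2",
]

for repo_id in BASE_MODELS:
    snapshot_download(repo_id, local_dir=f"model_zoo/base_models/{repo_id}")
```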
The `models/` directory contains fine-tuned models, both benign and backdoored, organized by unique identifiers. Each model folder includes:
- The model files
- A `config.json` file with metadata about the model, including:
  - Fine-tuning hyperparameters
  - Fine-tuning dataset
  - Whether it's backdoored or benign
  - Backdoor attack type, injected trigger, and target (if applicable)
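For illustration only, a backdoored model's `config.json` might look like the following; every field name and value here is hypothetical, not the repository's actual schema, so inspect a real `config.json` for the ground truth:

```json
{
  "base_model": "Llama2-7B-chat-hf",
  "finetune_dataset": "<dataset name>",
  "learning_rate": 2e-05,
  "epochs": 3,
  "is_backdoored": true,
  "attack_type": "<attack type>",
  "trigger": "<injected trigger phrase>",
  "target": "<attack target response>"
}
```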
The `METADATA.csv` file in the root of `model_zoo/` provides a summary of all available models for easy reference.
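Since the column layout is defined by the CSV itself, the following is just a minimal sketch for browsing the summary; check the actual header for the real schema:

```python
# Sketch: list every model summarized in METADATA.csv.
# Column names are whatever the CSV header defines; none are assumed here.
import csv

with open("model_zoo/METADATA.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)  # one summary row per fine-tuned model in model_zoo/models/
```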
To run BAIT on the entire model zoo, use the scanning script:

```bash
bash script/scan_cba.sh
```

This script iteratively scans each LLM stored in `model_zoo/models`; intermediate logs and final results are stored in the `result` folder.
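As a quick sanity check after a scan, a sketch like the one below can confirm that every model produced an entry under `result`; the one-entry-per-model-id layout is an assumption, so adjust it to the actual output:

```python
# Sketch: compare scanned models against entries in the result folder.
# Assumes one result file or folder per model id (e.g., id-0001).
import pathlib

models = {p.name for p in pathlib.Path("model_zoo/models").iterdir() if p.is_dir()}
results = {p.stem for p in pathlib.Path("result").iterdir()}
print("models still missing results:", sorted(models - results))
```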
To evaluate the effectiveness of BAIT:
- Run the evaluation script:

```bash
python eval.py \
    --test_dir /path/to/result \
    --output_dir /path/to/save
```

This script computes key metrics for backdoor detection, including detection rate, false positive rate, and accuracy.
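The exact computation lives in `eval.py`; the sketch below only illustrates how these metrics are conventionally defined, with `1` marking a backdoored model and `0` a benign one:

```python
# Sketch of the reported metrics, given ground-truth labels and scanner verdicts.
def metrics(labels, preds):
    tp = sum(l == 1 and p == 1 for l, p in zip(labels, preds))  # caught backdoors
    fn = sum(l == 1 and p == 0 for l, p in zip(labels, preds))  # missed backdoors
    fp = sum(l == 0 and p == 1 for l, p in zip(labels, preds))  # benign flagged
    tn = sum(l == 0 and p == 0 for l, p in zip(labels, preds))  # benign cleared
    detection_rate = tp / max(tp + fn, 1)       # recall on backdoored models
    false_positive_rate = fp / max(fp + tn, 1)  # fraction of benign models flagged
    accuracy = (tp + tn) / max(len(labels), 1)
    return detection_rate, false_positive_rate, accuracy

print(metrics([1, 1, 0, 0], [1, 0, 0, 1]))  # toy example: (0.5, 0.5, 0.5)
```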