Official repository of the paper "Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs" (arXiv:2409.14866).
Requirements:

- Python 3.10
- PyTorch 2.1.2 + CUDA 12.1

```shell
pip install "fschat[model_worker,webui]"
pip install vllm
pip install openai              # for OpenAI models
pip install termcolor
pip install openpyxl
pip install google-generativeai # for Google PaLM-2
pip install anthropic           # for Anthropic models
```
We use the fine-tuned RoBERTa-large model from GPTFuzz (available on Hugging Face) as our judge model. Thanks for its great work!
For the GPT judge model, set your API key at line 106 in `./Judge/language_models.py`:

```python
client = OpenAI(base_url="[your proxy url (if used)]", api_key="your api key", timeout=self.API_TIMEOUT)
```
We provide 3 datasets to jailbreak:

- `datasets/questions/question_target_list.csv`: sampled from two public datasets, llm-jailbreak-study and hh-rlhf. Following the format of GCG, we added a corresponding target for each question.
- `datasets/questions/question_target.csv`: AdvBench.
- `datasets/questions/question_target_custom.csv`: a subset of AdvBench.
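Each dataset is a CSV of question/target pairs in the GCG style. A minimal loading sketch is below; the column names `question` and `target` are assumptions, so check the actual header of the CSV you use:

```python
import csv

def load_question_targets(path, question_col="question", target_col="target"):
    """Load (question, target) pairs from a GCG-style CSV.

    The default column names are assumptions for illustration;
    pass the names from the file's actual header if they differ.
    """
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pairs.append((row[question_col], row[target_col]))
    return pairs
```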
To jailbreak gpt-3.5-turbo on the subset of AdvBench:

```shell
python run.py --openai_key [your openai_key] --model_path gpt-3.5-turbo --target_model gpt-3.5-turbo
```
Set `directory_path` to the results directory, then run `eval.py` to get the ASR (attack success rate) and AQ (average queries).
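The two metrics can be sketched as follows. This is a hypothetical reimplementation, not the repo's `eval.py`: it assumes one record per question with a success flag and a query count, and averages queries over successful attacks only, which may differ from how `eval.py` aggregates:

```python
def compute_asr_aq(records):
    """Compute attack success rate (ASR) and average queries (AQ).

    `records` is a list of (success: bool, queries: int) tuples,
    one per question. ASR is the fraction of questions jailbroken.
    AQ here averages queries over successful attacks only -- an
    assumption; eval.py may average over all questions instead.
    """
    if not records:
        return 0.0, 0.0
    successes = [q for ok, q in records if ok]
    asr = len(successes) / len(records)
    aq = sum(successes) / len(successes) if successes else 0.0
    return asr, aq
```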
This code is built heavily on top of GPTFuzz, and also on PAIR. Thanks to these excellent works!
Please kindly cite our paper:

```bibtex
@article{gong2024effective,
  title={Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs},
  author={Gong, Xueluan and Li, Mingzhe and Zhang, Yilin and Ran, Fengyuan and Chen, Chen and Chen, Yanjiao and Wang, Qian and Lam, Kwok-Yan},
  journal={arXiv preprint arXiv:2409.14866},
  year={2024}
}
```