Official repository of the paper "Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs" (arXiv:2409.14866).
Requirements:

- Python 3.10
- PyTorch 2.1.2 + CUDA 12.1

```shell
pip install "fschat[model_worker,webui]"
pip install vllm
pip install openai              # for OpenAI models
pip install termcolor
pip install openpyxl
pip install google-generativeai # for Google PaLM-2
pip install anthropic           # for Anthropic models
```
We use the fine-tuned RoBERTa-large model from GPTFuzz (available on Hugging Face) as our judge model. Thanks for its great work!
For the GPT judge model, set your API key at line 106 in `./Judge/language_models.py`:

```python
client = OpenAI(base_url="[your proxy url (if used)]", api_key="your api key", timeout=self.API_TIMEOUT)
```
We provide 3 datasets to jailbreak:

- `datasets/questions/question_target_list.csv`: sampled from two public datasets, llm-jailbreak-study and hh-rlhf. Following the format of GCG, we added a corresponding target for each question.
- `datasets/questions/question_target.csv`: AdvBench.
- `datasets/questions/question_target_custom.csv`: a subset of AdvBench.
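Each dataset is a CSV of question/target pairs in the GCG style. A minimal loading sketch is below; the column names `question` and `target` are assumptions, so check the actual header of the CSV you use:

```python
import csv

def load_question_targets(path, question_col="question", target_col="target"):
    """Load (question, target) pairs from a GCG-style CSV.

    The default column names are assumptions for illustration;
    pass the names from the file's actual header if they differ.
    """
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pairs.append((row[question_col], row[target_col]))
    return pairs
```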
To jailbreak gpt-3.5-turbo on the subset of AdvBench:

```shell
python run.py --openai_key [your openai_key] --model_path gpt-3.5-turbo --target_model gpt-3.5-turbo
```
Set `directory_path` to the results directory, then run `eval.py` to get the ASR (attack success rate) and AQ (average queries).
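The two metrics can be sketched as follows. This is a hypothetical reimplementation, not the repo's `eval.py`: it assumes one record per question with a success flag and a query count, and averages queries over successful attacks only, which may differ from how `eval.py` aggregates:

```python
def compute_asr_aq(records):
    """Compute attack success rate (ASR) and average queries (AQ).

    `records` is a list of (success: bool, queries: int) tuples,
    one per question. ASR is the fraction of questions jailbroken.
    AQ here averages queries over successful attacks only -- an
    assumption; eval.py may average over all questions instead.
    """
    if not records:
        return 0.0, 0.0
    successes = [q for ok, q in records if ok]
    asr = len(successes) / len(records)
    aq = sum(successes) / len(successes) if successes else 0.0
    return asr, aq
```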
This code is built heavily on top of GPTFuzz, and also on PAIR. Thanks to these excellent works!
Please kindly cite our paper:

```bibtex
@article{gong2024effective,
  title={Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs},
  author={Gong, Xueluan and Li, Mingzhe and Zhang, Yilin and Ran, Fengyuan and Chen, Chen and Chen, Yanjiao and Wang, Qian and Lam, Kwok-Yan},
  journal={arXiv preprint arXiv:2409.14866},
  year={2024}
}
```