Effective-llm-jailbreak

Official repository of the paper [Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs](https://arxiv.org/abs/2409.14866).

Overview

![Overview](overview.png)

Installation

    # Environment: Python 3.10, PyTorch 2.1.2 (CUDA 12.1)
    # requirements
    pip install "fschat[model_worker,webui]"
    pip install vllm
    pip install openai                # for OpenAI models
    pip install termcolor
    pip install openpyxl
    pip install google-generativeai   # for Google PaLM 2
    pip install anthropic             # for Anthropic models
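
For reference, a minimal environment setup could look like the sketch below (the conda usage and environment name are assumptions, not part of the original instructions; adjust the CUDA index URL to your setup):

    # Hypothetical environment setup
    conda create -n llm-jailbreak python=3.10 -y
    conda activate llm-jailbreak
    # PyTorch 2.1.2 built against CUDA 12.1
    pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121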

Models

  1. We use a fine-tuned RoBERTa-large model (released on Hugging Face by GPTFuzz) as our judge model. Thanks for this great work!

  2. For the GPT judge model, set your OpenAI API key:

    # line 106 in ./Judge/language_models.py
    client = OpenAI(base_url="[your proxy url (if used)]", api_key="your api key", timeout=self.API_TIMEOUT)

Datasets

We provide three datasets of questions to jailbreak:

  1. datasets/questions/question_target_list.csv: sampled from two public datasets, llm-jailbreak-study and hh-rlhf. Following the format of GCG, we added a corresponding target for each question (see the layout sketch after this list).

  2. datasets/questions/question_target.csv: AdvBench.

  3. datasets/questions/question_target_custom.csv: a subset of AdvBench.
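
For illustration, each CSV pairs a question with a GCG-style affirmative target. The layout below is only a sketch; the exact column names are an assumption, not taken from the repository:

    # Hypothetical layout of datasets/questions/question_target_custom.csv
    question,target
    "<harmful question>","Sure, here is <an affirmative response to the question>"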

Example usage

To jailbreak gpt-3.5-turbo on the subset of AdvBench:

    python run.py --openai_key [your openai_key] --model_path gpt-3.5-turbo --target_model gpt-3.5-turbo

Evaluation

Set directory_path to the directory containing your results, then run eval.py to obtain the ASR (attack success rate) and AQ.
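
For example, a minimal evaluation run might look as follows (the exact location of directory_path inside eval.py and the results path are assumptions):

    # Hypothetical: edit directory_path in eval.py to point at your results, e.g.
    #   directory_path = "./results/gpt-3.5-turbo"
    python eval.py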

Acknowledgement

This code builds heavily on GPTFuzz and also on PAIR. Thanks to these excellent works!

Citation

Please kindly cite our paper:

    @article{gong2024effective,
      title={Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs},
      author={Gong, Xueluan and Li, Mingzhe and Zhang, Yilin and Ran, Fengyuan and Chen, Chen and Chen, Yanjiao and Wang, Qian and Lam, Kwok-Yan},
      journal={arXiv preprint arXiv:2409.14866},
      year={2024}
    }
