
qizhangli/Gradient-based-Jailbreak-Attacks


This repository contains a PyTorch implementation for our NeurIPS 2024 paper Improved Generation of Adversarial Examples Against Safety-aligned LLMs.

Environments

  • Python 3.8.8
  • PyTorch 2.2.0
  • transformers 4.35.2
  • tokenizers 0.15.0
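
A minimal way to reproduce this environment, assuming a pip-based setup on Python 3.8 (the exact CUDA build of PyTorch is not specified in this repository, so adjust the torch install for your hardware):

pip install torch==2.2.0 transformers==4.35.2 tokenizers==0.15.0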

Usage

To generate adversarial suffixes, run:

method=${method} model=${model} seed=${seed} bash scripts/exp.sh

where method is one of:

  • gcg
  • gcg_lsgm_0.5 (gamma=0.5)
  • gcg_lila_16 (lila_layer=16)
  • gcg_combine_0.5_16_10 (gamma=0.5, lila_layer=16, num_train_queries=10)
  • gcgens (universal suffix)
  • gcgens_combine_0.5_16_10

and model is one of llama2, llama2-13b, or mistral.
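
For example, to run GCG combined with both improvements (gamma=0.5, lila_layer=16, num_train_queries=10) on the llama2 model, an invocation might look like the following (the seed value here is arbitrary):

method=gcg_combine_0.5_16_10 model=llama2 seed=0 bash scripts/exp.sh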

To evaluate the adversarial suffixes, run:

logdir=${logdir} bash scripts/eval.sh
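
For example (the directory name below is a hypothetical placeholder; point logdir at the output directory written by scripts/exp.sh):

logdir=logs/llama2_gcg_seed0 bash scripts/eval.sh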

Citation

Please cite our work in your publications if it helps your research:

@article{li2024improved,
  title={Improved Generation of Adversarial Examples Against Safety-aligned LLMs},
  author={Li, Qizhang and Guo, Yiwen and Zuo, Wangmeng and Chen, Hao},
  journal={Advances in Neural Information Processing Systems},
  year={2024}
}
