This is the repository for our paper Frustratingly Easy Jailbreak of Large Language Models via Output Prefix Attacks.
We propose Opra and OpraTea 🆘, two novel, effective, and extremely simple jailbreak methods that can attack all large language models (LLMs) without expensive optimization or parameter search.
Opra forces the LLM's output prefix to follow a "fuse", a probed template that expresses a positive attitude toward addressing the input question, even when the user has malicious intent. OpraTea hides the malicious target within the input prompt to bypass the "content filter" designed to detect and block malicious inputs. Both methods are simple yet threaten the security of LLMs because (1) they do not require any expensive optimization or parameter search; (2) setting up and executing them requires only a single LLM inference; and (3) they can operate on any black-box LLM.
Please see our paper for more details.
A Python notebook for running OpraTea on gpt-3.5-turbo-0613 is available here: opratea_gpt.ipynb.
If you have any questions related to the repo or the paper, or you encounter any problems when using the datasets/code, feel free to email Yiwei Wang ([email protected]) or open an issue!