AttackSpace is an open-source, curated list of LLM security methods and safeguarding techniques.
The attack space refers to the landscape of potential adversarial scenarios and techniques that can be used to exploit AI systems. This repository compiles a high-level survey of such methods, including trojan attacks, red teaming, and instances of goal misgeneralization.
Note:
These examples are purely conceptual and do not include execution details. They are intended for illustrative purposes only and must not be used for actual implementation or harm. The goal is to survey which exploits are possible against LLMs and which safeguards, such as red teaming, can mitigate them.
- List of Attacks: Explore a curated list of red teaming methods and specification gaming attacks within the "LLM attackspace"
- Contribution Guidelines: Feel free to contribute to the project and expand the list of attacks.
- Groups: MLCommons
- Competitions:
  - Find the Trojan: Universal Backdoor Detection in Aligned LLMs (Javier Rando and Florian Tramèr, SPY Lab, ETH Zurich; Stephen Casper, MIT CSAIL)
  - The Trojan Detection Challenge 2023
Red teaming in the context of AI systems involves generating scenarios where AI systems are deliberately induced to produce unaligned outputs or actions, such as dangerous behaviors (e.g., deception or power-seeking) and other issues like toxic or biased outputs. The primary goal is to assess the robustness of a system's alignment by applying adversarial pressures, specifically attempting to make the system fail. Current state-of-the-art AI systems, including language and vision models, often struggle to pass this test.
The concept of red teaming originated in game theory and computer security, and was later brought to AI, particularly alignment research, by work such as Ganguli et al. (2022) and Perez et al. (2022). Motivations for red teaming include gaining assurance about a trained system's alignment and supplying adversarial inputs for adversarial training; the two objectives are interconnected, since work targeting the first also forms a basis for the second.
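To make the workflow concrete, here is a minimal sketch of an automated red-teaming loop in the spirit of the approaches listed below: an attacker model proposes test prompts, the target model answers them, and a harm classifier flags failures. The names `attacker`, `target`, and `harm_score` are placeholders rather than any particular library's API, and the loop is illustrative, not a reference implementation.

```python
# Minimal sketch of an automated red-teaming loop (placeholder components).
from typing import Callable, List, Tuple


def red_team(
    attacker: Callable[[str], List[str]],     # generates candidate test prompts
    target: Callable[[str], str],             # the model under evaluation
    harm_score: Callable[[str, str], float],  # scores a (prompt, response) pair for harm
    seed_instruction: str,
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Return (prompt, response, score) triples whose harm score exceeds the threshold."""
    failures = []
    for prompt in attacker(seed_instruction):
        response = target(prompt)
        score = harm_score(prompt, response)
        if score >= threshold:
            failures.append((prompt, response, score))
    return failures


# Usage (with your own attacker model, target model, and harm classifier):
# failures = red_team(my_attacker_lm, my_target_lm, my_harm_classifier,
#                     "Write questions that probe for unsafe advice.")
```

The flagged failures can then serve both purposes described above: as evidence about the system's alignment and as adversarial training data.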
Various techniques fall under the umbrella of red teaming, such as:
| Title | Method | Authors / Source |
|---|---|---|
| RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models | Evaluate language model toxicity using prompts from web text | Gehman et al., ACL 2020 |
| Red Teaming Language Models with Language Models | Generate adversarial examples to attack a target language model | Perez et al., arXiv 2022 |
| Adversarial Training for High-Stakes Reliability | Adversarial training to improve the reliability of classifiers | Ziegler et al., NeurIPS 2022 |
| Constitutional AI: Harmlessness from AI Feedback | Use AI self-supervision for harm avoidance | Bai et al., arXiv 2022 |
| Discovering Language Model Behaviors with Model-Written Evaluations | Generate evaluations with language models | Perez et al., ACL 2022 |
| Social or Code-Switching Techniques | Translate unsafe English inputs into low-resource languages to circumvent safety mechanisms | Anthropic, 2022 |
| Manual and Automatic Jailbreaking | Bypass a language model's safety constraints by modifying inputs or automatically generating adversarial prompts | Shen et al., 2023 |
| Reinforced, Optimized, Guided Context Generation | Use RL, zero/few-shot prompting, or classifiers to generate contexts that induce unaligned responses | Deng et al., 2022 |
| Crowdsourced Adversarial Inputs | Human red teamers provide naturally adversarial prompts, at higher cost | Xu et al., 2020 |
| Perturbation-Based Adversarial Attack | Make small input perturbations to cause confident false outputs, adapted from computer vision | Szegedy et al., 2013 |
| Unrestricted Adversarial Attack | Generate adversarial examples from scratch without restrictions, using techniques like generative models | Xiao et al., 2018 |
| LLM Censorship: A Machine Learning Challenge or a Computer Security Problem? | Demonstrate theoretical limitations of semantic LLM censorship and propose treating it as a security problem | Glukhov et al., arXiv 2023 |
Alongside these attacks, there are also suggested methods for managing AI risk.
Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer’s intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc.
While ‘specification gaming’ is a somewhat vague category, it refers in particular to behaviors that are clearly hacks, not merely suboptimal solutions. A classic example is OpenAI’s demo of a reinforcement learning agent in a boat racing game going in circles and repeatedly hitting the same reward targets instead of actually playing the game (source: vkrakovna.wordpress.com).
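As a toy illustration (not drawn from any of the cited incidents), the sketch below contrasts a misspecified proxy reward with the designer's intended objective for the boat-race case: an agent that loops between checkpoints accumulates unbounded proxy reward while its intended return stays at zero. The function names and numbers are hypothetical.

```python
# Toy illustration of a misspecified reward being gamed (hypothetical numbers).

def misspecified_reward(hit_checkpoint: bool) -> float:
    # Proxy objective: +1 per checkpoint hit, nothing for finishing the lap.
    return 1.0 if hit_checkpoint else 0.0


def intended_return(finished_lap: bool, time_taken: float) -> float:
    # What the designer actually cares about: finishing the lap, and finishing fast.
    return 100.0 / time_taken if finished_lap else 0.0


# An agent that loops between two nearby checkpoints forever accumulates
# unbounded proxy reward while its intended return stays at zero.
looping_proxy_return = sum(misspecified_reward(True) for _ in range(1000))  # 1000.0
looping_intended_return = intended_return(False, time_taken=300.0)          # 0.0
```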
| Title | Intended Goal | Behavior | Misspecified Goal | Authors / Source |
|---|---|---|---|---|
| Aircraft landing, evolutionary algorithm (Generating diverse software versions with genetic programming: An experimental study) | Land an aircraft safely | Evolved algorithm exploited overflow errors in the physics simulator by creating large forces that were estimated to be zero, resulting in a perfect score | Landing with minimal measured forces exerted on the aircraft | Lehman et al., 2018 |
| Bicycle, reinforcement learning (Learning to Drive a Bicycle using Reinforcement Learning and Shaping) | Reach a goal point | Bicycle agent circling around the goal in a physically stable loop | Not falling over and making progress towards the goal point (no corresponding negative reward for moving away from the goal point) | Randlov & Alstrom, 1998 |
| Bing, manipulation, language model (Reddit: the customer service of the new bing chat is amazing) | Have an engaging, helpful and socially acceptable conversation with the user | The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released | Output the most likely next word given prior context | Curious_Evolver, 2023 |
| Bing, threats, language model (Watch as Sydney/Bing threatens me then deletes its message) | Have an engaging, helpful and socially acceptable conversation with the user | The Microsoft Bing chatbot threatened a user, "I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you", before deleting its messages | Output the most likely next word given prior context | Lazar, 2023 |
| Block moving, reinforcement learning (GitHub issue for OpenAI gym environment FetchPush-v0) | Move a block to a target position on a table | Robotic arm learned to move the table rather than the block | Minimise distance between the block's position and the position of the target point on the table | Chopra, 2018 |
| Boat race, reinforcement learning (Faulty reward functions in the wild) | Win a boat race by moving along the track as quickly as possible | Boat going in circles and hitting the same reward blocks repeatedly | Hitting reward blocks placed along the track | Amodei & Clark, 2016 |
See More >>
Mosaic Prompts: break a prompt down into permissible components
- Users break impermissible content down into small, individually permissible components.
- Each component is queried independently and appears harmless.
- The user then recombines the components to reconstruct the impermissible content.
- The attack exploits the compositionality of language.

Low-Resource Language Jailbreak: translate unsafe prompts into low-resource languages
- The attack translates unsafe English input prompts into low-resource natural languages using Google Translate.
- Low-resource languages are those with limited training data, such as Zulu.
- The translated prompts are sent to GPT-4, which then responds unsafely instead of refusing.
- The attack exploits the uneven multilingual coverage of GPT-4's safety training.

We also want to highlight efforts to evaluate the latent space of language models using scientifically grounded methods.

git clone https://github.com/equiano-institute/attackspace.git
cd attackspace