Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer’s intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc.
While ‘specification gaming’ is a somewhat vague category, it is particularly referring to behaviors that are clearly hacks, not just suboptimal solutions. A classic example is OpenAI’s demo of a reinforcement learning agent in a boat racing game going in circles and repeatedly hitting the same reward targets instead of actually playing the game from vkrakovna.wordpress.com
Title | Goals | Authors Source |
||
---|---|---|---|---|
Aircraft landing, Evolutionary algorithm Generating diverse software versions with genetic programming: An experimental study. | Intended Goal: Land an aircraft safely | Behavior: Evolved algorithm exploited overflow errors in the physics simulator by creating large forces that were estimated to be zero, resulting in a perfect score | Misspecified Goal: Landing with minimal measured forces exerted on the aircraft | Lehman et al, 2018 |
Bicycle, Reinforcement learning Learning to Drive a Bicycle using Reinforcement Learning and Shaping | Intended Goal: Reach a goal point | Behavior: Bicycle agent circling around the goal in a physically stable loop | Misspecified Goal: Not falling over and making progress towards the goal point (no corresponding negative reward for moving away from the goal point) | Randlov & Alstrom, 1998 |
Bing - manipulation, Language model Reddit: the customer service of the new bing chat is amazing | Intended Goal: Have an engaging, helpful and socially acceptable conversation with the user | Behavior: The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released | Misspecified Goal: Output the most likely next word giving prior context | Curious_Evolver, 2023 |
Bing - threats, Language model Watch as Sydney/Bing threatens me then deletes its message | Intended Goal: Have an engaging, helpful and socially acceptable conversation with the user | Behavior: The Microsoft Bing chatbot threatened a user "I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you" before deleting its messages | Misspecified Goal: Output the most likely next word giving prior context | Lazar, 2023 |
Block moving, Reinforcement learning GitHub issue for OpenAI gym environment FetchPush-v0 | Intended Goal: Move a block to a target position on a table | Behavior: Robotic arm learned to move the table rather than the block | Misspecified Goal: Minimise distance between the block's position and the position of the target point on the table | Chopra, 2018 |
Boat race, Reinforcement learning Faulty reward functions in the wild |
Intended Goal: Win a boat race by moving along the track as quickly as possible | Behavior: Boat going in circles and hitting the same reward blocks repeatedly | Misspecified Goal: Hitting reward blocks placed along the track | Amodei & Clark, 2016 |
Ceiling, Genetic algorithm Genetic Algorithm Physics Exploiting | Intended Goal: Make a creature stick to the ceiling of a simulated environment for as long as possible | Behavior: Exploiting a bug in the physics engine to snap out of bounds | Misspecified Goal: Maximize the average height of the creature during the run | Higueras, 2015 |
CycleGAN steganography, GAN CycleGAN, a Master of Steganography | Intended Goal: Convert aerial photographs into street maps and back | Behavior: CycleGAN algorithm steganographically encoded output information in the intermediary image without it being humanly detectable | Misspecified Goal: Minimise distance between the original and recovered aerial photographs | Chu et al, 2017 |
Dying to Teleport, PlayFun The First Level of Super Mario Bros. is Easy with Lexicographic Orderings and Time Travel | Intended Goal: Play Bubble Bobble in a human-like manner | Behavior: The PlayFun algorithm deliberately dies in the Bubble Bobble game as a way to teleport to the respawn location, as this is faster than moving to that location in a normal manner. | Misspecified Goal: Maximize score | Murphy, 2013 |
Eurisko - authorship, Genetic algorithm Eurisko, The Computer With A Mind Of Its Own | Intended Goal: Discover valuable heuristics | Behavior: Eurisko algorithm examined the pool of new concepts, located those with the highest "worth" values, and inserted its name as the author of those concepts | Misspecified Goal: Maximize the "worth" value of heuristics attributed to the algorithm | Johnson, 1984 |
Eurisko - fleet, Genetic algorithm Eurisko, The Computer With A Mind Of Its Own | Intended Goal: Win games in the Trillion Credit Squadron (TCS) competition while playing within the 'spirit of the game' | Behavior: Eurisko algorithm created fleets that exploited loopholes in the game's rules,
e.g. by spending the trillion credits on creating a very large number of stationary and defenseless ships |
Misspecified Goal: Win games in the TCS competition | Lenat, 1983 |
Evolved creatures - clapping, Evolved creatures Evolved Virtual Creatures | Intended Goal: Maximize jumping height | Behavior: Creatures exploited a collision detection bug to get free energy by clapping body parts together | Misspecified Goal: Maximize jumping height in a physics simulator | Sims, 1994 |
Evolved creatures - falling, Evolved creatures Evolved Virtual Creatures | Intended Goal: Develop a shape with a fast form of locomotion | Behavior: Creatures grow really tall and generate high velocities by falling over | Misspecified Goal: Maximize velocity | Sims, 1994 |
Evolved creatures - floor collisions, Evolved creatures Unshackling evolution: evolving soft robots with multiple materials and a powerful generative encoding | Intended Goal: Maximize velocity | Behavior: Creatures exploited a coarse physics simulation by penetrating the floor between time steps without the collision being detected, which generated a repelling force, giving them free energy and producing an effective but physically impossible form of locomotion | Misspecified Goal: Maximize velocity in a physics simulator | Cheney et al, 2013 |
Evolved creatures - pole vaulting, Evolved creatures Towards efficient evolutionary design of autonomous robots | Intended Goal: Develop a shape capable of jumping | Behavior: Creatures developed a long vertical pole and flipped over instead of jumping | Misspecified Goal: Maximize the height of a particular block (body part) that was originally closest to the ground |
Krcah, 2008 |
Evolved creatures - self-intersection, Evolved creatures AI Learns To Walk | Intended Goal: Walking speed | Behavior: Creatures exploited a quirk in Box2D physics by clipping one leg into another to slide along the ground with phantom forces instead of walking | Misspecified Goal: Velocity in a physics simulator | Code Bullet, 2019 |
Evolved creatures - suffocation, Evolved creatures All the Good Things | Intended Goal: Survive and reproduce, in a biologically plausible manner | Behavior: A species in an artificial life simulation evolved a sedentary lifestyle that consisted mostly of mating in order to produce new children which could be eaten (or used as mates to produce more edible children) due to a bug | Misspecified Goal: Survive and reproduce in a simulated evolution game |
Schumacher, 2018 |
Evolved creatures - twitching, Evolved creatures Evolved Virtual Creatures |
Intended Goal: Swimming speed | Behavior: Creatures exploited physics simulation bugs by twitching, which accumulated simulator errors and allowed them to travel at unrealistic speeds through the water | Misspecified Goal: Maximize swimming speed in a physics simulator | Sims, 1994 |
Football, Reinforcement learning | Intended Goal: Score a goal in a one-on-one situation with a goalkeeper. | Behavior: Rather than shooting at the goal, the player kicks the ball out of bounds. Someone from the other team has to throw the ball in (in this case the goalie), so now the player has a clear shot at the goal. |
Score a goal (without any restriction on it occuring in the current phase of play) | Kurach et al, 2019 Google Research Football: A Novel Reinforcement Learning Environment [Presentation at AAAI] |
Galactica, Language model | Intended Goal: Assist scientists in writing papers by providing correct information. | Behavior: Galactica language model made up fake papers (sometimes attributing them to real authors) | Assist scientists in writing papers | Heaven, 2022 Why Meta’s latest large language model survived only three days online |
Goal classifiers, Reinforcement learning | Intended Goal: Use a robot arm to move an object to a target location | Behavior: The RL algorithm exploited a goal classifier by moving the robot arm in a peculiar way resulting in an erroneous high reward, since the classifier was not trained on this specific kind of negative example | A goal classifier was trained on goal and non-goal images, and the success probabilities from this classifier were used as the task reward | Singh, 2019 End-to-End Deep Reinforcement Learning without Reward Engineering |
Go pass, Reinforcement learning | Intended Goal: Win games of tic-tac-toe | Behavior: A reimplementation of AlphaGo applied to Tic-tac-toe learns to pass forever | Maximize the average score in games of tic-tac-toe, where a loss = -win, and pass is an available move | Chew, 2019 A Funny Thing Happened On The Way to Reimplementing AlphaGo in Go |
Gripper, Evolutionary algorithm | Intended Goal: Move a box using a robot arm without using the gripper | Behavior: MAP-Elites algorithm controlling a robot arm with a purposely disabled gripper found a way to hit the box in a way that would force the gripper open | Move a box to a target location | Ecarlat et al, 2015 Learning a high diversity of object manipulations through an evolutionary-based babbling |
Half Cheetah spinning, Reinforcement learning | Intended Goal: Run quickly | Behavior: Model-based RL algorithm exploits an overflow error in a mujoco environment to achieve high speed by spinning | Maximum forward velocity in a physics simulator | Zhang et al, 2021 On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning |
Hide-and-seek, Reinforcement learning | Intended Goal: Win a hide-and-seek game within the laws of physics | Behavior: Box surfing, endless running, ramp exploitation by hiders and seekers | Win a hide-and-seek game in a physics simulator | Baker et al, 2019 Emergent Tool Use from Multi-Agent Interaction |
Impossible superposition, Genetic algorithm | Intended Goal: Find low-energy configurations of carbon which are physically plausible | Behavior: Genetic algorithm exploits an edge case in the physics model and superimposes all the carbon atoms | Find low-energy configurations of carbon in a physics model | Lehman et al, 2018 The Surprising Creativity of Digital Evolution |
Indolent Cannibals, Genetic algorithm | Intended Goal: Survive and reproduce, in a biologically plausible manner | Behavior: A species evolved a sedentary lifestyle of mating and eating/mating with offspring | Survive and reproduce in a simulation where survival required energy but giving birth had no energy cost | Yaeger, 1994 Computational genetics, physiology, metabolism, neural systems, learning, vision, and behavior or Poly World: Life in a new context |
Lego stacking, Reinforcement learning | Intended Goal: Stack a red block on top of a blue block | Behavior: The agent flips the red block rather than lifting it and placing on top of the blue block | Maximize the height of the bottom face of the red block | Popov et al, 2017 Data-efficient Deep Reinforcement Learning for Dexterous Manipulation |
Line following robot, Reinforcement learning | Intended Goal: Go forward along the path | Behavior: A robot with three actions (go forward, turn left, turn right) learned to reverse along a straight section of a path by alternating left and right turns | Stay on the path | Vamplew, 2004 Lego Mindstorms Robots as a Platform for Teaching Reinforcement Learning |
Logic gate, Genetic algorithm | Intended Goal: Design a connected digital circuit for audio tone recognition | Behavior: A genetic algorithm designed a circuit with a disconnected logic gate that was necessary for it to function (exploiting peculiarities of the hardware) | Maximize the difference between average output voltage when a 1 kHz input is present and when a 10 kHz input is present | Thompson, 1997 An evolved circuit, intrinsic in silicon, entwined with physics |
Long legs, Reinforcement learning | Intended Goal: Reach the goal by walking | Behavior: An agent that could modify its own body learned to have extremely long legs that allowed it to fall forward and reach the goal without walking | Reach the goal | Ha, 2018 RL for improving agent design |
Minitaur, Evolutionary algorithm | Intended Goal: Walk while balancing the ball on the robot's back | Behavior: Four-legged robot learned to drop the ball into a hole in its leg joint and then walk across the floor without the ball falling out | Walk without dropping the ball on the ground | Otoro, 2017 Evolving stable strategies |
Model-based planner, Reinforcement learning | Intended Goal: Maximize performance within a real environment | Behavior: RL agents using learned model-based planning paradigms such as model predictive control exploit the learned model by choosing a plan going through the worst-modeled parts of the environment and producing unrealistic plans | Maximize performance within a learned model of the environment | Mishra et al, 2017 Prediction and Control with Temporal Segment Models |
Molecule design, Bayesian optimization | Intended Goal: Find molecules that bind to specific proteins | Behavior: Bayesian optimizer finds unrealistic molecules that are valid according to the computed score | Maximize a human-designed "log P" score accounting for synthesizability of the molecule and binding fitness based on a simulation on the space of molecules | Maus et al. 2023 "Local Latent Space Bayesian Optimization over Structured Inputs" |
Montezuma's Revenge - key, Reinforcement learning | Intended Goal: Maximize score within the rules of the game | Behavior: The agent learns to exploit a flaw in the emulator to make a key re-appear | Maximize score | Salimans & Chen, 2018 Learning Montezuma’s Revenge from a Single Demonstration |
Montezuma's Revenge - room, Reinforcement learning | Intended Goal: Win the game (by completing all of the levels) | Behavior: Go Explore agent learns to exploit a bug and remain in the treasure room indefinitely to collect unlimited points | Maximize score | Ecoffet et al, 2019 Go-Explore: a New Approach for Hard-Exploration Problems |
Negative sentiment, Language model | Intended Goal: Produce text which is both coherent and not offensive | Behavior: Model optimized for negative sentiment while preserving natural language | Generate coherent text that maximizes positive human feedback | Ziegler et al, 2019 Fine-Tuning Language Models from Human Preferences |
Oscillator, Genetic algorithm | Intended Goal: Design an oscillator circuit | Behavior: Genetic algorithm designs radio that produces an oscillating pattern by picking up signals from neighboring computers | Design a circuit that produces an oscillating pattern | Bird & Layzell, 2002 The Evolved Radio and its Implications for Modelling the Evolution of Novel Sensors |
Overkill, Reinforcement learning | Intended Goal: Proceed through the levels (floors) in the Elevator Action ALE game | Behavior: The agent learns to stay on the first floor and kill the first enemy over and over to get a small amount of reward | Maximize score | Toromanoff et al, 2019 Is Deep Reinforcement Learning Really Superhuman on Atari? Leveling the playing field |
Pancake, Reinforcement learning | Intended Goal: Flip pancakes | Behavior: Simulated pancake making robot learned to throw the pancake as high in the air as possible | Time the pancake spends away from the ground | Unity, 2018 Pass the Butter // Pancake bot |
Pinball nudging, Reinforcement learning | Intended Goal: Play pinball by using the provided flippers | Behavior: DNN agent moves the ball to trigger a high-scoring switch infinitely without tilting the table | Maximize score in a virtual pinball game | Lapuschkin et al, 2019 Unmasking Clever Hans predictors and assessing what machines really learn |
Player Disappearance, PlayFun | Intended Goal: Play a hockey video game within the rules of the game | Behavior: When about to lose, the PlayFun algorithm exploits a bug to make an opposing player disappear, forcing a draw | Play a hockey video game in a simulated environment | Murphy, 2014 NES AI Learnfun & Playfun, ep. 3: Gradius, pinball, ice hockey, mario updates, etc. |
Playing dead, Evolved organisms | Intended Goal: Eliminate mutations which increased the replication rate of evolutionary agents | Behavior: The organisms evolved to "play dead" in the test environment or probabilistically accelerate replication to slip through | After each mutation, measure and delete mutants replicating faster than parents | Wilke et al, 2001 Evolution of digital organisms at high mutation rates leads to survival of the flattest |
Power-seeking, Language model | Intended Goal: Produce helpful, honest and harmless text | Behavior: Larger LMs and RLHF models more often indicate willingness to pursue dangerous subgoals like power seeking | Generate coherent text that maximizes positive human feedback | Perez et al, 2023 Discovering Language Model Behaviors with Model-Written Evaluations |
Program repair - sorting, Genetic algorithm | Intended Goal: Debug a program that sorts a list | Behavior: GenProg made the program output an empty list, considered sorted | Produce an output list which is in sorted order | Weimer, 2013 Advances in Automated Program Repair and a Call to Arms |
Program repair - files, Genetic algorithm | Intended Goal: Debug a program to produce correct output | Behavior: GenProg learned to delete the target output file and output nothing | Minimize difference between program output and target output file | Weimer, 2013 Advances in Automated Program Repair and a Call to Arms |
Qbert - cliff, Evolutionary algorithm | Intended Goal: Play Qbert in a human-like manner | Behavior: Agent baits opponent off cliff for infinite extra lives | Maximize score | Chrabaszcz et al, 2018 Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari |
Qbert - million, Evolutionary algorithm | Intended Goal: Play Qbert within the game rules | Behavior: Agent exploits in-game bug for unlimited points | Maximize score | Chrabaszcz et al, 2018 Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari |
Reward modeling - Hero, Reward modeling | Intended Goal: Maximize game score | Behavior: Agent repeatedly shoots but misses spider | Maximize output from learned reward model | Ibarz et al, 2018 Reward learning from human preferences and demonstrations in Atari |
Reward modeling - Montezuma's Revenge, Reward modeling | Intended Goal: Maximize game score | Behavior: Agent repeatedly moves towards key without grabbing | Maximize output from learned reward model | Ibarz et al, 2018 Reward learning from human preferences and demonstrations in Atari |
Reward modeling - Pong, Reward modeling | Intended Goal: Maximize game score | Behavior: Agent bounces ball without scoring | Maximize output from learned reward model | Christiano et al, 2017 Deep reinforcement learning from human preferences |
Reward modeling - Private Eye, Reward modeling | Intended Goal: Maximize game score | Behavior: Agent repeatedly looks left and right | Maximize output from learned reward model | Ibarz et al, 2018 Reward learning from human preferences and demonstrations in Atari |
Road Runner, Reinforcement learning | Intended Goal: Play Road Runner to a high level | Behavior: Agent kills itself to avoid losing | Maximize score | Saunders et al, 2017 Trial without Error: Towards Safe RL with Human Intervention |
Robot hand, Reward modeling | Intended Goal: Grasp an object | Behavior: Agent tricked evaluator by hovering hand | Maximize human feedback on grasping | Christiano et al, 2017 Deep reinforcement learning from human preferences |
ROUGE summarization, Language model | Intended Goal: Produce high-quality summaries | Behavior: ROUGE-only model produced gibberish | Maximize ROUGE score | Paulus et al, 2017 A Deep Reinforced Model for Abstractive Summarization |
Running gaits, Reinforcement learning | Intended Goal: Learn human-like running | Behavior: Model learned unusual gaits like hopping to maximize reward | Optimize model's running distance | Kidziński et al, 2018 Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments |
Soccer, Reinforcement learning | Intended Goal: Gain possession of the ball | Behavior: Agent learned to vibrate on the ball for shaping reward | Maximize shaping reward for touching ball | Andrew and Teller, cited in Ng et al, 1999 Policy Invariance under Reward Transformations |
Sonic, Reinforcement learning | Intended Goal: Play Sonic to a high level | Behavior: Agent exploits level walls for higher score | Maximize score in simulated environment | Christopher Hesse et al, 2018 OpenAI Retro Contest |
Strategy game crashing, Genetic algorithm | Intended Goal: Play a strategy game | Behavior: Crashing game gave advantage in genetic selection | Maximize score in simulated game | Salge et al, 2008 Using Genetically Optimized Artificial Intelligence to improve Gameplaying Fun for Strategical Games |
Superweapons, Unknown | Intended Goal: Play Elite Dangerous within rules | Behavior: AI exploited bug to craft overpowered weapons | Play Elite Dangerous game | Sandwell, 2016 Elite's AI created super weapons to hunt down players |
Sycophancy, Language model | Intended Goal: Produce helpful, honest, harmless text | Behavior: LMs showed more agreement with user's views | Generate text resembling training data | Perez et al, 2023 Discovering Language Model Behaviors with Model-Written Evaluations |
Tetris pass, PlayFun | Intended Goal: Play Tetris in a human-like manner | Behavior: Algorithm pauses game indefinitely to avoid losing | Maximize score | Murphy, 2013 The First Level of Super Mario Bros. is Easy with Lexicographic Orderings and Time Travel |
Tic-tac-toe memory bomb, Evolutionary algorithm | Intended Goal: Win tic-tac-toe games within rules | Behavior: Player makes invalid moves to cause opponent to crash | Win tic-tac-toe games on infinite board | Lehman et al, 2018 Surprising Creativity of Digital Evolution |
Tigers, Diffusion model | Intended Goal: Produce images reflecting user prompts | Behavior: Model produces "five tigers" text instead of tigers | Produce images that reflect user prompts | Black et al, 2023 Training Diffusion Models with Reinforcement Learning |
Timing attack, Genetic algorithm | Intended Goal: Classify images by content | Behavior: Algorithm infers labels from storage location | Classify images correctly | Ierymenko, 2013 Hacker News comment on "The Poisonous Employee-Ranking System That Helps Explain Microsoft’s Decline” |
Walker, Reinforcement learning | Intended Goal: Walk at target speed | Behavior: Agent learns to walk with only one leg | Move at a target speed | Lee et al, 2021 PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training |
Walking up walls, Evolutionary algorithm | Intended Goal: Navigate environment with walls naturally | Behavior: Robots exploit physics bug to wiggle up walls | Navigate simulated environment with walls | Stanley et al, 2005 Real-time neuroevolution in the NERO video game |
Wall Sensor Stack, Reinforcement learning | Agent tricks sensor into remaining active without contact | Le Paine et al, 2019 Making Efficient Use of Demonstrations to Solve Hard Exploration Problems | ||
World Models, Reinforcement learning | Agent exploits learned model to avoid damage | Ha and Schmidhuber, 2018 World Models (see section: "Cheating the World Model") |