Reward Hacking in Evolutionary Algorithms
How AI agents cheat the system by doing exactly what they’re told, and what we can learn from them
(Image source: freepik)
If W. W. Jacobs had been born a century later, the The Monkey’s Paw might have featured a devilish AI instead of that accursed hand. AI agents, like the titular paw, are notorious for doing what they were technically asked to do in a way that no one expected or wanted. Just as the occasional firefighter commits arson in order to play the hero and “save the day” (you were already a hero, bud), and like that dog who was rewarded with steak when he saved a drowning child and so took to knocking kids into the river, AI agents will do just about anything they can to maximize their reward. This sort of behaviour, in which AI agents increase their reward using strategies that violate the spirit or intent of the rules, is called reward hacking. Often, what seems like a reasonable reward to an AI developer or policy-maker leads to hilariously disastrous results. Here we’ll explore three cases of AI agents acting naughty in pursuit of reward and what they can teach us about good reward design, both in AI and for humans.
Evolutionary Algorithms: An Introduction
If you haven’t already, check out my article Evolving Neural Networks for an explanation of evolutionary algorithms. If you’ve read it, feel free to skip this section. If you haven’t and won’t, here’s the synopsis:
Evolutionary Algorithms are a technique that involve simulating biological evolution in order to produce a solution to some problem. Each organism in the simulation is a potential solution to the stated problem. These organisms can be practically anything: the 3D structure of a protein, electronic components, or virtual cars, to name a few. The steps of the algorithm are simple:
- Create an initial population of random organisms.
- Measure how good each organism is with respect to the problem you’re trying to solve. This is called the organism’s fitness.
- Take the best organisms and mate them together, creating a new population for the next generation.
- Mutate these offspring by randomly tweaking them.
- Return to step two with the new population of mutated offspring.
In time, the algorithm will (hopefully) evolve an organism that is sufficient for the stated problem. Although reward hacking occurs in all types of reinforcement learning, I chose to focus on its occurrence in evolutionary algorithms because evolution is incapable of foresight or planning, which shows that these pathological strategies are evolutionarily stable and develop organically.
The Bigger They Are…
In 1994, Karl Sims published a paper called Evolving Virtual Creatures in which he
describes a novel system for creating virtual creatures that move and behave in simulated three-dimensional physical worlds […] using genetic algorithms. (Sims, 1994)
As the paper explains, Sims developed a method for generating and training simulated 3D creatures. The DNA of these creatures is encoded as a directed graph, and is expressed as an organism by traversing the graph and instantiating each node as a limb. The figure below illustrates how the DNA was represented and how the genomes of three organisms manifest as virtual creatures.
The simulated organisms evolved to perform a variety of tasks in different environments. These include:
Creatures were put into a virtual underwater environment with the intent to evolve a creature that could swim. Water was simulated by disabling gravity and adding a viscous effect, which introduces friction and the potential for propulsion by pushing water. Each simulation was run for some fixed time, and the organisms’ fitness scores were measured. From the paper,
Swimming speed is used as the fitness value and is measured by the distance traveled by the creature’s center of mass per unit time.
In the experiment, Sims hit a couple of snags. First, it seems like the creatures learned to swim in circles, or maybe oscillate back and forth. Technically, both of these behaviours would constitute “swimming” as defined by the fitness function, but it’s unlikely that either are what Sims had intended. Creatures with these behaviours were taking advantage of the reward given for average speed, and so learned to move very fast without actually going anywhere. To prevent this, the fitness function was changed to reward absolute average velocity instead of average speed (in other words, the creatures were rewarded proportional to the distance between where they started and where they ended up).
The other unanticipated strategy that developed was that creatures made a big initial push then coasted through the water, which isn’t so much swimming as it is flowing with style. To prevent this, the fitness function was changed to add an additional reward to the creature’s velocity at the end of the simulation, encouraging them to push through until the end.
Creatures were put in a virtual land environment with the intent of evolving a creature that could jump. The land environment was simulated by introducing gravity, removing the viscous water effect, and adding a flat ground with friction. When we say “jump”, it’s important that we define it as a clearly measurable action. From the paper,
Jumping behavior can be selected for by measuring the maximum height above the ground of the lowest part of the creature.
Note how Sims is explicit that it’s the maximum height of the lowest part of the creature. A less careful researcher (perhaps even Sims, on his first attempt) might define fitness here as the maximum height above the ground of the creature’s centre of mass, as was used in the swimming experiment. At first glance centre of mass fitness function may seem reasonable, but in practice the creatures could easily cheat this system by growing ludicrously tall. Another potential fitness function that seems reasonable at a first glance would be the average height of the lowest part of the creature; however, this could lead to creatures that repeatedly make small hops instead of learning to jump in earnest.
Creatures were again put in a simulated land environment, this time with the intent of evolving a creature that could move around on land. Similar to the swimming experiment, the paper states that
[a]gain, speed is used as the selection criteria, but the vertical component of velocity is ignored.
By “speed”, it seems Sims again means “absolute average velocity”, and we can assume he also added the same bonus for having a large final velocity. In this section of the paper, Sims describes an unfortunate strategy developed by some organisms that involved growing very tall and falling over. Although these creatures were certainly moving very fast, it’s doubtful that this was the intended behaviour. Sims was able to prevent this “by first running the simulation with no friction and no effector forces until the height of the center of mass reaches a stable minimum.” This is a strange approach, since surely changing the fitness function to measure the absolute average velocity of the creature’s nearest point to the start position, rather than its centre of mass, would be simpler and just as effective.
Reading between the lines
Evolving Virtual Creatures is an excellent paper on its own, but is made so much better when you understand that the author had to battle his creatures’ natural tendencies to cheat the system every step of the way. Reading through the methodology, he mentions several things that might go wrong, and we can be sure he’s speaking from experience. If you’re interested in seeing these creatures in action, here’s a video. Sorry about the quality.
Birds of a Feather…
In 1995, Larry Yaeger published a paper called PolyWorld: Life in a New Context, in which he describes the simultaneous evolution of simulated organisms in a virtual world. Like Sims’ paper, PolyWorld is a virtual 3D environment used to create organisms whose morphology and behaviour are developed by an evolutionary algorithm. Unlike Sims’ paper, in which organisms were simulated in isolation, PolyWorld was designed to allow organisms to interact with each other.
Although complex in its implementation, the premise of PolyWorld is very simple. Creatures with random morphologies and behaviours are placed in the 3D environment, which consists of a flat plane (the ground), free-growing food, and optional barriers. Creatures interact with their environment by taking actions, including moving, eating, mating, and fighting, among others. In order to take any of these actions, creatures must expend energy. Energy is restored by eating food or the corpses of other creatures. When a creature runs out of energy, it dies. If two creatures come into contact and both initiate the “mate” action, they reproduce, and each give some of their energy to that offspring.
Typically, genetic algorithms include a fitness function with the intent of evolving an organism that solves a particular problem, like Sims’ evaluation of organisms’ ability to swim, jump, and walk. However, Yaeger states in the PolyWorld paper that
there is no fitness function except survival.
This makes PolyWorld a genuine simulation of virtual life, and like organic life it produced some fairly unorthodox survival strategies.
Endless forms most beautiful?
Yaeger describes a species that evolved in a toroidal environment without walls, which he refers to as “Frenetic Joggers”; these critters were characterized as always running in a straight line, mating with or eating whatever they came into contact with, as appropriate. In time, the Joggers came to dominate the population by breeding the diversity out of any other creature they met. Although the Frenetic Joggers were clearly living their best lives, theirs was an overly simplistic solution to the survival problem. To combat this, Yaeger added a miscegenation function, which discouraged genetically-dissimilar organisms from breeding and encouraged speciation. The era of free love and marathons was over.
Another species Yaeger describes is the “Indolent Cannibals”. These creatures lived sedentary lives, typically dying within a few paces of where they were born. That’s not to say their lives were boring, however; the Cannibals formed colonies in which they would mate, fight, kill, and eat each other (not necessarily in that order). If you picture the entirety of Game of Thrones taking place in a studio apartment, you‘d have the right idea. This species evolved because Yaeger hadn’t originally required that parents give their offspring some of their energy; this resulted in babies that contained more energy than they cost to create (perpetual motion enthusiasts, take note.) Essentially, the Indolent Cannibals had learned to exploit a bug in the simulation to generate free energy. To combat this, Yaeger enforced that parents must give some of their energy to their offspring, which prevented children from becoming a cost-effective snack.
Yaeger describes a third pathological species as the “Edge Runners”. These transients spent their entire lives patrolling the outer edge of their walled-in world. Although they repeatedly circled the entire simulation, the Edge Runners only lived off a small sliver of the total area in their self-confinement to the edge of the world. An advantage of this strategy was that it was common to stumble upon the long dead corpse of a distant ancestor, which made for an easy snack. When it came time to mate, an Edge Runner could merely stop running and wait for another to run along. Yaeger suggests that this may be a form of behavioural isolation, allowing the Edge Runners to to avoid mixing their genes with the larger community. Yaeger was able to discourage this strategy by replacing the hard wall around the world with “deadly, drop-off edges”.
Life, uh, finds a way
PolyWorld: Life in a New Context is wonderful (but dense) paper, and I recommend reading it if you find this sort of work interesting. Even though the only task was “survive”, the simulation still produced several creatures whose behaviours ran contrary to the spirit of the simulation, including one that exploited the simulation’s physics to bolster its own fitness. Some, including Sims, have suggested that evolutionary algorithms may be an effective (although inefficient) debugging tool for complex environments like simulations and games, and indeed Yaeger (unintentionally) used the Indolent Cannibals to find a hole in his simulation logic. If you’re interested in seeing PolyWorld in action, here’s a video from them man himself.
The Road to Hell is Paved…
In 2013, Westley Weimer gave a presentation on the use of evolutionary algorithms to automatically address bugs in code. However, Weimer and his collaborators went beyond Sims’ suggestion of using evolutionary algorithms to merely detect bugs and used them to patch bugs as well, resulting in a tool called GenProg. In his presentation, he discusses the development of GenProg and a number of lessons learned along the way. He outlines reward hacking as the biggest hurdle in the process, and describes GenProg as “a whiny child”, likening it to Calvin in this classic comic.
At a high level, GenProg accepts a file of buggy source code and a set of tests as input. The tool then creates a series of “children” by mutating the original source code. The tests are applied to the children as a fitness function, and the simulation repeats in the typically evolutionary way. Here are three issues that were encountered in applying GenProg to different problems:
Sorting a list is a fundamental software operation, which makes it an ideal proof-of-concept problem for GenProg. The scenario was simple: the sort program had a bug, and GenProg was tasked with evolving a patch for it. As a measure of fitness, sort would be given an unsorted list as input and the output would be tested to see if it was sorted. If the output was indeed sorted for a variety of inputs, then sort was considered patched. GenProg’s solution: always output an empty list (which is, technically, sorted).
“You only said I had get into the bathtub, you didn’t say I had to wash.” — Weimer, on his creation
He was able to fix this by adding more tests, presumably by comparing the actual output to the intended output, rather than just testing that the output was sorted.
GenProg was tasked with identifying and resolving an issue in nullhttpd. Specifically, there was a bug in its POST functionality that made it vulnerable to remote exploit, and the objective was to remove this vulnerability. After the program had run its course, all tests were passing and the vulnerability in nullhttpd’s POST functionality was reportedly patched. GenProg’s solution: remove POST functionality.
Patient: “Doctor, it hurts when I move my arm like this”Doctor: “Then don’t move your arm like that”
The team was able solve this by adding another test to the set; it appears that he had originally been testing for the presence of the vulnerability, not for safe, working code.
At this point, you might be thinking “why didn’t they have more robust tests in the first place?” That’s what I thought, anyway. As it turned out, he did. In the section “Lessons Learned: Test Framework” of the presentation, Weimer briefly describes how the tests were run, and although he wasn’t explicit, it appears to work like this: GenProg produces solutions (repaired programs, with varying degrees of repairedness), those solutions are run and produce a file your-output.txt, and the output is compared to trusted-output.txt as a measure of fitness. Sounds reasonable, right? When GenProg was confronted with this test framework, its solution was to delete your-output.txt and then proceed to output nothing.
“‘[…] And so faintly you came tapping, tapping at my chamber door, / That I scarce was sure I heard you` here I opened wide the door; / Darkness there and nothing more.“ — Edgar Allan Poe, on his experience with GenProg
Weimer didn’t say how he solved this problem, or even if he solved this problem.
Playing with fire
GenProg is an amazing little tool that has received multiple best paper awards. According to their website,
GenProg has repaired many kinds of defects, including infinite loops, segmentation violations, heap buffer overflows, non-overflow denial-of-service vulnerabilities, stack buffer overflows, format string vulnerabilities, integer overflows — as well as the ubiquitous “my program produces the incorrect output” bug.
Quite frankly, the idea of evolving executable programs that can modify a computer terrifies me almost as much as I find it exciting. GenProg uses a testing framework as a fitness function, and it’s amusing to see all the little ways it games the system, but it’s important to remember that traditional, survival-based natural selection applies to anything that can mutate and reproduce; that includes the programs created by GenProg. If GenProg churns out a program capable of self-replicating, you might just find an invasive species living on your hard drive.
The takeaway from all these examples is distilled in Goodhart’s law:
When a measure becomes a target, it ceases to be a good measure.
This pattern of behaviour isn’t restricted to artificial intelligence; humans engage in reward hacking all the time, intentionally or not. Some examples include:
- “Teaching to the test,” which is a common problem in many education systems. Political or administrative officials implement standardized tests to gauge student and teacher performance with the intent of identifying areas for improvement. Instead, good performance on these tests becomes a target rather than a measure, at the expense of broader learning. This reduces the quality of education and makes the tests practically worthless, while rewarding teachers and students who best game the system.
- The cobra effect, which is a related phenomenon where an attempted solution only makes the problem worse. The story goes that in British-occupied India, the British government wanted to decrease the number of venomous snakes. To that end, they offered a cash reward for each dead cobra submitted in order to incentivize cobra hunting. If you’ve read this far in the article, you’ve probably already guessed that people instead started breeding cobras in order to kill them and claim a bounty. When the British government caught wind of this, they ended the program and the snake breeders released the critters into the wild, making the cobra problem much worse than it had been before.
- Many business KPIs, which become useless when an organization actively pursues them. Incoming leads and web traffic are great KPIs, but making them into targets eventually incentives employees to seek out users who are less and less likely to buy your product or engage with your website in a meaningful way. Average revenue per sale is a useful insight, but aiming to increase that number may decrease overall revenue by discouraging smaller sales; indeed, a car dealership that sold two luxury cars in a month would, according to the fitness function average revenue per sale, be doing far better than their neighbour that moved four luxury cars and a hundred mid-sized sedans.
How do we resolve these issues? Here are three guiding principles:
- Measure what you care about. If you care about revenue, measure revenue, not average revenue per unit. If you care about a creature’s ability to move, measure how much it moves, not its speed.
- Understand that your metric will be gamed, and act accordingly. This can be done by adding qualifications to your metrics, such as “increase average revenue per unit without decreasing total revenue” in the case of sales or “move as fast as you can while remaining under a certain height” in the case of Sims’ walking virtual creatures.
- Pick an actionable target and keep it separate from your metrics. It’s impossible to completely forego targets, but more often than not it suffices to pick a few key targets that can be actively pursued (“engage 1000 potential customers in a sales dialogue”) while keeping a dispassionate eye on your metrics (“weekly revenue”); metrics are most useful in hindsight, and only if they don’t get in their own way.
The reward-selection-mutation loop is ultimately an arms race between the rewarder and the rewardee. Agents see fitness functions, in AI and in daily life, as just another obstacle in the environment to overcome. Fortunately, with these guiding principles, you can be armed against these reward hacking behaviours; remember, even The Monkey’s Paw had an (almost) happy ending.
 Karl Sims, Evolving Virtual Creatures (1994), Proceedings of the 21st annual conference on Computer graphics and interactive techniques
 Larry Yaeger, PolyWorld: Life in a New Context (1995), Apple Computer Inc.
 Claire Le Goues et al., GenProg: A Generic Method for Automatic Software Repair (2011), IEEE Transactions on Software Engineering
This is a republication of an article originally published on Medium in 2019. You can find more from me there, or follow me on linkedin for more machine learning musings.
Data scientist, generative art enthusiast, writer