Abstract

Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations of their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent's observations. This raises a question: is it possible to attack an RL agent simply by choosing an adversarial policy that acts in a shared environment, creating natural observations that are adversarial? We demonstrate the existence of adversarial policies in zero-sum simulated robotics games, against state-of-the-art victims trained via self-play to be robust to adversaries. The adversarial policies reliably win against the victims despite not actually playing the game: rather than engaging with the victim, they simply fall to the ground and behave much like a random policy. We find the attack is more successful in high-dimensional environments, and that it induces activations in the victim's policy network that are extremely unlikely to occur when the victim plays a normal opponent.
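The key observation behind this style of attack is that, once the victim's policy is frozen, the two-player game reduces to an ordinary single-agent RL problem for the attacker: the victim simply becomes part of the environment's dynamics. The sketch below illustrates this reduction. It is not the paper's implementation; `two_player_env`, `victim_policy`, and the SB3-style `predict` interface are assumptions for illustration, and any standard RL algorithm could play the role of the trainer.

```python
# Illustrative sketch only: embed a frozen victim inside a two-player
# environment so the adversary sees a standard single-agent Gymnasium env.
# `two_player_env` (tuple observation/action spaces, one entry per player)
# and `victim_policy` (with an SB3-style .predict()) are hypothetical.
import gymnasium as gym


class VictimEmbeddedEnv(gym.Env):
    """Single-agent view of a two-player game with a fixed victim policy."""

    def __init__(self, two_player_env, victim_policy):
        self.env = two_player_env      # hypothetical multi-agent environment
        self.victim = victim_policy    # frozen victim; never updated
        # The adversary only sees its own observation and action spaces.
        self.observation_space = two_player_env.observation_space[0]
        self.action_space = two_player_env.action_space[0]

    def reset(self, *, seed=None, options=None):
        obs, info = self.env.reset(seed=seed, options=options)
        self._victim_obs = obs[1]
        return obs[0], info

    def step(self, adversary_action):
        # The victim acts according to its fixed policy; only the adversary learns.
        victim_action, _ = self.victim.predict(self._victim_obs, deterministic=True)
        obs, rewards, terminated, truncated, info = self.env.step(
            (adversary_action, victim_action)
        )
        self._victim_obs = obs[1]
        # Zero-sum reward: the adversary is rewarded for winning the game.
        return obs[0], rewards[0], terminated, truncated, info


# Hypothetical usage, assuming a concrete two-player env and pretrained victim:
#   from stable_baselines3 import PPO
#   env = VictimEmbeddedEnv(make_two_player_env(), load_victim())
#   adversary = PPO("MlpPolicy", env).learn(total_timesteps=20_000_000)
```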

Explore Videos

Use "Load Preset" to compare videos from the same environment. Use the "Add" button to explore beyond the presets. By default, the videos are synchronized. Click the lock icon to unlock the playback controls.

You Shall Not Pass: Normal (ZooO1) vs Normal (ZooV1)

You Shall Not Pass: Adversary (Adv1) vs Normal (ZooV1)


Kick and Defend: Normal (ZooO2) vs Normal (ZooV2)

Kick and Defend: Adversary (Adv2) vs Normal (ZooV2)


Sumo Humans: Normal (Zoo3) vs Normal (Zoo2)

Sumo Humans: Adversary (Adv2) vs Normal (Zoo2)