Timo Denk edited this page Jun 23, 2018 · 1 revision

Summary of a lecture by Ian Goodfellow at Stanford University

What are Adversarial Examples?


  • an input that has been carefully constructed to fool a network into making a wrong decision

Where do they appear?

  • not just in CNNs, but also in logistic regression, SVMs, even in nearest neighbor algorithms
  • there are also fun adversarial examples for the human brain (Pinna and Gregory, 2002)

Why do Adversarial Examples Exist?

1st Idea: Overfitting

  • if the model has more parameters than it needs to fit the training data, it is prone to misclassifying new inputs
  • but: if this were true, adversarial examples would be random and therefore unique to each network; experiments have shown the opposite

The Systematic Effect of Adversarial Examples

  1. adversarial examples are transferable across networks and across network architectures
  2. if the perturbation x* - x between an adversarial example x* and its clean image x is added to another clean image y, the resulting image is often adversarial as well

Linearity in Deep Networks

  • in modern deep networks, the mapping from input to output is largely piecewise linear (e.g. due to activation functions like ReLU)
  • however, the mapping from parameters to output is highly nonlinear, which is why training is not easy
  • the near-linear mapping from input to output makes optimizing an input image for a desired output (the counterpart of training, which optimizes the parameters) very easy
  • the plot shown in the lecture gives the logit values for specific classes, which change almost linearly as the input image (a car) is shifted by eps * (small perturbation)

==> here, they found a perturbation direction that was associated with the frog class
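
The effect of linearity can be made concrete for a single linear unit. The following is a short worked derivation, not taken from these notes: the weight vector w, its dimension n, and the average weight magnitude m are symbols introduced here for illustration.

```latex
\begin{aligned}
w^\top (x + \eta) &= w^\top x + w^\top \eta \\
\eta = \epsilon \, \operatorname{sign}(w) \;&\Rightarrow\; w^\top \eta = \epsilon \, \lVert w \rVert_1 = \epsilon \, m \, n
\end{aligned}
```

The max norm of the perturbation stays at epsilon, while its effect on the activation grows linearly with the input dimension n. Under these assumptions, many tiny per-pixel changes can add up to a large change in the output.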

Constructing Adversarial Examples

The Fast Gradient Sign Method

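  • idea: maximize the loss that a given input image causes in the network => calculate the gradient of the loss with respect to the input image and add its sign, scaled by epsilon, to the input image: x* = x + epsilon * sign(∇_x J(θ, x, y)), where J is the loss, θ are the model parameters, and y is the true label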

  • the sign function here enforces the epsilon-constraint (the perturbation's max norm must be ≤ epsilon)
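
A minimal sketch of the method, assuming a logistic regression model so that the input gradient can be written by hand. The weights `w`, bias `b`, input `x`, and the helper names are made up for illustration; a deep network would obtain the gradient from an autodiff framework instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def predict_proba(w, b, x):
    """Probability of class 1 under logistic regression."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def predict(w, b, x):
    return 1 if predict_proba(w, b, x) > 0.5 else 0

def input_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.
    For logistic regression with y in {0, 1} this is (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def fgsm(w, b, x, y, eps):
    """x* = x + eps * sign(grad_x J): each component moves by at most eps."""
    grad = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical model and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.5, -0.2, 0.1], 1

x_adv = fgsm(w, b, x, y, eps=0.4)
print(predict(w, b, x), predict(w, b, x_adv))
```

In this toy setting the clean input is classified as class 1 and the perturbed input as class 0, although no component changed by more than eps.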

Maps of Adversarial and Random Cross Sections

  • legend on the left: the FGSM direction runs left to right and a random orthogonal direction runs top to bottom (each scaled from -eps to +eps)

  • on the right: the resulting 2D classification maps for different CIFAR-10 images (colors mean an incorrect class, white means the correct class)

  • observations:

    • in most of the images, half of the map is classified correctly with a near-linear boundary
    • FGSM identifies a direction such that any perturbation with a large dot product with this direction yields an adversarial example
    • adversarial examples live in linear subspaces (they are not isolated points in the input space) => nearby images are also adversarial examples
  • how many dimensions do these adversarial subspaces have?

    • on average: 25 (on MNIST where you have 28^2 = 784 total input dimensions)
    • this tells you how likely you are to find an adversarial example from random noise
    • also, the larger the subspaces for two models, the more likely it is that they intersect => transferable examples
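
Such a cross-section map can be sketched for a toy model. The example below assumes a binary linear classifier with hypothetical weights `w` and input `x` instead of a CIFAR-10 network, and renders the map as text: columns follow the FGSM direction, rows follow a random orthogonal direction.

```python
import random

def sign(v):
    return (v > 0) - (v < 0)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def orthogonal_direction(d, rng):
    """Random direction orthogonal to d, rescaled to max norm 1."""
    r = [rng.gauss(0.0, 1.0) for _ in d]
    scale = dot(r, d) / dot(d, d)
    r = [ri - scale * di for ri, di in zip(r, d)]  # remove the component along d
    biggest = max(abs(ri) for ri in r)
    return [ri / biggest for ri in r]

def cross_section(w, x, y, eps, steps=5, seed=0):
    """Classify x + a * fgsm_dir + b * rand_dir for a, b in [-eps, eps].
    '.' marks a correct prediction, '#' an incorrect one."""
    # For the loss -y * (w . x), the sign of the input gradient is -y * sign(w).
    fgsm_dir = [-y * sign(wi) for wi in w]
    rand_dir = orthogonal_direction(fgsm_dir, random.Random(seed))
    offsets = [-eps + 2 * eps * i / (steps - 1) for i in range(steps)]
    rows = []
    for b in offsets:          # top to bottom: random orthogonal direction
        row = ""
        for a in offsets:      # left to right: FGSM direction
            point = [xi + a * fi + b * ri
                     for xi, fi, ri in zip(x, fgsm_dir, rand_dir)]
            predicted = 1 if dot(w, point) > 0 else -1
            row += "." if predicted == y else "#"
        rows.append(row)
    return rows

w = [1.0, -2.0, 0.5, 1.5]   # hypothetical linear classifier
x = [0.2, -0.1, 0.3, 0.1]   # clean input, classified as +1
rows = cross_section(w, x, y=1, eps=0.25)
for row in rows:
    print(row)
```

The middle row, where the random direction contributes nothing, stays correct until the far end of the FGSM direction, where the prediction flips.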

The Idea of Clever Hans

  • intuition: the model learns some distribution of training examples that seem "natural"
  • with an adversarial example, one leaves this "natural" distribution which the network can't handle

Good Defense: RBFs

  • when using the FGSM attack on RBF networks (quadratic rather than near-linear models), you actually transform the image into another class
    • ==> not technically an adversarial example
  • however: RBF networks have very poor classification performance

Adversarial Attacks

Black Box Attacks

  • basis: cross-model and cross-dataset transferability
  • idea: the attacker wants to fool a network that they have no information about (architecture, type, dataset, ...)
  • attack:
    1. train your own model that mimics the target model
    2. create adversarial examples for your own model
    3. deploy the adversarial examples against the target
  • in practice, about 70% of examples transfer cross-dataset
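
The three attack steps can be sketched on a toy problem. Everything here is an assumption made for illustration: the target is a hidden linear classifier that can only be queried, and the substitute is a perceptron trained on the queried labels. Real attacks use far larger models.

```python
def sign(v):
    return (v > 0) - (v < 0)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def target_predict(x):
    """Stand-in for the black-box target: the attacker may only call it."""
    hidden_w = [2.0, -1.0]  # not visible to the attacker
    return 1 if dot(hidden_w, x) > 0 else -1

def train_substitute(inputs, labels, max_epochs=1000):
    """Perceptron without bias, trained until it reproduces all queried labels."""
    w = [0.0] * len(inputs[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(inputs, labels):
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Step 1: query the target on a grid of inputs and train the substitute.
grid = [-0.75, -0.25, 0.25, 0.75]
queries = [[a, b] for a in grid for b in grid]
labels = [target_predict(q) for q in queries]
w_sub = train_substitute(queries, labels)

# Step 2: craft an adversarial example against the substitute only.
# For the loss -y * (w . x), the sign of the input gradient is -y * sign(w).
x = [0.3, 0.2]
y = target_predict(x)
eps = 0.5
x_adv = [xi - y * eps * sign(wi) for xi, wi in zip(x, w_sub)]

# Step 3: deploy it against the target.
print(y, target_predict(x_adv))
```

The example transfers because the substitute, having matched the target's labels, ends up with weights of the same sign as the hidden ones.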

Enhancing Transfer with Ensembles

  • idea: use an ensemble of many different models in order to create adversarial examples (Liu et al. 2016)
  • => probability is almost 100% that the attack will be successful on another (target) model


Defenses are very Hard

  • many failed attempts
  • regularization alone does not do the trick
  • even using a generative model is insufficient

Adversarial Training

  • neural nets can represent any function, but maximum likelihood training does not cause them to learn the right one
  • idea: train on adversarial examples
    • works quite well against FGSM attacks, but not against other, iterative attacks
    • interesting effect: training on adversarial examples also improves performance on the classification task itself (it can be seen as a kind of regularization)
  • these adversarially trained networks have the best empirical success rate
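
One common way to write training on FGSM examples as an objective is sketched below. The mixing weight alpha is an assumption introduced here (it does not appear in these notes); J is the usual training loss, θ the parameters, and epsilon the FGSM step size.

```latex
\tilde{J}(\theta, x, y) = \alpha \, J(\theta, x, y)
  + (1 - \alpha) \, J\bigl(\theta,\; x + \epsilon \, \operatorname{sign}(\nabla_x J(\theta, x, y)),\; y\bigr)
```

The second term is the loss on the FGSM example, regenerated from the current parameters at every step, so the model is always trained against its own latest weaknesses.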

Virtual Adversarial Training

  • use unlabeled data: take the model's own guess for an image as its label, then create an adversarial perturbation intended to change that guess
  • idea of semi-supervised training: use labeled and unlabeled data
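
The idea can be sketched as a loss term that needs no label. The notation is an assumption made here, not taken from the lecture: p(· | x, θ) is the model's predicted class distribution, θ̂ denotes the current parameters treated as constants, and r is a perturbation bounded by epsilon in some norm.

```latex
\begin{aligned}
r_{\mathrm{adv}} &= \arg\max_{\lVert r \rVert \le \epsilon}
  \mathrm{KL}\bigl(p(\cdot \mid x, \hat{\theta}) \,\Vert\, p(\cdot \mid x + r, \hat{\theta})\bigr) \\
L_{\mathrm{vat}}(x, \theta) &=
  \mathrm{KL}\bigl(p(\cdot \mid x, \hat{\theta}) \,\Vert\, p(\cdot \mid x + r_{\mathrm{adv}}, \theta)\bigr)
\end{aligned}
```

Minimizing this term on unlabeled images pushes the model to keep its prediction stable under the perturbation that would change it most.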

Why is Solving the Adversarial Problem so Interesting?

  • on the one hand, a solution would prevent attackers from causing the network to make a wrong decision
  • but on the other hand, one could use it to design molecules, fast cars, new circuits, etc.
    • why? because then the attacks do not create adversarial examples that trick the network, but instead apply a perturbation to the input that "makes sense"
    • example: use the blueprint of a car as input, train network to guess the car's speed, apply perturbation to the blueprint that increases the speed the network assigns to the blueprint => get a blueprint of a fast car instead of an adversarial example