How easy is it to fool production computer vision systems? In a previous post we looked at naturally confusing examples such as chihuahuas and muffins and found that Google’s Cloud Vision API was not easily fooled. This post looks at another approach — adversarial perturbations — to see how well they perform against this system.
The best starting points for understanding adversarial examples are this post by Andrej Karpathy and this post by Julia Evans. At a high level, the process involves adding noise to an image so that a neural network will misclassify it. Rather than add random noise (which could make the image unrecognizable to humans as well), this research exploits the gradients of the neural network to perturb the image as little as possible while still achieving a misclassification. This line of work is related to, but distinct from, research on generative adversarial networks.
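The gradient-based idea can be sketched with the fast gradient sign method. The snippet below is a minimal illustration, not the method from the paper: it uses a toy linear "network" (so the gradient of the score with respect to the input has a closed form) and all names and values are illustrative.

```python
import numpy as np

# Toy linear classifier: score = w . x. For a linear model, the gradient
# of the score with respect to the input is simply w.
rng = np.random.default_rng(0)
w = rng.normal(size=100)   # stand-in for network weights
x = rng.normal(size=100)   # stand-in for a flattened image
grad = w

eps = 0.1  # perturbation budget: max change per "pixel" (L-infinity norm)

# Step each pixel by eps in the direction that lowers the true-class score.
# A small, structured step like this can flip the prediction while the
# image still looks unchanged to a human.
x_adv = x - eps * np.sign(grad)

print(np.dot(w, x), np.dot(w, x_adv))  # adversarial score is strictly lower
```

The key point the sketch captures: the noise is tiny per pixel (bounded by `eps`), but because every pixel moves in the worst-case direction for the classifier, the effect on the score is large.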
One initial question about these adversarial perturbations is how well they generalize. Can a perturbation computed against one classifier also fool a different classifier? That’s the question that Seyed-Mohsen Moosavi-Dezfooli and his coauthors take up in their recent paper, Universal adversarial perturbations. They find “universal” perturbations (meaning that they work across many different images) for six popular image classification networks, and show that these perturbations generalize across networks reasonably well.
For my experiment, I used twelve images from Moosavi-Dezfooli et al. (available in their arXiv supplementary material) and the six pre-computed adversarial perturbations available on GitHub. I perturbed each of the twelve images with each of the six perturbations. Then, I labeled both the original images and their perturbed versions via the Google Cloud Vision API. Code for this experiment is available on GitHub.
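Applying a pre-computed perturbation is simple elementwise addition followed by clipping back to the valid pixel range. This is a sketch under assumed conventions (uint8 RGB image, float perturbation of the same shape); the tiny synthetic arrays stand in for a real image and a downloaded perturbation.

```python
import numpy as np

def apply_perturbation(image, perturbation):
    """Add a universal perturbation to an image, clipping to valid pixels.

    image:        uint8 array of shape (H, W, 3)
    perturbation: float array of the same shape, small relative to 255
    """
    perturbed = image.astype(np.float64) + perturbation
    # Without clipping, bright pixels could overflow the uint8 range.
    return np.clip(perturbed, 0, 255).astype(np.uint8)

# Synthetic stand-ins: a near-white "image" plus a uniform perturbation.
image = np.full((2, 2, 3), 250, dtype=np.uint8)
pert = np.full((2, 2, 3), 10.0)
out = apply_perturbation(image, pert)
print(out[0, 0])  # clipped at 255
```

In the real experiment, the loop is just this function applied over all twelve images crossed with all six perturbations, with each result sent to the labeling API.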
After labeling the images, I then computed how well the perturbations worked. In their paper, Moosavi-Dezfooli et al. focus on a measure that they call the “fooling rate.” In my experiment, I used two other metrics. The first is the IOU (intersection-over-union) of the true versus the perturbed labels. For example, if the true labels for an image are “dog” and “mammal” and the predicted labels on the perturbed image are “mammal” and “cat”, this would have an IOU score of 1/3 (one shared label out of three total). The second metric is a weighted IOU, in which the labels are weighted by the confidence measure returned by the classifier. The former metric can be considered “how much” we fooled the DNN. The latter metric accounts for uncertainty in the DNN’s predictions and gives a sense of “how well” we fooled it.
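The two metrics can be sketched as follows. The unweighted IOU follows directly from the text; the weighted variant is one plausible implementation of “labels weighted by confidence” (min confidence over the intersection, max over the union, with a missing label counting as zero), since the post does not pin down the exact formula.

```python
def iou(true_labels, perturbed_labels):
    """Unweighted intersection-over-union of two label sets."""
    true_labels, perturbed_labels = set(true_labels), set(perturbed_labels)
    union = true_labels | perturbed_labels
    if not union:
        return 0.0
    return len(true_labels & perturbed_labels) / len(union)

def weighted_iou(true_scores, perturbed_scores):
    """Confidence-weighted IOU (one plausible formulation).

    true_scores, perturbed_scores: dicts mapping label -> confidence.
    Each label contributes min(confidences) to the intersection and
    max(confidences) to the union; an absent label has confidence 0.
    """
    labels = set(true_scores) | set(perturbed_scores)
    inter = sum(min(true_scores.get(l, 0.0), perturbed_scores.get(l, 0.0))
                for l in labels)
    union = sum(max(true_scores.get(l, 0.0), perturbed_scores.get(l, 0.0))
                for l in labels)
    return inter / union if union else 0.0

# The example from the text: one shared label out of three total.
print(iou(["dog", "mammal"], ["mammal", "cat"]))  # 1/3
```

When every confidence is 1.0, the weighted version reduces to the unweighted one, which is a useful sanity check on the formulation.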
For example, for the first image in the data set, Google Cloud Vision gave it ten labels: “dog”, “mammal”, “vertebrate”, “dog breed”, “dog like mammal”, “dog breed group”, “tibetan terrier”, “schnoodle”, “havanese”, and “bouvier des flandres”. For the version modified with the VGG-19 perturbation, the resulting labels were “art”, “knitting”, “thread”, “textile”, “pattern”, “wool”, and “material”.
Here is what the original example, the perturbed version, and the pre-computed perturbation look like:
Which perturbations worked the best? For both the unweighted and weighted IOU scores, the ranking was the same: VGG-19 does the best at fooling the Cloud Vision API, followed by VGG-F, CaffeNet, VGG-16, GoogLeNet, and then ResNet. This suggests that Google’s Cloud Vision API shares some characteristics (architecturally or in training data) with VGG-19. I was surprised that the GoogLeNet perturbation was not the frontrunner, but perhaps Google’s production system differs from the published GoogLeNet architecture.
How robust are images to perturbations? Image 6 (a salamander in a jungle environment) was the hardest to perturb in a way that achieved misclassification (IOU=0.55, WIOU=0.55). Images 4 (porcupine), 5 (orca), and 11 (coffee maker) were the easiest to perturb, achieving IOU and WIOU scores of zero for all three.
What did the classifier “think” it was seeing in the perturbed images? Certain labels appeared again and again, most notably “art” and “textile”. This suggests that the perturbations fool the classifier in a particular way, by making the images resemble these other classes. Since “art” and “textile” are relatively heterogeneous classes full of varied texture and pattern, this makes sense. It also suggests that it might be more difficult to perturb images of artwork or textiles so that the classifier thinks they belong to another class.
This experiment demonstrates that even a high-quality production system is vulnerable to adversarial perturbations. It also supports the notion that adversarial examples generalize across images and classifiers.