To researchers’ surprise, deep learning vision algorithms often fail at classifying images because they mostly take cues from textures, not shapes.

Quoted from Quanta Magazine:
When you look at a photograph of a cat, chances are that you can recognize the pictured animal whether it’s ginger or striped — or whether the image is black and white, speckled, worn or faded. You can probably also spot the pet when it’s shown curled up behind a pillow or leaping onto a countertop in a blur of motion. You have naturally learned to identify a cat in almost any situation. In contrast, machine vision systems powered by deep neural networks can sometimes outperform humans at recognizing a cat under fixed conditions, but images that are even a little novel, noisy or grainy can throw off those systems completely.

A research team in Germany has now discovered an unexpected reason why: While humans pay attention to the shapes of pictured objects, deep learning computer vision algorithms routinely latch on to the objects’ textures instead.

This finding, presented at the International Conference on Learning Representations in May, highlights the sharp contrast between how humans and machines “think,” and illustrates how misleading our intuitions can be about what makes artificial intelligences tick. It may also hint at why our own vision evolved the way it did.

Cats With Elephant Skin and Planes Made of Clocks

A deep learning algorithm works by presenting a neural network with, say, thousands of images that either do or do not contain cats. The system finds patterns in that data, which it then uses to decide how best to label an image it has never seen before. The network’s architecture is modeled loosely on that of the human visual system, in that its connected layers let it extract increasingly abstract features from the image. But the system makes the associations that lead it to the right answer through a black-box process that humans can only try to interpret after the fact. “We’ve been trying to figure out what leads to the success of these deep learning computer vision algorithms, and what leads to their brittleness,” said Thomas Dietterich, a computer scientist at Oregon State University who was not involved in the new study.
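The train-on-labeled-examples, label-the-unseen loop described above can be sketched with a deliberately tiny toy: a single-layer logistic model fit by gradient descent on made-up "images" whose two classes differ in a simple statistical pattern. Everything here — the data, the pattern, the model — is a hypothetical illustration, not the architecture or data used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for "cat" and "no cat" images: flattened 8x8 arrays whose
# mean brightness differs between the classes (a made-up pattern).
n = 200
cats = rng.normal(0.7, 0.1, size=(n, 64))      # label 1
no_cats = rng.normal(0.3, 0.1, size=(n, 64))   # label 0
X = np.vstack([cats, no_cats])
y = np.concatenate([np.ones(n), np.zeros(n)])

# A single-layer logistic model trained by gradient descent: the system
# "finds patterns" in labeled data by minimizing its classification error.
w = np.zeros(64)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # predicted probabilities
    w -= 0.5 * (X.T @ (p - y) / len(y))
    b -= 0.5 * np.mean(p - y)

# Label an image the model has never seen before.
new_image = rng.normal(0.7, 0.1, size=64)      # drawn like a "cat"
prob_cat = 1.0 / (1.0 + np.exp(-(new_image @ w + b)))
print(f"P(cat) = {prob_cat:.2f}")
```

A real deep network stacks many such layers, which is what lets it extract increasingly abstract features — and what makes its final associations so hard to interpret.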

To do that, some researchers prefer to look at what happens when they trick the network by modifying an image. They have found that very small changes can cause the system to mislabel objects in an image completely — and that large changes can sometimes fail to make the system modify its label at all. Meanwhile, other experts have backtracked through networks to analyze what the individual “neurons” respond to in an image, generating an “activation atlas” of features that the system has learned.
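Both failure modes above can be seen in miniature with a linear classifier, where the geometry is transparent: a tiny perturbation aligned with the model's weights flips the label (the fast-gradient-sign idea, specialized to the linear case), while a far larger change in a direction the model ignores leaves the label untouched. The classifier and "image" here are toys of my construction, not the systems from the research described.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical already-trained linear classifier: label = sign(w . x).
d = 100
w = rng.normal(size=d)
w /= np.linalg.norm(w)

# An "image" the model labels positive, but not by a large margin.
off_axis = rng.normal(size=d)
off_axis -= (off_axis @ w) * w         # component orthogonal to w
x = 0.3 * w + off_axis

label = np.sign(w @ x)                 # +1

# A tiny change, aimed at the weights: each "pixel" moves by at most 0.05,
# yet the label flips.
tiny = -0.05 * np.sign(w)
flipped = np.sign(w @ (x + tiny))

# A much larger change in a direction orthogonal to w is invisible to the
# model: the label stays the same.
big_dir = rng.normal(size=d)
big_dir -= (big_dir @ w) * w
big = 5.0 * big_dir / np.linalg.norm(big_dir)
unchanged = np.sign(w @ (x + big))

print(np.linalg.norm(tiny), label, flipped)    # small change, label flips
print(np.linalg.norm(big), label, unchanged)   # large change, label stays
```

Deep networks are not linear, but locally they behave much like this: what matters is not the size of a change but its alignment with the directions the model is sensitive to.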

But a group of scientists in the laboratories of the computational neuroscientist Matthias Bethge and the psychophysicist Felix Wichmann at the University of Tübingen in Germany took a more qualitative approach. Last year, the team reported that when they trained a neural network on images degraded by a particular kind of noise, it got better than humans at classifying new images that had been subjected to the same type of distortion. But those images, when altered in a slightly different way, completely duped the network, even though the new distortion looked practically the same as the old one to humans.

To explain that result, the researchers thought about which property of an image changes most when even small amounts of noise are added. Texture seemed the obvious choice. “The shape of the object … is more or less intact if you add a lot of noise for a long time,” said Robert Geirhos, a graduate student in Bethge’s and Wichmann’s labs and the lead author of the study. But “the local structure in an image — that gets distorted super fast when you add a bit of noise.” So they came up with a clever way to test how both humans and deep learning systems process images.
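Geirhos's observation is easy to check numerically on a toy image: under heavy pixel noise, a coarse (block-averaged) view of the image stays highly correlated with the clean version, while the fine pixel-to-pixel structure is largely wiped out. The image, the noise level, and the two correlation measures below are illustrative choices of mine, not the stimuli or metrics from the study.

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy 64x64 "image": a bright central square (the global shape)
# carrying a fine checkerboard pattern (the local texture).
yy, xx = np.mgrid[0:64, 0:64]
square = (np.abs(yy - 32) < 16) & (np.abs(xx - 32) < 16)
img = np.where(square, 1.0 + 0.5 * ((yy + xx) % 2), 0.0)

noisy = img + rng.normal(0.0, 1.0, img.shape)   # heavy pixel noise

def corr(a, b):
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

def block_mean(a, k=8):
    # Coarse view: average over k x k blocks (keeps shape, drops texture).
    return a.reshape(64 // k, k, 64 // k, k).mean(axis=(1, 3))

# The coarse structure (shape) survives the noise almost intact...
shape_corr = corr(block_mean(img), block_mean(noisy))

# ...while the fine structure (texture, here pixel-to-pixel differences)
# is largely destroyed.
texture_corr = corr(np.diff(img, axis=1), np.diff(noisy, axis=1))

print(f"shape correlation:   {shape_corr:.2f}")
print(f"texture correlation: {texture_corr:.2f}")
```

Averaging over blocks shrinks the noise while leaving the square's outline in place, which is why the shape signal survives; the checkerboard, living at the single-pixel scale, gets no such protection.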