Are CNNs biased towards texture rather than object shape?
01 Dec 2020 | CNNs, Adversarial examples, Robustness, Explainability
Overview of the paper ‘ImageNet-trained CNNs are biased towards texture’
With the introduction of CNNs for classification tasks more than a decade ago, on the few datasets available at the time, the field of deep learning and computer vision has come a long way. We have seen many exciting applications: detection, segmentation, and 3D perception used by self-driving cars to understand their environment, tumor detection in medical imaging, and GAN-based image inpainting, colorization, and style transfer.

Although we are seeing so many exciting research papers advancing CNN architectures and their application domains, we still have little to no understanding of why these systems decide as they do. That is why we consider them ‘black boxes’: we don’t know the ‘reasoning’ behind a particular decision. Such behavior cannot be overlooked just because the models score high on predefined metrics. For example, the Gender Shades project shows that various face recognition systems perform worse on minority groups (an accuracy difference of up to 34% between lighter-skinned males and darker-skinned females). If such systems are used for law enforcement, airport, or employment screening, this bias can have major repercussions. This highlights the importance of ‘explainability’ in computer vision systems.
‘Adversarial attacks’ demonstrate one such counter-intuitive behavior of CNNs. These examples are specially devised to fool a CNN into predicting the wrong label, just by perturbing the image with noise that is indistinguishable to the human eye.

The above example demonstrates one such attack, known as the FGSM (Fast Gradient Sign Method) attack. It is a ‘white box’ adversarial attack because we have access to the CNN’s weights. When training a CNN, we minimize the loss towards the ground-truth label given the input image. In the FGSM attack, on the contrary, we use the gradients to compute a perturbation to add to the input that maximizes the loss. As a result, we get a distortion unrecognizable to the human eye, while the CNN predicts the wrong label. There exist various other methods, such as ‘black box’ attacks, where we don’t have any access to the model parameters.
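To make the idea concrete, here is a minimal FGSM sketch in PyTorch. It is not the implementation from any particular paper: the pretrained classifier `model`, the preprocessed tensors `image` and `label`, the step size `epsilon`, and the assumption that pixel values live in [0, 1] are all placeholders.

```python
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.007):
    """Return an adversarial version of `image` via the Fast Gradient Sign Method."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)   # loss towards the true label
    model.zero_grad()
    loss.backward()
    # Step in the direction that *increases* the loss, scaled by epsilon.
    perturbed = image + epsilon * image.grad.sign()
    # Keep pixel values in a valid range (assumed here to be [0, 1]).
    return perturbed.clamp(0, 1).detach()
```

The only difference from an ordinary training step is that the gradient is taken with respect to the input rather than the weights, and we move in the sign of that gradient to maximize the loss.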
One such behavior is captured by the paper ‘ImageNet-trained CNNs are biased towards texture’. Let’s dive in…
Previous indications of Texture Bias:
By maximizing the activation values of neurons at different levels of a CNN, we can visualize the input that activates a particular neuron the most. Based on these visualizations, it is a widely accepted intuition that a CNN combines low-level features (lines, edges) and hierarchically learns more complex features (wheels, faces, tree trunks). Put differently, the predictions at the last layer should depend more on the global shape of the object than on the local texture. But there are some contradictory findings, listed below (a minimal sketch of the activation-maximization idea follows the list).
- “CNNs can still classify texturized images perfectly well, even if the global shape structure is completely destroyed” (Gatys et al., 2017; Brendel & Bethge, 2019).
- “Standard CNNs are bad at recognizing object sketches where object shapes are preserved yet all texture cues are missing” (Ballester & de Araujo, 2016).
- “Gatys et al. (2015) discovered that a linear classifier on top of a CNN’s texture representation (Gram matrix) achieves hardly any classification performance loss compared to original network performance”.
- “Brendel & Bethge (2019) demonstrated that CNNs with explicitly constrained receptive field sizes throughout all layers are able to reach surprisingly high accuracies on ImageNet, even though this effectively limits a model to recognizing small local patches rather than integrating object parts for shape recognition.”
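As promised above, here is a hedged sketch of the activation-maximization visualization: gradient ascent on the input image to maximize the mean activation of one channel in an intermediate layer. The choice of ResNet-50, the hooked layer, the channel index, and the optimization hyperparameters are illustrative assumptions, and real feature-visualization work adds regularizers (jitter, blurring, frequency penalties) that are omitted here.

```python
import torch
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)          # we optimize the input, not the weights

activations = {}
def hook(_module, _inputs, output):
    activations["feat"] = output

# Hook an intermediate stage; which layer/channel to inspect is an arbitrary choice.
model.layer3.register_forward_hook(hook)

img = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from random noise
optimizer = torch.optim.Adam([img], lr=0.05)
channel = 42                                            # channel to visualize (arbitrary)

for _ in range(200):
    optimizer.zero_grad()
    model(img)
    # Gradient *ascent* on the mean activation of the chosen channel.
    loss = -activations["feat"][0, channel].mean()
    loss.backward()
    optimizer.step()
# `img` now (roughly) shows the pattern this channel responds to most strongly.
```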
So the paper pins down a particular behavior of CNNs that explains the above observations. It can be summed up in the following image.

Here we can clearly see that even though the global shape of the cat is preserved, all the top predictions are driven by the texture, i.e., elephant skin.
Psychophysical Experiment:
To study this behavior extensively, the authors conducted psychophysical experiments comparing CNNs with human observers. To understand the bias, we first have to disentangle the shape and texture information and see which one the subjects rely on. This is done by swapping out the original texture information through various means, which gives rise to the datasets used in the psychophysical experiment.
Datasets:
- Original: 160 natural color images of objects on a white background (to avoid any information coming from the background).
- Greyscale: Images converted to greyscale; for CNNs, the single channel is stacked across 3 channels.
- Silhouette: Black silhouette of the object on a white background, similar to a semantic segmentation map.
- Edges: Edge-based representation produced with the Canny edge detector (a sketch of generating these variants follows the list).
- Texture: 48 natural texture images (e.g., elephant skin) or images consisting of many repetitions of the same object.
- Cue Conflict: Images generated with a style transfer algorithm, using an original image as content and a texture image as style.
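Here is a hedged sketch of how the greyscale, edge, and (crudely) silhouette variants might be produced with scikit-image. It illustrates the idea rather than the authors' exact pipeline; the file path, the Canny `sigma`, and the whiteness threshold for the silhouette are assumptions.

```python
import numpy as np
from skimage import io, color, feature

img = io.imread("cat.png")[..., :3]      # placeholder path; drop alpha channel if present
grey = color.rgb2gray(img)               # greyscale variant, values in [0, 1]
grey3 = np.stack([grey] * 3, axis=-1)    # stacked to 3 channels as CNN input
edges = feature.canny(grey, sigma=2.0)   # boolean edge map via the Canny detector
# Crude silhouette: anything that is not (nearly) white is treated as the object.
silhouette = np.where(grey < 0.99, 0.0, 1.0)
```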

Here, greyscale, silhouette, edges, and cue conflict form the experimental setup. The predictions on these images can be either shape-based or texture-based, and using multiple ways to swap out the texture ensures that the results are not due to one particular source texture. As for the labels, the authors map ImageNet onto 16 coarse classes via the WordNet hierarchy. They evaluate four CNNs trained on ImageNet, namely VGG-16, GoogLeNet, AlexNet, and ResNet-50. Human participants had to choose one of the 16 labels for each image shown.
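The sketch below shows how one of the evaluated architectures can be loaded from torchvision and its 1000-way prediction collapsed into coarse categories. The mapping dictionary is a tiny, hypothetical fragment for illustration only; the paper derives the full 16-class mapping from the WordNet hierarchy.

```python
import torch
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1").eval()

# Illustrative (partial) mapping: coarse label -> ImageNet class indices.
coarse_classes = {
    "cat": [281, 282, 283, 284, 285],   # tabby, tiger cat, Persian, Siamese, Egyptian cat
    "elephant": [385, 386],             # Indian elephant, African elephant
    # ... 14 more categories in the actual experiment
}

def coarse_prediction(image_batch):
    """Return the coarse label whose member classes get the highest total probability."""
    with torch.no_grad():
        probs = torch.softmax(model(image_batch), dim=1)
    names = list(coarse_classes)
    scores = torch.stack([probs[:, coarse_classes[n]].sum(dim=1) for n in names], dim=1)
    return [names[i] for i in scores.argmax(dim=1)]
```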
Solution: Stylized ImageNet
Now that the texture bias is confirmed, the next step is to nudge CNNs towards the shape bias exhibited by their human counterparts. The authors explain that the ImageNet task itself does not require CNNs to learn shape-based representations; integrating local features works well enough to maximize accuracy. Based on this hypothesis, the authors propose a novel dataset as a solution: Stylized ImageNet (SIN). The goal is to strip the texture from the original image and replace it with a randomized style, so that texture is no longer a reliable cue. SIN serves both as training data for more robust models and as a test of robustness that is difficult to solve using texture cues alone.
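Stylized ImageNet is built with the fast AdaIN style transfer of Huang & Belongie (2017), applying the style of a randomly chosen painting to each ImageNet image. Below is a minimal sketch of the core AdaIN operation only; the VGG encoder, the trained decoder, and the painting dataset are omitted, and `content_feat`/`style_feat` are assumed to be feature maps produced by such an encoder.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Re-normalize content features to carry the style's channel-wise statistics."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    # Strip the content's own "texture statistics" and impose the style's.
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

Because the texture statistics are replaced by those of a random painting, the object's class can no longer be read off from local texture, which is exactly what forces a shape-based solution.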
The benefits of these SIN-trained models and the robustness test for the traditional models are discussed below.
Results:
The authors present the results of the paper in three parts.
- Shape vs. texture bias in CNNs and humans: As explained in the psychophysical experiment section, the authors tested whether the CNNs predict labels based on shape information or on texture information, and ran the same experiment with human participants. The results show a strong texture bias in the models, whereas humans demonstrate a strong shape bias.
- Overcoming the texture bias using Stylized ImageNet: The authors show that training on Stylized ImageNet shifts the models towards shape bias.
- Robustness of models trained on Stylized ImageNet: The paper tests SIN-trained versus IN-trained models under various distortions (uniform noise, high-pass and low-pass filtering, contrast changes, etc.), on which the SIN models outperform their counterparts. On ImageNet-C, a standard corruption-robustness benchmark, SIN-trained models also show lower corruption errors than their vanilla counterparts, further supporting the hypothesis that a more shape-oriented representation helps robustness. (A toy sketch of such a distortion test follows.)
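As a rough illustration of this kind of evaluation (not the paper's protocol), the sketch below compares top-1 accuracy on clean images with accuracy under added uniform noise, one of the distortion types examined in the paper. The data `loader` yielding preprocessed `(image, label)` batches and the noise width are assumptions.

```python
import torch

def accuracy(model, loader, noise_width=0.0):
    """Top-1 accuracy, optionally after adding uniform noise in [-noise_width, noise_width]."""
    correct = total = 0
    model.eval()
    with torch.no_grad():
        for images, labels in loader:
            if noise_width > 0:
                noise = noise_width * (2 * torch.rand_like(images) - 1)
                images = (images + noise).clamp(0, 1)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Usage sketch: compare clean vs. distorted accuracy for an IN- or SIN-trained model.
# clean_acc = accuracy(model, loader)
# noisy_acc = accuracy(model, loader, noise_width=0.1)
```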




Conclusion:
This paper proposes texture bias as a way to explain scattered findings that could not be explained by our previous intuition of how CNNs work. Within the scope of the paper, the authors examine texture bias for ImageNet-trained models and find the behavior consistent across architectures. To nudge the models towards a shape bias, they present the novel Stylized ImageNet dataset and evaluate the resulting shift in bias, as well as the gains in robustness, on various distorted images.
Further Discussion:
Now, this leaves us with two questions…
Does this behavior occur only for ImageNet-trained models, i.e., is it a dataset property?
As explained above, solving ImageNet does not require models to learn shape-based representations. A similar line of thought is discussed in another paper, which dives deeper into explaining adversarial examples and answers this question along with others, such as adversarial transferability.
If our previous understanding is wrong, what causes such behavior in CNNs?
This falls under the research paradigm of explainability in CNNs, which is an active research field. To learn more about this topic, you can check out this and this blog.