Keywords: adversarial robustness, deep learning, representation learning, computer vision, neuroscience, human vision
Abstract: Humans have been shown to use a ''straightened'' encoding to represent the natural visual world as it evolves in time (H\'enaff et al.~2019). In the context of discrete video sequences, ''straightened'' means that changes between frames follow a more linear path in representation space at progressively deeper levels of processing. While deep convolutional networks are often proposed as models of human visual processing, many do not straighten natural videos. In this paper, we explore the relationship between robustness, biologically-inspired filtering mechanisms, and representational straightness in neural networks in response to time-varying input, and identify curvature as a useful way of evaluating neural network representations. We find that $(1)$ adversarial training leads to straighter representations in both CNN and transformer-based architectures and $(2)$ biologically-inspired elements increase straightness in the early stages of a network, but do not guarantee increased straightness in downstream layers of CNNs. Our results suggest that constraints like adversarial robustness bring computer vision models closer to human vision, but when incorporating biological mechanisms such as V1 filtering, additional modifications are needed to more fully align human and machine representations.
Supplementary Material: zip