Keywords: CNN, visual cortex, encoding models, fMRI
TL;DR: We use a joint model, fit over multiple feature spaces by means of banded ridge regression, to refine the mapping between CNN layers and the visual cortex.
Abstract: There is increasing interest in understanding similarities and differences between convolutional neural networks (CNNs) and the visual cortex. A common approach is to use a specific layer of a pre-trained CNN as a source of features to predict brain activity recorded during a visual task. Associating each brain region with its best-predicting CNN layer reveals a gradual change over the visual cortex. However, this winner-take-all mapping is not robust, because consecutive CNN layers are strongly correlated and have similar prediction accuracies. Moreover, this mapping is usually performed on static stimuli, which ignores the temporal component of human vision. When the mapping is performed on video stimuli, the features are extracted frame by frame and downsampled using an anti-aliasing low-pass filter, which removes high temporal frequencies that could be informative. To address the first issue, we propose to fit a joint model on all layers simultaneously. The model is fit with banded ridge regression, where a separate regularization hyperparameter is learned for each layer. By performing a selection over layers, this model effectively removes non-predictive or redundant layers and disentangles the contributions of each layer. We show that using a joint model increases prediction accuracy and leads to finer mappings from CNN layers to the visual cortex. To address the second issue and preserve more high-frequency information, we propose to filter the features with a set of band-pass filters. We show that using the envelopes of the filtered signals as additional features further increases prediction accuracy.
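
As a minimal sketch of the banded ridge step described in the abstract, the NumPy code below solves a joint model in closed form with one penalty per feature space (here, one per layer) and picks the penalties with a small grid search on a held-out split. The abstract does not say how the hyperparameters are learned, so the grid search, the synthetic data, and all function names (`banded_ridge`, `predict`, `r2_score`, `select_alphas`) are illustrative assumptions, not the authors' implementation.

```python
import itertools
import numpy as np


def banded_ridge(Xs, Y, alphas):
    """Closed-form banded ridge regression.

    Minimizes ||Y - sum_i X_i B_i||^2 + sum_i alphas[i] * ||B_i||^2,
    where Xs[i] has shape (n_samples, n_features_i) and
    Y has shape (n_samples, n_targets). Returns one weight matrix per band.
    """
    X = np.hstack(Xs)
    sizes = [X_i.shape[1] for X_i in Xs]
    # One penalty value per column, constant within each band.
    penalty = np.repeat(np.asarray(alphas, dtype=float), sizes)
    B = np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ Y)
    # Split the stacked weights back into one block per band.
    return np.split(B, np.cumsum(sizes)[:-1], axis=0)


def predict(Xs, Bs):
    """Joint prediction: sum of the per-band contributions."""
    return sum(X_i @ B_i for X_i, B_i in zip(Xs, Bs))


def r2_score(Y_true, Y_pred):
    """Coefficient of determination, computed separately for each target."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


def select_alphas(Xs_train, Y_train, Xs_val, Y_val, grid):
    """Exhaustive grid search over per-band penalties (assumed strategy)."""
    best_score, best_alphas = -np.inf, None
    for alphas in itertools.product(grid, repeat=len(Xs_train)):
        Bs = banded_ridge(Xs_train, Y_train, alphas)
        score = r2_score(Y_val, predict(Xs_val, Bs)).mean()
        if score > best_score:
            best_score, best_alphas = score, alphas
    return best_alphas, best_score


# Synthetic stand-in for two CNN layers and five voxels (hypothetical data).
rng = np.random.default_rng(0)
n_train, n_val = 200, 100
X1 = rng.standard_normal((n_train + n_val, 20))  # "layer 1": predictive
X2 = rng.standard_normal((n_train + n_val, 30))  # "layer 2": pure noise
W1 = rng.standard_normal((20, 5))
Y = X1 @ W1 + 0.5 * rng.standard_normal((n_train + n_val, 5))

train, val = slice(0, n_train), slice(n_train, None)
best_alphas, best_score = select_alphas(
    [X1[train], X2[train]], Y[train],
    [X1[val], X2[val]], Y[val],
    grid=[1e-1, 1e1, 1e3, 1e5],
)
print("selected penalties per band:", best_alphas)
print("mean validation R^2:", best_score)
```

In this toy setup only the first feature space carries signal, so a large penalty on the second band shrinks its weights toward zero, which mirrors the layer-selection effect described above. An exhaustive grid grows exponentially with the number of layers, so a real analysis over many layers would need a more scalable hyperparameter search.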