Model manifold analysis suggests the human visual brain is less like an optimal classifier and more like a feature bank

Colin Conwell; Michael Bonner

Published: 23 Sept 2025, Last Modified: 06 Dec 2025 | DBM 2025 Findings Poster | CC BY 4.0

TL;DR: Using manifold geometry metrics to predict the downstream visual brain-alignment of diverse DNN models, we find optimal classifiability a less powerful motif than few-shot learnability.

Abstract: What do deep neural network (DNN) models tell us about the computational principles of visual information-processing in the biological brain? A now-common finding in visual neuroscience is that many different kinds of DNN models -- each with different architectures, tasks, and training diets -- are all comparably performant predictors of image-evoked brain activity in the ventral visual cortex. This relative parity of highly diverse models may at first seem to undermine the common intuition that we can use these models to infer the computational principles that govern the visual brain. In this work, we show to the contrary that comparable brain-predictivity does not preclude the differentiation of these same models in terms of the underlying manifold geometries that define them. To do this, we assess 12 manifold geometry metrics computed across a diverse set of 117 DNN models, curated to include multiple tasks, architectures, and input diets. We then use these metrics to predict how well each model aligns with occipitotemporal cortex (OTC) activity from the human fMRI Natural Scenes Dataset. We find that manifold signal-to-noise ratio (a metric previously associated with few-shot learning) is a robust predictor of downstream brain-alignment, outperforming both other manifold geometry metrics (e.g. manifold capacity) and downstream task-performance (e.g. top-k recognition accuracy) across multiple different image sets (e.g. ImageNet-21K versus Places365) and controlled model comparisons (e.g. assessments across ImageNet-1K-trained architectural variants only). These results add to a growing body of evidence that the ventral visual stream serves as a basis set (or feature vocabulary) for object recognition rather than as the actual locus of recognition per se.
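
The analysis described in the abstract (relating per-model geometry metrics to per-model brain-alignment scores) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not specify the predictive model, so a Spearman rank correlation is used here as a simple stand-in, and all data are synthetic. The names `manifold_snr`, `manifold_capacity`, and `brain_score`, and the assumed relationship between them, are hypothetical and chosen only for illustration.

```python
import random
from math import sqrt

def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic stand-in data: one value per model (117 models, as in the abstract).
# ASSUMPTION: brain_score is constructed to depend on manifold_snr only,
# purely so the example has a visible effect; these are not real results.
rng = random.Random(0)
n_models = 117
manifold_snr = [rng.gauss(0, 1) for _ in range(n_models)]
manifold_capacity = [rng.gauss(0, 1) for _ in range(n_models)]
brain_score = [0.8 * s + rng.gauss(0, 0.5) for s in manifold_snr]

# Rank each metric by how strongly it tracks brain alignment across models.
metrics = {"manifold_snr": manifold_snr, "manifold_capacity": manifold_capacity}
rho = {name: spearman(vals, brain_score) for name, vals in metrics.items()}
for name, value in sorted(rho.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: rho = {value:+.2f}")
```

A rank correlation is used because it makes no assumption about the scale of each metric. With real data, each list would hold one measured value per model, and the comparison could be repeated within controlled subsets of models.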

Length: long paper (up to 8 pages)

Domain: methods

Submission Number: 67
