GramStatTexNet: Efficient, Interpretable, and Neuro-Inspired Spatiotemporal Texture

ICLR 2026 Conference Submission 21191 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: texture, synthesis, vision, neuroscience, gramian, gabor, filters, spatiotemporal, diffusion
TL;DR: A bio-inspired Gramian hybrid method for texture modeling and synthesis that produces efficient, high-quality spatial and spatiotemporal textures.
Abstract: The development of sophisticated texture modeling and synthesis techniques, combined with deep connections to models of human vision, has propelled advances in visual neuroscience, computer graphics, and beyond. Human peripheral vision is well modeled as local texture, scaled by distance from the center of gaze, and the most thoroughly human-validated models use biologically inspired filters and hand-curated statistic sets. Such models offer clear interpretability and a strong biological basis, but suffer from speed limitations and an inability to extend beyond the spatial domain. Conversely, deep learning methods such as style transfer and diffusion models generate high-quality results, but they are highly over-parameterized and sacrifice interpretability, biological plausibility, and fine-grained control. We introduce GramStatTexNet, an analysis-by-synthesis model combining the multi-scale Gabor filter structure of classical texture models with the power and flexibility of Gramian-based approaches. Our model generates texture syntheses of quality comparable to deep learning models while remaining interpretable, efficient, and biologically inspired. We create an organizational structure for our model statistics and leverage contrastive learning to identify the statistics most important for categorizing textures, showing that this ordering correlates with synthesis quality and identifying a further reduced statistic set that retains high-quality synthesis. We demonstrate the tiled application of our model to full images, aggregating statistics over spatially varying regions, an extension necessary for synthesizing foveated mongrels/metamers. In addition, we use our method to extend synthesis into the spatiotemporal domain with videos, paving the way for spatiotemporal models of peripheral vision. Finally, we explore incorporating our statistics into modern diffusion models via gradient guidance. Our work bridges the gap between interpretability and high performance in texture models, providing an efficient framework for modeling human visual perception across space, time, and gaze location.
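To make the core combination concrete, below is a minimal sketch of Gram-matrix statistics computed over a multi-scale Gabor filter bank, the pairing the abstract describes. This is not the authors' implementation: the function names (gabor_kernel, gram_statistics), filter parameters, and normalization are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def gabor_kernel(size, sigma, theta, freq):
    """Real Gabor filter: a sinusoid at orientation theta under an isotropic Gaussian envelope."""
    half = size // 2
    coords = torch.arange(-half, half + 1, dtype=torch.float32)
    y, x = torch.meshgrid(coords, coords, indexing="ij")
    x_rot = x * math.cos(theta) + y * math.sin(theta)
    envelope = torch.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * torch.cos(2 * math.pi * freq * x_rot)

def gram_statistics(image, kernels):
    """Filter the image with the bank, then correlate filter responses pairwise."""
    bank = torch.stack(kernels).unsqueeze(1)                    # (K, 1, h, w)
    resp = F.conv2d(image, bank, padding=bank.shape[-1] // 2)   # (1, K, H, W)
    feats = resp.flatten(2)                                     # (1, K, H*W)
    return feats @ feats.transpose(1, 2) / feats.shape[-1]      # (1, K, K) Gram matrix

# Illustrative bank: 2 scales x 4 orientations at a fixed spatial frequency.
kernels = [
    gabor_kernel(size=15, sigma=s, theta=t, freq=0.2)
    for s in (2.0, 4.0)
    for t in (0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4)
]
image = torch.rand(1, 1, 64, 64)  # stand-in grayscale texture patch
print(gram_statistics(image, kernels).shape)  # torch.Size([1, 8, 8])
```

In an analysis-by-synthesis loop, one would then optimize the pixels of an initialized image by gradient descent so that its Gram statistics match those of a target texture; the contrastive ordering described in the abstract would determine which entries of these matrices to retain.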
Supplementary Material: zip
Primary Area: applications to neuroscience & cognitive science
Submission Number: 21191