Towards flexible perception with visual memory

Published: 01 May 2025 · Last Modified: 13 Aug 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We build a simple visual memory for classification that scales to the billion-scale regime, enabling a number of capabilities like unlearning and attributing model decisions to datapoints.
Abstract: Training a neural network is a monolithic endeavor, akin to carving knowledge into stone: once the process is completed, editing the knowledge in a network is hard, since all information is distributed across the network's weights. Here we explore a simple, compelling alternative by marrying the representational power of deep neural networks with the flexibility of a database. Decomposing the task of image classification into image similarity (from a pre-trained embedding) and search (via fast nearest-neighbor retrieval from a knowledge database), we build on well-established components to construct a simple and flexible visual memory with the following key capabilities: (1) the ability to flexibly add data across scales, from individual samples all the way to entire classes and billion-scale data; (2) the ability to remove data through unlearning and memory pruning; (3) an interpretable decision mechanism on which we can intervene to control its behavior. Taken together, these capabilities comprehensively demonstrate the benefits of an explicit visual memory. We hope that it might contribute to a conversation on how knowledge should be represented in deep vision models, beyond carving it in "stone" weights.
Lay Summary: Neural networks are typically trained in a way that makes their knowledge rigid and hard to change. This is a problem because the real world is always changing, and models need to be updated with new information or have old information removed. We use a method that combines a neural network with a flexible visual memory (similar to a database) to solve this issue. Our system breaks down image classification into two simpler steps: first, it uses an existing network to represent an image, and second, it searches the memory database for similar images. Based on these similar images (or 'neighbors'), the model then forms a decision like 'this is a tabby cat'. The memory component allows the model to be more flexible, enabling new data to be added easily, even up to a billion images, and also allowing old data to be removed. The model's decision-making process is also clearer, which means people can better understand and control its behavior. In this article, we show that a simple approach works well, especially when combined with a new "Rank Voting" method, which improves accuracy compared to other methods. But more importantly, this system enables flexible abilities and makes deep learning models more adaptable for real-world tasks where things are constantly changing.
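The two-step pipeline described above (embed the query image, then let its nearest neighbors in the memory vote on a label) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the brute-force similarity search, and the specific 1/rank weighting used to realize "Rank Voting" here are assumptions made for the example; the released code at the link below is the authoritative version.

```python
import numpy as np

def rank_voting_classify(query_emb, memory_embs, memory_labels, k=10):
    """Classify a query embedding via rank-weighted voting over its
    k nearest neighbors in an explicit visual memory.

    query_emb:     (d,) L2-normalized embedding of the query image.
    memory_embs:   (n, d) L2-normalized embeddings stored in the memory.
    memory_labels: (n,) class label for each stored embedding.
    """
    # Cosine similarity reduces to a dot product for normalized vectors.
    sims = memory_embs @ query_emb
    # Indices of the k most similar memory entries, best first.
    top = np.argsort(-sims)[:k]
    # Each neighbor votes for its label with weight 1/rank (an assumed
    # instantiation of rank-based voting), so closer neighbors count more.
    scores = {}
    for rank, idx in enumerate(top, start=1):
        label = memory_labels[idx]
        scores[label] = scores.get(label, 0.0) + 1.0 / rank
    return max(scores, key=scores.get)
```

Because the memory is just an array of embeddings and labels, the flexible capabilities follow directly: adding data is appending rows, and unlearning is deleting them; no retraining of the embedding network is needed.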
Link To Code: https://github.com/google-deepmind/visual-memory
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: retrieval, memory, interpretability, unlearning
Submission Number: 11464