Test-Time Canonicalization by Foundation Models for Robust Perception

Published: 01 May 2025 · Last Modified: 16 Aug 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Invariance to diverse and complex transformations by leveraging knowledge in vision foundation models.
Abstract: Perception in the real world requires robustness to diverse viewing conditions. Existing approaches often rely on specialized architectures or training with predefined data augmentations, limiting adaptability. Taking inspiration from mental rotation in human vision, we propose FoCal, a test-time robustness framework that transforms the input into its most typical view. At inference time, FoCal explores a set of transformed images and chooses the one with the highest likelihood under foundation model priors. This test-time optimization requires no retraining or architectural changes, yet, applied to models such as CLIP and SAM, it significantly boosts robustness across a wide range of transformations, including 2D and 3D rotations, contrast and lighting shifts, and day-night changes. We also explore potential applications in active vision. By reframing invariance as a test-time optimization problem, FoCal offers a general and scalable approach to robustness. Our code is available at: https://github.com/sutkarsh/focal.
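To make the candidate-and-score loop concrete, here is a minimal sketch in Python, assuming PyTorch and OpenAI's `clip` package. The rotation-only candidate set and the CLIP similarity to a generic text prompt are illustrative stand-ins for the foundation-model priors described above, not the exact scoring used in the released code.

```python
# A minimal sketch of test-time canonicalization, assuming PyTorch and
# OpenAI's `clip` package (pip install git+https://github.com/openai/CLIP).
# The candidate set (90-degree rotations) and the CLIP-based typicality
# score are illustrative choices, not the paper's exact priors.
import torch
import clip
from torchvision.transforms import functional as TF

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# A generic prompt acting as a crude "typical view" prior (an assumption
# here; the paper also explores diffusion-model likelihoods).
text = clip.tokenize(["a photo"]).to(device)

@torch.no_grad()
def canonicalize(pil_image, angles=(0, 90, 180, 270)):
    """Return the candidate view the CLIP prior finds most typical."""
    candidates = [TF.rotate(pil_image, a) for a in angles]
    batch = torch.stack([preprocess(c) for c in candidates]).to(device)
    image_feats = model.encode_image(batch)
    text_feats = model.encode_text(text)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    scores = (image_feats @ text_feats.T).squeeze(-1)  # typicality per candidate
    best = scores.argmax().item()
    return candidates[best]  # feed this canonical view to any downstream model
```

The same loop generalizes to other transformation families (e.g., contrast or viewpoint changes) by swapping the candidate generator, and to other priors (e.g., diffusion-model likelihoods) by swapping the scoring function.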
Lay Summary: Current AI vision systems struggle with everyday transformations that humans handle effortlessly. Show a robot an upside-down chair or a strangely lit room, and it might fail completely. That's because these systems are trained on near-perfect internet photos, not the messy reality they encounter in the real world, and re-training the entire system for each new messy condition would be expensive and impractical. The key insight: AI models trained on billions of internet images already know what objects typically look like. Our method, FoCal, taps into this knowledge by testing different versions of an image (rotating it, adjusting lighting, changing viewpoints) and picking the one that big models like CLIP and Stable Diffusion find most familiar. It's like how you'd mentally rotate an upside-down photo to understand it. This approach works remarkably well, even for complex real-world transformations (such as viewpoint and day-night changes) that have been very challenging for previous approaches. As a bonus, FoCal requires no re-training and works with any existing vision system. Our work is a step towards making vision systems reliable in real-world conditions, which is crucial for applications like home robots and self-driving cars.
Link To Code: https://github.com/sutkarsh/focal
Primary Area: Deep Learning->Robustness
Keywords: equivariance, invariance, CLIP, diffusion, canonicalization, ecological, unsupervised, test-time scaling
Submission Number: 402