FeatSharp: Your Vision Model Features, Sharper

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · License: CC BY-SA 4.0
TL;DR: We introduce a method for upsampling vision model features by jointly leveraging the low-resolution buffer and a mosaic of higher-resolution tiles.
Abstract: The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g., semantic segmentation, object detection, depth perception) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general-purpose vision backbones is the Vision Transformer (ViT), typically trained with a contrastive loss (e.g., CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that they are inflexibly low resolution: most run at $224 \times 224$px, while the "high-resolution" variants operate at around $378$-$448$px and remain just as inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-resolution vision encoders while picking up fine-grained details that would otherwise be lost to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model training using RADIO, where it provides richer targets for distillation. Code available at https://github.com/NVlabs/FeatSharp
Lay Summary: Modern computer vision models, while powerful, often cannot process high-resolution images, or can only produce representations at low resolution. This makes them hard to use for tasks that require high resolution, such as detecting small objects in an image (e.g., finding a bird flying in the sky) or labeling every pixel in the scene with a category (e.g., "bird", "sky", "tree"). We present a method that enables low-resolution-only vision models to produce high-resolution representations by carefully upsampling their features, combined with additional passes of the model over image crops (called tiling) to capture details of small objects that would otherwise be too small to encode properly. In doing so, we demonstrate improvements on various dense-task benchmarks across numerous base vision models. A minimal sketch of the two-pass idea appears below.
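To make the tiling-plus-upsampling idea concrete, here is a minimal PyTorch sketch of the two-pass scheme the summary describes: one global low-resolution pass plus a mosaic of tile passes, fused into a single higher-resolution feature map. The function name `tiled_features`, the `encoder` interface (an image-to-feature-map module such as a frozen CLIP ViT returning a `(B, D, h, w)` tensor), and the simple averaging fusion are all illustrative assumptions, not the actual FeatSharp API; FeatSharp learns the fusion step, so see the linked repository for the real implementation.

```python
# Hypothetical sketch of the "low-res buffer + mosaic of tiles" scheme.
# `encoder`, `tile_size`, and the averaging fusion are assumptions for
# illustration only; the paper's method learns the fusion.
import torch
import torch.nn.functional as F

def tiled_features(encoder, image, tile_size=224, grid=2):
    """Combine a global low-res feature buffer with a grid x grid
    mosaic of tile features into one higher-res feature map."""
    B, C, H, W = image.shape

    # 1) Global pass: downsample the full image to the encoder's
    #    native resolution and extract the low-res feature buffer.
    global_lowres = F.interpolate(image, size=(tile_size, tile_size),
                                  mode='bilinear', align_corners=False)
    global_feats = encoder(global_lowres)            # (B, D, h, w)

    # 2) Tiled pass: resize the image so it splits evenly into a
    #    grid x grid mosaic of native-resolution tiles.
    mosaic = F.interpolate(image, size=(grid * tile_size, grid * tile_size),
                           mode='bilinear', align_corners=False)
    tiles = mosaic.unfold(2, tile_size, tile_size) \
                  .unfold(3, tile_size, tile_size)   # (B, C, g, g, t, t)
    tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, tile_size, tile_size)
    tile_feats = encoder(tiles)                      # (B*g*g, D, h, w)

    # 3) Reassemble per-tile feature maps into one high-res map.
    D, h, w = tile_feats.shape[1:]
    tile_feats = tile_feats.reshape(B, grid, grid, D, h, w) \
                           .permute(0, 3, 1, 4, 2, 5) \
                           .reshape(B, D, grid * h, grid * w)

    # 4) Fuse: upsample the global buffer to the mosaic resolution and
    #    combine (a plain average here; FeatSharp learns this step).
    global_up = F.interpolate(global_feats, size=(grid * h, grid * w),
                              mode='bilinear', align_corners=False)
    return 0.5 * (global_up + tile_feats)
```

The design intuition: the global pass preserves scene-level coherence that individual tiles lack, while the tile passes recover fine-grained detail (e.g., small objects) that the downsampled global view destroys; the fusion step reconciles the two.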
Link To Code: https://github.com/NVlabs/FeatSharp
Primary Area: Applications->Computer Vision
Keywords: computer vision, perception, upsampling
Submission Number: 2573