Keywords: Dense Features, Vision Foundation Models
TL;DR: We increase the feature-map resolution of dense-feature extractors such as the DINO model family.
Abstract: Pretrained on large image datasets, recent dense-feature extractors can produce very fine-grained spatial feature maps, enabling state-of-the-art performance on spatial reasoning tasks. However, computing these feature maps requires the input image to be available at very high resolution and demands large amounts of compute due to the quadratic complexity of the transformer architecture. To address these issues, we propose BRIXEL, a simple knowledge distillation approach in which the student learns to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms baseline models by large margins on downstream tasks when the resolution is kept fixed, allowing for more efficient spatial reasoning.
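The compute argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It assumes a ViT-style backbone with a hypothetical patch size of 16 and counts only pairwise self-attention interactions. Neither the patch size nor the image sizes are taken from the submission, and the function names are illustrative.

```python
def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Patch tokens produced for an image (patch size 16 is an assumed value)."""
    return (height // patch) * (width // patch)


def attention_pairs(height: int, width: int, patch: int = 16) -> int:
    """Token pairs compared by one self-attention layer: quadratic in token count."""
    n = num_tokens(height, width, patch)
    return n * n


# Doubling the side length quadruples the tokens and multiplies
# the pairwise attention cost by sixteen.
low = attention_pairs(224, 224)
high = attention_pairs(448, 448)
ratio = high / low
print(num_tokens(224, 224), num_tokens(448, 448), ratio)
```

Under these assumptions, obtaining a feature map with twice the spatial resolution by simply feeding a larger image costs roughly sixteen times more attention compute, which is the overhead that producing higher-resolution features from a fixed-resolution input is meant to avoid.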
Submission Number: 44