Track: Proceedings Track
Keywords: foundation models, vision transformers, in-context learning, 3d viewpoint generalization, object segmentation, benchmarking, contrastive pretraining
TL;DR: We extend Hummingbird with MVImgNet to evaluate how foundation model encoders perform in-context object segmentation across unseen 3D viewpoints.
Abstract: This paper extends the Hummingbird framework with the Multi-View ImageNet (MVImgNet) dataset to evaluate how foundation model image encoders handle in-context object segmentation under unseen camera angles. We group MVImgNet object views by viewpoint, construct memory banks from a selected subset of angles, and measure generalization on the held-out angles. In addition to seven pretrained Vision Transformer (ViT) models (CLIP, DINO, DINOv2, DINOv3, SigLIP2, C-RADIOv2, and TIPS), we include VGGT, a geometry-grounded ViT trained for multi-view 3D scene understanding. Our results show that DINO-based encoders remain competitive across large viewpoint shifts, whereas 3D-aware models such as VGGT require a dedicated multi-view implementation to reveal their geometric reasoning capabilities. These findings highlight the benefits of contrastive pretraining for robustness to large viewpoint shifts. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.
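For illustration, the snippet below is a minimal sketch of the viewpoint-based split described in the abstract: views of one object below an azimuth threshold go into the memory bank, and the remaining (unseen) angles are held out for evaluation. The `View` class, the `split_by_viewpoint` helper, and the 90-degree threshold are hypothetical and not taken from the released code.

```python
# Minimal sketch of a viewpoint-based memory-bank / held-out split
# for one MVImgNet object track. Names and thresholds are illustrative.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class View:
    image_path: str
    azimuth_deg: float  # camera azimuth around the object, in degrees


def split_by_viewpoint(
    views: List[View],
    memory_max_azimuth: float = 90.0,  # hypothetical threshold
) -> Tuple[List[View], List[View]]:
    """Views at or below the azimuth threshold form the memory bank;
    the remaining (unseen) angles are held out for evaluation."""
    memory_views = [v for v in views if v.azimuth_deg <= memory_max_azimuth]
    heldout_views = [v for v in views if v.azimuth_deg > memory_max_azimuth]
    return memory_views, heldout_views


if __name__ == "__main__":
    # Toy object track: 18 views spaced 20 degrees apart.
    track = [View(f"obj_{i:03d}.jpg", azimuth_deg=i * 20.0) for i in range(18)]
    memory, heldout = split_by_viewpoint(track)
    print(len(memory), "memory-bank views;", len(heldout), "held-out views")
```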
Submission Number: 37