DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment.
Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, Oriane Siméoni, Huy V. Vo, Patrick Labatut, Piotr Bojanowski