Arctic-Embed 2.0: Multilingual Retrieval Without Compromise

Published: 08 Jul 2025, Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: Multilingual Retrieval, Dense Retrieval, Cross-lingual Transfer
TL;DR: A multilingual embedding model that also performs well on English retrieval — with probing experiments to understand why.
Abstract: This paper presents the training methodology of Snowflake Arctic-Embed 2.0, a set of open-source text embedding models built for effective and efficient multilingual retrieval. While prior multilingual models have suffered from degraded English retrieval quality, Arctic-Embed 2.0 delivers competitive retrieval quality on both multilingual and English-only benchmarks, and supports Matryoshka Representation Learning (MRL) for efficient embedding storage with significantly less quality degradation under compression than alternatives. Beyond describing the design and implementation details, we highlight critical research questions encountered during development, including the mechanisms of cross-lingual transfer in retrieval pre-training and what we term the "English performance gap": the systematic quality difference between specialized English-only models and multilingual alternatives. Through targeted experiments addressing these questions, we derive insights from both positive and negative results, contributing to a broader understanding of multilingual embedding models and aiming to stimulate further research on improving cross-lingual representation quality while maintaining strong monolingual performance.
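The abstract notes MRL support for efficient embedding storage. As a hedged illustration only (this is not the paper's code; the `truncate_embedding` helper and the toy vector below are invented for this sketch), an MRL-trained model packs coarse-to-fine information into the leading dimensions, so a full embedding can be truncated to a prefix and re-normalized for cheaper storage:

```python
# Sketch of Matryoshka-style embedding truncation (illustrative, not from the paper).
# Assumes the embedding model was trained with MRL so that the first k
# dimensions carry a usable coarse representation on their own.
import numpy as np

def truncate_embedding(embedding: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dimensions and L2-renormalize for cosine similarity."""
    truncated = embedding[:k]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Toy usage: compress a unit-norm 8-dim vector down to its first 4 dims.
full = np.array([0.5, 0.5, 0.4, 0.3, 0.3, 0.3, 0.2, 0.1])
full = full / np.linalg.norm(full)
small = truncate_embedding(full, 4)
print(small.shape)                              # (4,)
print(np.isclose(np.linalg.norm(small), 1.0))   # True
```

The renormalization step matters: retrieval with cosine similarity assumes unit-norm vectors, and truncation alone breaks that invariant.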
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1161