Memory-Efficient Multilingual Embeddings with a Diffusion-LM Backbone

Published: 03 Mar 2026, Last Modified: 25 Apr 2026, ICLR 2026 Workshop MemAgents, CC BY 4.0
Keywords: quantization-aware training, multilingual embeddings, retrieval, contextualized embeddings
Abstract: Dense textual embeddings are essential for web-scale search and retrieval-augmented generation, but their high memory and storage costs limit deployment across billions of documents. We introduce pplx-embed, a family of multilingual text embedding models that employs native quantization-aware training (QAT) throughout its multi-stage contrastive training pipeline, producing INT8 embeddings by default that achieve retrieval performance competitive with full-precision baselines while reducing memory and storage by 4x. Furthermore, by leveraging a diffusion-based language model backbone with bidirectional attention, the pplx-embed family captures global document context more comprehensively than causal autoregressive models, enabling superior performance on document-level retrieval tasks. We release two model variants: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. Extensive evaluations on public and internal benchmarks demonstrate the effectiveness of our quantization-aware training in producing memory- and storage-efficient first-stage retrievers with only minimal losses in retrieval quality. Our internal evaluation suite focuses on real-world, large-scale search scenarios constructed from 1B production web pages. We publicly release the pretrained pplx-embed-v1 and pplx-embed-context-v1 model weights to facilitate further research on memory-efficient multilingual retrieval.
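As a rough illustration of the kind of fake-quantization step that quantization-aware training typically inserts into the embedding forward pass, the sketch below simulates symmetric per-vector INT8 storage with a straight-through estimator. This is a generic example under assumed details (PyTorch, per-vector scales, INT8 range of [-127, 127]); it is not the released pplx-embed implementation.

```python
# Hypothetical sketch of symmetric INT8 fake-quantization for embedding
# vectors during contrastive training. Illustrative only; the paper's
# actual QAT procedure may differ.
import torch


def fake_quantize_int8(x: torch.Tensor) -> torch.Tensor:
    """Simulate INT8 storage of embeddings in the training forward pass.

    A per-vector scale maps values into [-127, 127]; rounding is made
    differentiable with a straight-through estimator so gradients flow
    to the full-precision parameters.
    """
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127)
    dequant = q * scale
    # Straight-through estimator: forward uses the quantized values,
    # backward treats quantization as the identity function.
    return x + (dequant - x).detach()


# Usage: quantize a batch of L2-normalized embeddings before computing
# similarity scores for the contrastive loss, so the model learns
# representations that survive INT8 storage at inference time.
emb = torch.nn.functional.normalize(torch.randn(4, 1024), dim=-1)
emb_q = fake_quantize_int8(emb)
scores = emb_q @ emb_q.T  # similarities on quantization-aware embeddings
```

Training against the quantized forward pass, rather than quantizing post hoc, is what allows the INT8 index to retain near full-precision retrieval quality while cutting vector storage by roughly 4x relative to FP32.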
Submission Number: 76