Diffusion-Pretrained Dense and Contextual Embeddings

Published: 18 Apr 2026 · Last Modified: 23 Apr 2026 · ACL 2026 Industry Track Oral · CC BY 4.0
Keywords: diffusion LM, continued pretraining, quantization-aware training, multilingual embeddings, retrieval, contextualized embeddings, contrastive learning
Abstract: We introduce pplx-embed, a family of multilingual embedding models that apply multi-stage contrastive learning to a diffusion-pretrained language model backbone for web-scale retrieval. Because diffusion-based pretraining equips the backbone with bidirectional attention, our models capture full bidirectional context within passages, enabling mean pooling that better preserves global context across long documents. We release pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 achieves state-of-the-art results on the ConTEB benchmark.
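The abstract's central mechanism is mean pooling over token embeddings from a bidirectionally attentive encoder. The sketch below illustrates the standard masked mean-pooling computation as commonly paired with bidirectional encoders; it is not the released pplx-embed implementation, and the function name and tensor layout are illustrative assumptions.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings into a single passage embedding.

    last_hidden_state: (batch, seq_len, hidden) token embeddings from a
    bidirectional encoder. attention_mask: (batch, seq_len), 1 for real
    tokens and 0 for padding. Padding positions are excluded from the mean.
    """
    # Broadcast the mask over the hidden dimension: (batch, seq_len, 1).
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    # Sum only the unmasked token embeddings: (batch, hidden).
    summed = (last_hidden_state * mask).sum(dim=1)
    # Count real tokens per sequence, avoiding division by zero.
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts
```

Because every token embedding from a bidirectional encoder already attends over the whole passage, averaging them aggregates global context rather than relying on a single position (e.g., a last-token or [CLS] representation), which is the motivation the abstract gives for choosing mean pooling on long documents.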
Submission Type: Deployed
Copyright Form: pdf
Submission Number: 228