Diffusion-Pretrained Dense and Contextual Embeddings

Published: 18 Apr 2026 · Last Modified: 23 Apr 2026 · ACL 2026 Industry Track Oral · CC BY 4.0
Keywords: diffusion LM, continued pretraining, quantization-aware training, multilingual embeddings, retrieval, contextualized embeddings, contrastive learning
Abstract: We introduce pplx-embed, a family of multilingual embedding models that apply multi-stage contrastive learning to a diffusion-pretrained language model backbone for web-scale retrieval. Because diffusion-based pretraining equips the backbone with bidirectional attention, our models capture full bidirectional context within passages, enabling mean pooling that better preserves global context across long documents. We release pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 achieves state-of-the-art results on the ConTEB benchmark.
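The abstract's central mechanism is mean pooling over token embeddings from a bidirectionally attentive encoder. The sketch below illustrates the standard masked mean-pooling computation as commonly paired with bidirectional encoders; it is not the released pplx-embed implementation, and the function name and tensor layout are illustrative assumptions.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings into a single passage embedding.

    last_hidden_state: (batch, seq_len, hidden) token embeddings from a
    bidirectional encoder. attention_mask: (batch, seq_len), 1 for real
    tokens and 0 for padding. Padding positions are excluded from the mean.
    """
    # Broadcast the mask over the hidden dimension: (batch, seq_len, 1).
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    # Sum only the unmasked token embeddings: (batch, hidden).
    summed = (last_hidden_state * mask).sum(dim=1)
    # Count real tokens per sequence, avoiding division by zero.
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts
```

Because every token embedding from a bidirectional encoder already attends over the whole passage, averaging them aggregates global context rather than relying on a single position (e.g., a last-token or [CLS] representation), which is the motivation the abstract gives for choosing mean pooling on long documents.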
Submission Type: Deployed
Copyright Form: pdf
Submission Number: 228