Nomic Embed: Training a Reproducible Long Context Text Embedder

Published: 26 Feb 2025, Last Modified: 26 Feb 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long-context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast to other open-source models, we also release the full curated training data and code, allowing full replication of nomic-embed-text-v1. Code and data to replicate the model are available at https://github.com/nomic-ai/contrastors
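For readers who want to try the released weights directly, the following is a minimal sketch, not taken from the paper: it assumes the model is published on the Hugging Face Hub as nomic-ai/nomic-embed-text-v1, that it loads through sentence-transformers with trust_remote_code, and that inputs use the "search_document:" / "search_query:" task prefixes; none of these specifics are stated in the abstract above.

```python
# Minimal sketch: embedding documents and a query with the released model.
# Assumptions (not stated in the abstract): the Hub id "nomic-ai/nomic-embed-text-v1",
# trust_remote_code=True, and the "search_document:" / "search_query:" prefixes.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

docs = [
    "search_document: Nomic Embed is a long-context English text embedding model.",
    "search_document: The training code and curated data are released under Apache 2.0.",
]
query = "search_query: reproducible long context text embedder"

doc_emb = model.encode(docs)
query_emb = model.encode(query)

# Rank documents by cosine similarity to the query.
print(util.cos_sim(query_emb, doc_emb))
```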
Certifications: Reproducibility Certification
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=dmWHFxu3Cq
Changes Since Last Submission:
* Fixed accidental styling override
* Small typo fixes
* Moved table titles above the tables

Rebuttal revisions:
* Removed background on RoPE
* Added longer descriptions to tables
* Standardized B -> billion, M -> million
* Added table with finetuning data distribution and clarified negative mining approach
* Clarified consistency filtering approach
* Reordered tables by descending score and separated them by model and sequence length

Camera ready:
* Fixed table descriptions
* Deanonymized
Code: https://github.com/nomic-ai/contrastors
Assigned Action Editor: ~Chunyuan_Li1
Submission Number: 3696