Nomic Embed: Training a Reproducible Long Context Text Embedder

Published: 26 Feb 2025, Last Modified: 26 Feb 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long-context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast to other open-source models, we also release the full curated training data and code, allowing full replication of nomic-embed-text-v1. Code and data to replicate the model are available at https://github.com/nomic-ai/contrastors
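For readers who want to try the released weights directly, the following is a minimal sketch, not taken from the paper: it assumes the model is published on the Hugging Face Hub as nomic-ai/nomic-embed-text-v1, that it loads through sentence-transformers with trust_remote_code, and that inputs use the "search_document:" / "search_query:" task prefixes; none of these specifics are stated in the abstract above.

```python
# Minimal sketch: embedding documents and a query with the released model.
# Assumptions (not stated in the abstract): the Hub id "nomic-ai/nomic-embed-text-v1",
# trust_remote_code=True, and the "search_document:" / "search_query:" prefixes.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

docs = [
    "search_document: Nomic Embed is a long-context English text embedding model.",
    "search_document: The training code and curated data are released under Apache 2.0.",
]
query = "search_query: reproducible long context text embedder"

doc_emb = model.encode(docs)
query_emb = model.encode(query)

# Rank documents by cosine similarity to the query.
print(util.cos_sim(query_emb, doc_emb))
```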
Certifications: Reproducibility Certification
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=dmWHFxu3Cq
Changes Since Last Submission:
* Fixed accidental styling override
* Small typo fixes
* Moved table titles above the tables

Rebuttal revisions:
* Removed background on RoPE
* Added longer descriptions to tables
* Standardized B -> billion, M -> million
* Added table with finetuning data distribution and clarified negative mining approach
* Clarified consistency filtering approach
* Reordered tables by descending score and separated them by model and sequence length

Camera ready:
* Fixed table descriptions
* Deanonymized
Code: https://github.com/nomic-ai/contrastors
Assigned Action Editor: ~Chunyuan_Li1
Submission Number: 3696