Abstract: This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long-context LoCo benchmark.
We release the training code and model weights under an Apache 2.0 license.
In contrast with other open-source models, we release the full curated training data and the code that allow for full replication of nomic-embed-text-v1. You can find the code and data to replicate the model at https://github.com/nomic-ai/contrastors.
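A minimal usage sketch (not part of the submission): assuming the released weights are published on Hugging Face as nomic-ai/nomic-embed-text-v1 and that the sentence-transformers library is installed, the open weights could be loaded roughly as follows. The model ID, the trust_remote_code flag, and the "search_document:" task prefix are assumptions drawn from common practice, not details stated in this abstract.

from sentence_transformers import SentenceTransformer

# Load the released open weights (model ID assumed; custom modeling code
# requires trust_remote_code=True when loading via sentence-transformers).
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Encode a document; the "search_document:" task prefix is an assumption.
embeddings = model.encode(
    ["search_document: Nomic Embed is an 8192 context length text embedding model."]
)
print(embeddings.shape)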
Certifications: Reproducibility Certification
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=dmWHFxu3Cq
Changes Since Last Submission: * Fixed accidental styling override
* Small typo fixes
* Moved table titles to above the tables
Rebuttal revisions
* Removed background on RoPE
* Added longer descriptions to tables
* Standardized B -> billion, M -> million
* Added table with finetuning data distribution and clarified negative mining approach
* Clarified consistency filtering approach
* Reordered tables by descending score order and separated by model and sequence length
Camera ready
* Fixed table descriptions
* Deanonymized
Code: https://github.com/nomic-ai/contrastors
Assigned Action Editor: ~Chunyuan_Li1
Submission Number: 3696