Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

Published: 25 May 2026, Last Modified: 25 May 2026 | CTB@ICML 2026 | Everyone | CC BY 4.0
Keywords: sentence embeddings, mean bias, anisotropy, post-processing, benchmark evaluation, error propagation, MMTEB, falsifiable hypotheses
TL;DR: Sentence embeddings carry a consistent mean bias. We propose two training-free corrections and show across 38 models that mild removal helps but full PCA whitening hurts.
Abstract: We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde e + \mu$, where the mean $\mu$ is near-identical across all sentences. We study two training-free corrections---subtracting $\mu$ directly (R1), or projecting each embedding off the mean direction (R2)---and show, via a first-order error-propagation argument, that R2 cancels the parallel component of mean-estimation error that R1 retains. Across 38 models on the Massive Multilingual Text Embedding Benchmark (MMTEB), R2 yields consistent classification gains (paired $\bar t = 3.31$, 29 of 38 models with $t>2$, zero losses), and the per-model mean norm $\Vert\mu\Vert$ correlates with which models benefit most. A nine-method dose-response ablation on five models further reveals that mild single-direction removal helps, but full principal component analysis (PCA) whitening hurts every model we test, and that R2 and All-but-the-Top with depth one agree within $0.18$ pp downstream despite weak geometric alignment between $\hat\mu$ and the centered top principal component.
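As a rough illustration of the two corrections named in the abstract, the following NumPy sketch applies R1 (subtract the estimated mean) and R2 (project each embedding off the mean direction) to synthetic data. The function names, the synthetic embeddings, and the choice to estimate $\hat\mu$ as a plain sample average are assumptions made for this example and are not taken from the paper; any re-normalization step the full method may apply afterwards is omitted. The final lines perturb $\hat\mu$ along its own direction to show the parallel-error behavior the abstract describes: R2 is unchanged, while R1 shifts.

```python
import numpy as np

def estimate_mean(E):
    """Sample mean of the embedding matrix E (rows are embeddings)."""
    return E.mean(axis=0)

def r1_subtract_mean(E, mu_hat):
    """R1 (as described in the abstract): subtract the mean estimate directly."""
    return E - mu_hat

def r2_project_off_mean(E, mu_hat):
    """R2 (as described in the abstract): remove each embedding's component
    along the mean direction. Only the direction of mu_hat is used."""
    u = mu_hat / np.linalg.norm(mu_hat)   # unit vector along the mean
    return E - np.outer(E @ u, u)         # e - (e . u) u for every row

# Synthetic embeddings with a shared offset mu (hypothetical data).
rng = np.random.default_rng(0)
mu = np.zeros(8)
mu[0] = 3.0
E = rng.normal(size=(200, 8)) + mu

mu_hat = estimate_mean(E)
X1 = r1_subtract_mean(E, mu_hat)
X2 = r2_project_off_mean(E, mu_hat)

# An estimation error parallel to mu_hat only rescales it.
mu_hat_parallel_err = 1.1 * mu_hat
X1_err = r1_subtract_mean(E, mu_hat_parallel_err)
X2_err = r2_project_off_mean(E, mu_hat_parallel_err)

print("R1 changed by parallel error:", not np.allclose(X1, X1_err))
print("R2 changed by parallel error:", not np.allclose(X2, X2_err))
```

In this sketch R2 depends on $\hat\mu$ only through its direction, so an error component parallel to $\hat\mu$ cannot affect the output, whereas R1 carries that error into every corrected embedding.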
Paper Type: Short (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 165