Adaptive Structured Transformation: Mitigating Distribution Shift in Dense Retrieval Through Training-Time Preprocessing

Published: 02 Mar 2026, Last Modified: 02 Apr 2026 · ICLR 2026 Workshop DATA-FM · CC BY 4.0
Keywords: dense retrieval, distribution shift, negative transfer, document transformation
TL;DR: Training dense retrievers on structurally transformed documents prevents performance degradation from misaligned synthetic training data.
Abstract: Dense retrieval models are trained under the assumption that finetuning on task-relevant queries improves performance, yet this assumption can break down when the training data contains synthetic components or originates from a distribution misaligned with the target tasks. We find that in such scenarios, naively finetuning on seemingly relevant data can result in negative transfer, causing significant degradation relative to not finetuning at all. We propose Adaptive Structured Transforms (AStrucT), an automatic preprocessing technique that leverages off-the-shelf Large Language Models (LLMs) to organize training documents into domain-specific structures prior to finetuning. These domain-specific schemas are generated from a small sample of target-domain passages; notably, this induction is fully automated---requiring no human inspection of the target documents and no knowledge of test-time queries---preserving data privacy in settings where manual review is restricted. Across three model scales and twelve diverse domains (BRIGHT), AStrucT yields an average improvement of 3.77 percentage points (pp) nDCG@10 over direct finetuning, and 1.10 pp over the pretrained baseline, consistently mitigating negative transfer. Furthermore, we find that these improvements are driven by the structural inductive bias of the transformation rather than the density of query-conditioned content retention. These findings provide a practical strategy for practitioners to finetune an embedding model for retrieval without eyes-on access to the underlying documents or test-time queries.
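The abstract describes a two-stage preprocessing pipeline: induce a domain-specific schema from a small sample of target-domain passages, then reorganize every training document under that schema before finetuning. The sketch below illustrates that flow in Python. It is a minimal illustration only, assuming a generic `llm` callable (prompt in, text out); the function names, prompts, and schema format are hypothetical and not taken from the paper.

```python
# Hedged sketch of the AStrucT-style preprocessing flow described in the
# abstract. The `llm` argument is any callable mapping a prompt string to a
# response string; prompts and helper names below are illustrative
# assumptions, not the paper's actual implementation.

def induce_schema(llm, sample_passages):
    """Induce a domain-specific schema from a few target-domain passages.

    Fully automated: no human reads the passages, and no test-time
    queries are used.
    """
    prompt = (
        "Given the following passages from one domain, propose a set of "
        "section headings capturing their recurring structure:\n\n"
        + "\n---\n".join(sample_passages)
    )
    return llm(prompt)


def transform_document(llm, schema, document):
    """Reorganize one training document under the induced schema."""
    prompt = (
        "Rewrite the document below so its content is organized under "
        f"these headings:\n{schema}\n\nDocument:\n{document}"
    )
    return llm(prompt)


def preprocess_corpus(llm, target_samples, training_docs):
    """Full preprocessing step applied before finetuning the retriever.

    The schema is induced once from target-domain samples, then applied
    to every training document.
    """
    schema = induce_schema(llm, target_samples)
    return [transform_document(llm, schema, doc) for doc in training_docs]
```

The retriever would then be finetuned on the transformed documents in place of the originals; per the abstract, the benefit comes from the structural inductive bias this imposes, not from any query-specific content selection.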
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 109