LLM-Guided Hard Negative Mining for Structured Product Data Matching

Published: 25 May 2026, Last Modified: 29 May 2026, Venue: FMSD @ ICML 2026 Poster, License: CC BY 4.0
Keywords: Structured Foundation Models, Structured Retrieval, Product Entity Matching, Hard Negative Mining, LLM-Guided Augmentation, Data-Efficient Retrieval.
TL;DR: We identify a structural limitation of in-batch contrastive training for structured product retrieval and show that LLM-synthesized hard negatives partially overcome it.
Abstract: Dense bi-encoder models for product entity matching rely on in-batch negatives during contrastive training, which exposes the model only to semantically distant non-matches and leaves near-duplicate hard cases entirely unseen. We demonstrate empirically that this creates a persistent ceiling on top-1 retrieval accuracy (Acc@1): on the WDC LSPC Computers benchmark, Acc@1 remains flat at ∼4.5% regardless of whether training data is increased 10-fold or training is extended to 15 epochs, and this ceiling replicates across all four WDC LSPC product categories (Computers, Cameras, Watches, Shoes). To partially overcome it, we propose LLM-HN, which uses GPT-4o-mini as a controllable structured perturbation generator, synthesizing four types of hard negatives (phonetic, component-swap, abbreviation, semantic distractor) under three prompting strategies (zero-shot, few-shot, chain-of-thought), providing attribute-level supervision unavailable to corpus-mining approaches at this data scale. A five-factor ablation reveals that chain-of-thought generation of component-swap negatives at a 4:1 ratio achieves Acc@1 = 0.0439 (±0.0014) and Acc@5 = 0.5506 (±0.0032) (mean ± std, three seeds) using only 10% of labeled data at under $1.50 in total API cost, and Acc@1 = 0.1329 on the held-out test set.
Submission Number: 93