BEYOND RETRIEVAL: GENERATIVE EVIDENCE CALIBRATION FOR ANSWER-UTILITY SEARCH

M Mostagir Bhuiyan

02 Sept 2025 (modified: 12 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: head-safe calibration, product-of-experts, generative evidence, hybrid dense–sparse retrieval, early precision, BEIR, oracle upper bound (OUB) & reachability, reranking vs calibration, WRRF blending

TL;DR: GEC calibrates BM25+BGE with portfolio-derived evidence via a guarded PoE. On NQ/FiQA/SciFact it improves MRR@10 over strong fusions and converts a measurable share of oracle headroom without heavy CE reranking or head damage.

Abstract: Strong BM25+BGE fusions often saturate at the head of the ranking, and heavy cross-encoder (CE) rerankers do not consistently help. We introduce GEC, which combines BM25, BGE, and Multi-GES via gPoE-HeadSafe, a calibrated product-of-experts with explicit head-safety guards. We quantify headroom with an oracle upper bound (OUB), bound what is reachable with PRA, and convert part of that gap via an APC pass. On NQ/FiQA/SciFact, gPoE-HeadSafe and/or GEC-WRRF improve MRR@10 over strong BM25+BGE fusions; APC captures a measurable fraction of OUB headroom (0.147 average gap, 58.3% reachable). The pipeline holds up across Mixtral-8x7B-Instruct-v0.1 and Mistral-7B-Instruct-v0.3 with consistent early-precision gains; in our runs, gPoE-HeadSafe typically outperforms CE by ∼0.06–0.08 MRR@10 on Mixtral and ∼0.03–0.05 on Mistral.
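
To make the two central quantities in the abstract concrete, the following is a minimal sketch of a generic weighted product-of-experts fusion over per-retriever scores, together with MRR@10. It is not the paper's gPoE-HeadSafe: the abstract does not specify the calibration or the head-safety guards, so none are implemented here. The function names (`softmax`, `poe_fuse`, `mrr_at_10`), the softmax normalization, the equal weights, and the toy scores are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn one expert's raw scores into a distribution over documents (assumed normalization)."""
    m = max(scores.values())
    exps = {d: math.exp(s - m) for d, s in scores.items()}
    z = sum(exps.values())
    return {d: e / z for d, e in exps.items()}

def poe_fuse(experts, weights, eps=1e-9):
    """Generic weighted product of experts: fused(d) = sum_i w_i * log(p_i(d) + eps).

    Documents missing from an expert get probability 0 (plus eps) from that expert.
    Returns document ids sorted from best to worst.
    """
    dists = [softmax(e) for e in experts]
    docs = set().union(*(e.keys() for e in experts))
    fused = {
        d: sum(w * math.log(p.get(d, 0.0) + eps) for w, p in zip(weights, dists))
        for d in docs
    }
    return sorted(docs, key=lambda d: (-fused[d], d))

def mrr_at_10(rankings, relevant):
    """Mean reciprocal rank of the first relevant document within the top 10."""
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, doc in enumerate(ranked[:10], start=1):
            if doc in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Toy scores standing in for a sparse, a dense, and a generative-evidence expert.
sparse = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
dense = {"d1": 0.0, "d2": 1.0, "d3": 2.0}
evidence = {"d1": 0.0, "d2": 3.0, "d3": 0.0}

ranking = poe_fuse([sparse, dense, evidence], weights=[1.0, 1.0, 1.0])
score = mrr_at_10({"q1": ranking}, {"q1": {"d2"}})
```

In this toy case the sparse and dense experts disagree on the top document, and the third expert's evidence moves `d2` to the head of the fused ranking. A product of experts rewards documents that no expert scores poorly, which is why a guard for the head of an already strong ranking is a natural addition.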

Primary Area: datasets and benchmarks

Supplementary Material: zip

Submission Number: 1084
