Taming the Noisy Oracle: Robust Entity-Centric Question Answering via Learning from Imperfect Feedback

Binyamin Perets; Zohar Shnaider; Dvir Aran; Shie Mannor

Published: 29 Sept 2025, Last Modified: 12 Oct 2025. Venue: NeurIPS 2025 - Reliable ML Workshop. License: CC BY 4.0

Keywords: Entity-Centric Question Answering (ECQA), Robust Machine Learning, Multi-Armed Bandits, Imperfect Data

TL;DR: We reframe question answering as a multi-armed bandit problem, creating a robust and cost-effective framework that strategically learns from the LLM's noisy and imperfect feedback.

Abstract: Entity-centric question answering (ECQA) is the problem of selecting which entities from a large, predefined set are most relevant to a given set of observations. This task highlights a critical challenge for robust machine learning: reliably extracting factual knowledge from LLMs when they are treated as imperfect, black-box information sources, especially with long, heterogeneous inputs. For example, given the genes active in a disease, scientists want to identify which biological processes are involved, a task that demands high reliability. Current approaches attempt to achieve robustness through consensus ranking or iterative validation, but these methods incur "token explosion," where costs scale poorly with the number of candidate entities, making them impractical. We introduce ARISE (Adaptive Residual Information Sampling Engine), a framework that reframes ECQA as a problem of sequential decision-making under structured, imperfect feedback. Our key insight is that each query provides a form of biased data: noisy side-observations about related entities. We leverage this insight with DUETS Bandit (DUal Experts for Turbid side-Observations with Stochastic feedback graph), a novel online learning algorithm designed for this setting. DUETS employs dual expert advisors to navigate this uncertainty: a GraphExpert that models prior knowledge as a stochastic feedback graph to handle data biases, and a NoiseExpert that strategically queries the LLM to maximize observation quality. Confirmation Atoms validate outputs and update internal beliefs in this interactive environment. This architecture enables statistically rigorous hypothesis testing with formal p-values, creating a robust and reliable system that dramatically reduces query complexity.
Preliminary results on synthetic data are promising, and we are currently evaluating ARISE on the challenge of pathway enrichment analysis using 180+ annotated gene expression datasets, a domain where robustness to distribution shift (novel experimental data) is paramount.
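
The setting described in the abstract, where querying one entity also yields noisy side-observations about related entities, can be illustrated with a generic bandit loop over a stochastic feedback graph. The sketch below is not the DUETS algorithm and contains no GraphExpert, NoiseExpert, or Confirmation Atoms; it is a plain UCB1-style rule with side-observations, and every name in it (`run_side_observation_ucb`, `true_means`, `reveal_prob`) is a hypothetical choice made for illustration. It assumes Bernoulli rewards and a known matrix of reveal probabilities.

```python
import math
import random


def run_side_observation_ucb(true_means, reveal_prob, rounds, seed=0):
    """UCB1-style loop with side-observations (illustrative, not DUETS).

    Querying arm i always yields a noisy reward for i, and additionally
    reveals a noisy reward for arm j with probability reveal_prob[i][j].
    All names and the Bernoulli reward model are assumptions.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k   # observations per arm (direct plus side)
    sums = [0.0] * k   # sum of observed rewards per arm
    pulls = [0] * k    # direct queries per arm

    def observe(j):
        # Draw one Bernoulli observation of arm j and record it.
        reward = 1.0 if rng.random() < true_means[j] else 0.0
        counts[j] += 1
        sums[j] += reward

    for t in range(1, rounds + 1):
        unobserved = [i for i in range(k) if counts[i] == 0]
        if unobserved:
            # Query any arm that has no observation yet.
            arm = unobserved[0]
        else:
            # Otherwise pick the arm with the largest upper confidence bound.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        pulls[arm] += 1
        observe(arm)
        # Stochastic feedback graph: each neighbour is revealed independently.
        for j in range(k):
            if j != arm and rng.random() < reveal_prob[arm][j]:
                observe(j)

    estimates = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
    return estimates, counts, pulls


if __name__ == "__main__":
    means = [0.1, 0.2, 0.9, 0.3]                 # hypothetical relevance scores
    graph = [[0.5] * 4 for _ in range(4)]        # hypothetical reveal probabilities
    est, counts, pulls = run_side_observation_ucb(means, graph, rounds=3000)
    print("estimates:", [round(e, 2) for e in est])
    print("observations:", counts, "direct queries:", pulls)
```

Because side-observations tighten the confidence bounds of arms that were never queried directly, the number of direct queries can be smaller than in a bandit without a feedback graph, which is the intuition behind the reduced query complexity claimed above.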

Submission Number: 105
