Keywords: Contextual Bandits, Query Rewriting, Large Language Models, Hallucination Mitigation
TL;DR: QueryBandit uses a contextual bandit over 17 linguistic features to choose among five rewrite strategies, achieving an 87.5% win rate on perturbed QA queries (vs. 44.9% paraphrase, 27.2% expansion).
Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have been accompanied by a higher prevalence of hallucination, yet most mitigation work focuses on after-the-fact filtering rather than shaping the queries that trigger them. We introduce QueryBandits, a bandit framework that selects rewrite strategies to maximize a reward model encapsulating hallucination propensity as a function of the sensitivities of 17 linguistic features of the input query, thereby proactively steering LLMs away from generating hallucinations. Across 13 diverse QA benchmarks with 1,050 lexically perturbed queries per dataset, our top contextual QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a no-rewrite baseline and also outperforms zero-shot static prompting ("paraphrase" or "expand") by 42.6 and 60.3 percentage points, respectively. These results empirically substantiate the effectiveness of QueryBandits in mitigating hallucination through an intervention that takes the form of a query rewrite. Interestingly, certain static prompting strategies, which constitute a considerable portion of the current query rewriting literature, incur higher cumulative regret than the no-rewrite baseline, signifying that static rewrites can worsen hallucination. Moreover, the converged per-arm regression feature weight vectors show that no single rewrite strategy is optimal for all queries. In this context, guided rewriting that exploits semantic features via QueryBandits can induce significant shifts in output behavior through forward-pass mechanisms alone, bypassing the need for retraining or gradient-based adaptation.
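For concreteness, below is a minimal sketch of the kind of contextual Thompson Sampling bandit the abstract describes: per-arm Bayesian linear regression over a 17-dimensional query-feature vector, with posterior weight samples used to pick a rewrite arm. The class name, the arm names other than "paraphrase" and "expand", the feature values, and the reward plumbing are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class ContextualThompsonBandit:
    """Per-arm Bayesian linear regression with Thompson Sampling (illustrative)."""

    def __init__(self, n_arms: int, n_features: int, noise_scale: float = 0.5):
        self.noise_scale = noise_scale
        # Ridge prior per arm: precision matrix A and vector b, with mean = A^{-1} b.
        self.A = [np.eye(n_features) for _ in range(n_arms)]
        self.b = [np.zeros(n_features) for _ in range(n_arms)]

    def select_arm(self, x: np.ndarray) -> int:
        """Sample a weight vector from each arm's posterior; pull the argmax arm."""
        scores = []
        for A, b in zip(self.A, self.b):
            mean = np.linalg.solve(A, b)
            cov = (self.noise_scale ** 2) * np.linalg.inv(A)
            w = np.random.multivariate_normal(mean, cov)
            scores.append(float(x @ w))
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        """Rank-one posterior update for the pulled arm only."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# "paraphrase" and "expand" come from the paper; the other arm names are placeholders.
ARMS = ["paraphrase", "expand", "decompose", "disambiguate", "simplify"]
bandit = ContextualThompsonBandit(n_arms=len(ARMS), n_features=17)

x = np.random.rand(17)   # stand-in for the 17 linguistic features of a query
arm = bandit.select_arm(x)
reward = 1.0             # would come from the hallucination-propensity reward model
bandit.update(arm, x, reward)
print(f"chosen rewrite strategy: {ARMS[arm]}")
```

The converged per-arm weight vectors in this setup are what the abstract refers to: because each arm fits its own regression over the query features, different feature profiles favor different rewrite strategies, which is why no single static rewrite dominates.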
Submission Type: Research Paper (4-9 Pages)
NeurIPS Resubmit Bundle: pdf
NeurIPS Resubmit Summary: The reviewers highlighted strengths in our novel framing, empirical coverage, and interesting findings. In our rebuttals, we addressed the reward formulation via Pareto optimization and AUC-ROC learning curves, and added further baselines (QueryBandits outperforms GPT-4o and the open-source DoLa/ICD/TruthX). Our original scores were 4/4/3.
NeurIPS Resubmit Attestation: I am an author of the referenced NeurIPS 2025 submission. I have the right to share the anonymous reviews/meta-review for the exclusive use of the workshop PCs/reviewers. I understand they will not be redistributed publicly.
Submission Number: 143