Keywords: Retrieval‑Augmented Generation (RAG), Pragmatics / Speech Acts, Selective Prediction & Abstention, Uncertainty Calibration & Split‑Conformal, Test‑Time Compute Allocation (Budgeted Inference)
TL;DR: Training‑free PragAURA uses speech‑act cues to route retrieval and calibrate abstention; at fixed compute it improves selective EM/F1 and risk–coverage trade‑offs over a calibrated global‑τ baseline, with split‑conformal as a reference.
Abstract: Retrieval‑augmented generation (RAG) often allocates test‑time compute uniformly and answers even when evidence is weak or conflicting, undermining factuality, groundedness, and safety. We introduce PragAURA, a training‑free strategy that unifies retrieval allocation and abstention by conditioning both on the input’s speech‑act cues. Given a query, PragAURA routes it to act‑specific retrieval profiles (covering the BM25/dense mix, re‑rank depth, and evidence genre), composes evidence under a fixed compute budget, and calibrates selective prediction with an uncertainty score that aggregates inter‑branch disagreement, snippet‑level conflicts, and evidence‑to‑answer entailment.
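A minimal sketch of the act‑conditioned routing and uncertainty aggregation described above; the profile contents, speech‑act labels, and aggregation weights are illustrative assumptions, not the paper's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class RetrievalProfile:
    bm25_weight: float    # mix between BM25 and dense retrieval scores
    rerank_depth: int     # number of candidates the re-ranker scores
    evidence_genre: str   # preferred evidence source/genre

# Illustrative act-specific profiles (assumed values, hypothetical act labels).
PROFILES = {
    "question":  RetrievalProfile(bm25_weight=0.4, rerank_depth=50, evidence_genre="encyclopedic"),
    "directive": RetrievalProfile(bm25_weight=0.6, rerank_depth=20, evidence_genre="how-to"),
    "assertion": RetrievalProfile(bm25_weight=0.5, rerank_depth=30, evidence_genre="news"),
}

def route(query_act: str) -> RetrievalProfile:
    """Map a detected speech act to its retrieval profile (fallback: question)."""
    return PROFILES.get(query_act, PROFILES["question"])

def uncertainty(disagreement: float, conflict: float, entailment: float,
                weights=(0.4, 0.3, 0.3)) -> float:
    """Aggregate inter-branch disagreement, snippet-level conflict, and
    (1 - evidence-to-answer entailment) into a single abstention score;
    the linear weighting is an assumption for illustration."""
    w_d, w_c, w_e = weights
    return w_d * disagreement + w_c * conflict + w_e * (1.0 - entailment)
```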
We pose two questions: (1) Under matched budgets, how much reliability‑per‑compute does act‑conditioned allocation recover over a global threshold? (2) Can per‑act calibration yield favorable risk–coverage trade‑offs against calibrated and split‑conformal baselines? On a 10% SQuAD validation slice, a global‑τ baseline abstains on 44% of queries at Recall@10 = 0.910; enabling conflict‑aware allocation reduces abstention to 23% with unchanged retrieval quality, and per‑act τ further lowers it to 20% while improving Recall@10 to 0.920. On a HotpotQA slice, targeting 30% abstention attains Recall@10 = 0.967. We report selective EM/F1 vs. coverage on SQuAD and replicate the risk–coverage behavior on a HotpotQA slice, all at compute parity (matched documents scored and milliseconds per query). We compare against a calibrated global‑τ baseline and a lightweight split‑conformal threshold computed on a small calibration split.
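A sketch of how the split‑conformal and per‑act thresholds could be computed from uncertainty scores on a held‑out calibration split; the grouping by speech act, the α level, and the function names are assumptions for illustration.

```python
import numpy as np

def split_conformal_threshold(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal abstention threshold: the ceil((n+1)(1-alpha))/n empirical
    quantile of uncertainty (nonconformity) scores on the calibration split.
    Queries whose score exceeds the threshold are abstained on."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, q, method="higher"))

def per_act_thresholds(scores_by_act: dict[str, list[float]], alpha: float = 0.1) -> dict[str, float]:
    """Per-act variant (sketch of the per-act tau idea): calibrate one threshold
    per speech act on that act's calibration scores."""
    return {act: split_conformal_threshold(np.asarray(scores), alpha)
            for act, scores in scores_by_act.items()}
```

As a usage note, a global‑τ baseline corresponds to calling `split_conformal_threshold` once on the pooled calibration scores, whereas the per‑act variant partitions the same calibration split by detected speech act before thresholding.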
Without retriever retraining, and with transparent linguistic grounding via speech acts, PragAURA offers a simple, reproducible test‑time scaling policy that improves coverage at fixed risk and compute for reliable RAG.
Supplementary Material: zip
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 23734