Abstract: Ensuring truthfulness in large language models (LLMs) remains a critical challenge for reliable text generation. While supervised fine-tuning and reinforcement learning from human feedback have shown promise, they require substantial annotated data and computational resources, limiting scalability. In contrast, decoding-time interventions offer lightweight alternatives that require no model retraining. However, existing decoding strategies often suffer from prompt sensitivity, limited generalization, or dependence on internal model states. We propose \textbf{Retrieval-Augmented Decoding (RAD)}, a context-aware adaptive decoding method that builds a compact reference grounding space from \textit{as few as 10 annotated examples}, pairing context embeddings with next-token logits from truthful responses, and uses it for retrieval-based logit shaping during inference. At each decoding step, RAD retrieves high-quality, semantically similar contexts from this grounding space and aggregates their associated next-token logits to modify the model's current logits. Across three open-ended question-answering benchmarks and four LLMs, our method consistently outperforms strong baselines and shows robust cross-task generalization, underscoring the promise of context-aware decoding for enhancing factual reliability.
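The abstract does not specify the retrieval metric, the aggregation weights, or how retrieved logits are blended with the model's own, so the following is only a minimal sketch of one plausible reading of the logit-shaping step: cosine-similarity retrieval over the grounding space, softmax-weighted aggregation of the retrieved next-token logits, and linear interpolation with a coefficient `lam`. All function names, hyperparameters, and dimensions here are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def rad_logit_shaping(context_emb, model_logits, grounding_embs,
                      grounding_logits, k=5, lam=0.5):
    """Shape the model's next-token logits using a grounding space of
    (context embedding, truthful next-token logits) pairs.

    k, lam, cosine retrieval, and softmax weighting are illustrative
    assumptions, not the paper's exact formulation.
    """
    # Cosine similarity between the current context and each stored context.
    ctx = context_emb / np.linalg.norm(context_emb)
    refs = grounding_embs / np.linalg.norm(grounding_embs, axis=1, keepdims=True)
    sims = refs @ ctx

    # Retrieve the k most similar grounding-space entries.
    top = np.argsort(sims)[-k:]

    # Softmax-weighted aggregation of the retrieved next-token logits.
    w = np.exp(sims[top] - sims[top].max())
    w /= w.sum()
    retrieved = w @ grounding_logits[top]

    # Interpolate the aggregated retrieved logits with the model's own.
    return (1.0 - lam) * model_logits + lam * retrieved


# Toy usage: a 10-entry grounding space, echoing the "as few as 10
# annotated examples" in the abstract (dimensions are arbitrary).
rng = np.random.default_rng(0)
shaped = rad_logit_shaping(
    context_emb=rng.normal(size=768),
    model_logits=rng.normal(size=32000),
    grounding_embs=rng.normal(size=(10, 768)),
    grounding_logits=rng.normal(size=(10, 32000)),
)
next_token = int(np.argmax(shaped))
```

With a grounding space this small, brute-force cosine search as above is cheap; only if the space grew to many thousands of entries would an approximate nearest-neighbor index become worthwhile.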
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jinghui_Chen1
Submission Number: 6791