Abstract: Retrieval-augmented generation (RAG) systems increasingly rely on evidence gathered from diverse external sources.
A persistent challenge lies in how these systems integrate heterogeneous knowledge: some strategies merge all retrieved content without distinction, while others commit to a single source, thereby neglecting complementary semantics.
At the same time, common re-ranking modules emphasize heuristic relevance scores, leaving the actual effect of retrieved evidence on the generation objective itself unexamined.
This work explores a gradient-informed selection framework for multi-source RAG, designed to be training-free yet tightly coupled with the generation objective.
The key idea is to perform a lightweight backward pass on the language model to estimate each candidate document’s marginal contribution to the generation loss.
By using this signal for subset selection, the framework shifts the focus from relevance-driven ranking to optimization-driven evidence choice.
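To make the idea concrete, the following is a minimal sketch of how such a gradient-informed score could be computed with a single backward pass, assuming a HuggingFace-style causal LM and a draft answer to supervise against; the helper names and the per-document mixing-weight trick are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: score each candidate document by the gradient of the generation loss
# with respect to a per-document mixing weight. Function and variable names
# (score_documents, draft_answer, ...) are hypothetical.
import torch

def score_documents(model, tokenizer, query, documents, draft_answer, device="cpu"):
    """Return one first-order 'marginal contribution' score per candidate document."""
    # Freeze model parameters so the backward pass only produces gradients for `weights`,
    # keeping it lightweight.
    for p in model.parameters():
        p.requires_grad_(False)

    # One scalar weight per document; gradients w.r.t. these weights estimate
    # how much each document's evidence moves the generation loss.
    weights = torch.ones(len(documents), requires_grad=True, device=device)

    # Embed each document separately and scale its embeddings by its weight.
    doc_embeds = []
    for w, doc in zip(weights, documents):
        ids = tokenizer(doc, return_tensors="pt").input_ids.to(device)
        doc_embeds.append(w * model.get_input_embeddings()(ids))

    # Concatenate query, weighted documents, and the draft answer as the LM input.
    q_ids = tokenizer(query, return_tensors="pt").input_ids.to(device)
    a_ids = tokenizer(draft_answer, return_tensors="pt").input_ids.to(device)
    q_emb = model.get_input_embeddings()(q_ids)
    a_emb = model.get_input_embeddings()(a_ids)
    inputs = torch.cat([q_emb] + doc_embeds + [a_emb], dim=1)

    # Supervise only the answer tokens; context positions are masked out with -100.
    n_ctx = inputs.shape[1] - a_ids.shape[1]
    labels = torch.cat(
        [torch.full((1, n_ctx), -100, dtype=torch.long, device=device), a_ids], dim=1
    )

    # Single backward pass: d(loss)/d(weight_i) indicates how strongly document i
    # contributes to reducing the generation loss.
    loss = model(inputs_embeds=inputs, labels=labels).loss
    loss.backward()
    return (-weights.grad).tolist()  # larger = more helpful under this first-order view
```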
In addition, query alignment and redundancy across sources are jointly considered to identify which source combinations should be retained before the re-ranking step.
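One common way to operationalize such a joint criterion is a greedy, MMR-style selection over source embeddings; the sketch below is only an illustrative reading under that assumption (the trade-off weight `lam`, the use of cosine similarity, and the function names are ours, not the paper's).

```python
# Sketch: greedily pick sources that are aligned with the query while penalizing
# redundancy with sources already selected.
import numpy as np

def select_sources(query_vec, doc_vecs, k, lam=0.5):
    """Greedily pick k documents that are query-aligned but mutually non-redundant."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            align = cos(query_vec, doc_vecs[i])  # alignment with the query
            redund = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * align - (1 - lam) * redund  # penalize overlap with chosen sources
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```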
We theoretically show that this gradient-guided procedure approximates the subset that minimizes generation loss and aligns with minimizing a leave-one-out upper bound.
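In symbols (a schematic rendering in our own notation, not a statement lifted from the paper), the first-order link between the gradient score and a leave-one-out quantity can be written as:

```latex
% Leave-one-out contribution of document d_i, approximated by one backward pass.
% w_i is an auxiliary mixing weight on d_i, fixed to 1 in the forward computation.
\[
\Delta_i
  \;=\; \mathcal{L}\!\left(y \mid q,\ \mathcal{D}\setminus\{d_i\}\right)
  \;-\; \mathcal{L}\!\left(y \mid q,\ \mathcal{D}\right)
  \;\approx\; -\,\left.\frac{\partial\, \mathcal{L}\!\left(y \mid q,\ \mathcal{D}\right)}{\partial w_i}\right|_{w_i = 1}.
\]
```

Under this reading, keeping the k documents with the largest \(\Delta_i\) serves as a first-order surrogate for \(\arg\min_{S \subseteq \mathcal{D},\, |S| = k} \mathcal{L}(y \mid q,\ S)\).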
Experiments on multi-source QA and open-domain generation demonstrate consistent gains in response quality, underscoring the importance of generation-aware retrieval strategies in multi-source settings.