Keywords: Vector Search, Approximate Nearest Neighbor, Retrieval-Augmented Generation
Abstract: Graph-based vector search underpins modern LLM applications such as retrieval-augmented generation (RAG), but its efficiency is increasingly constrained by disk I/O.
Existing systems continue searching long after they have found the top-ranked results, which are the most valuable to downstream applications.
We present Terminus, a rank-aware early termination mechanism that dynamically aligns I/O spending with application utility.
Terminus models per-I/O search utility using a rank-weighted function and terminates once recent I/Os yield negligible utility gains. By adaptively terminating search based on rank-aware signals, Terminus improves recovery of top-ranked results that matter most for downstream tasks, achieving a better performance–accuracy trade-off. It delivers up to 1.4× higher throughput at the same accuracy target compared to existing early termination schemes, and up to 3.2× higher throughput than a baseline without early termination, with minimal impact on RAG accuracy.
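The termination rule described above can be illustrated with a minimal sketch. This is not the Terminus implementation: the abstract does not specify the utility function, so the DCG-style weight `1 / log2(rank + 2)`, the `1 / (1 + distance)` quality term, the sliding `window`, the `epsilon` threshold, and all function names are assumptions chosen for illustration. Each inner list stands in for the candidate distances returned by one simulated disk I/O.

```python
import math
from collections import deque


def rank_weight(rank):
    # Hypothetical DCG-style weight: rank 0 contributes the most.
    return 1.0 / math.log2(rank + 2)


def rank_weighted_utility(top_distances):
    # Assumed utility: closer results at better ranks score higher.
    return sum(rank_weight(r) / (1.0 + d) for r, d in enumerate(top_distances))


def search_with_early_termination(io_batches, k=3, window=2, epsilon=1e-3):
    """Consume one batch of candidate distances per simulated I/O and stop
    once the total utility gain over the last `window` I/Os is below `epsilon`."""
    top = []                       # current top-k distances, ascending
    gains = deque(maxlen=window)   # utility gain of each recent I/O
    prev_utility = 0.0
    ios = 0
    for batch in io_batches:
        ios += 1
        top = sorted(top + list(batch))[:k]
        utility = rank_weighted_utility(top)
        gains.append(utility - prev_utility)
        prev_utility = utility
        # Terminate only after a full window of negligible gains.
        if len(gains) == window and sum(gains) < epsilon:
            break
    return top, ios


# Usage: the top-3 set stops changing after the third I/O, so the search
# halts after the fifth I/O and never issues the sixth.
batches = [[0.9, 0.5], [0.1, 0.2], [0.3], [0.8], [0.95], [0.05]]
results, ios_used = search_with_early_termination(batches)
print(results, ios_used)  # [0.1, 0.2, 0.3] 5
```

In this toy run the skipped sixth batch contains a closer candidate (0.05), which shows the trade-off any early termination scheme makes: I/O is saved at the risk of missing a late improvement, so the window and threshold would need tuning against the accuracy target.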
Topics: Agentic Systems: Systems optimizations for agentic AI applications
Submission Number: 9