Abstract: Approximate Nearest Neighbor Search (ANNS) is widely used in various fields, including database systems, recommendation engines, and large language models. As data dimensionality and size continue to grow, many studies explore GPU acceleration for graph-based ANNS. Previous methods use large batches to maximize throughput, but this often increases latency. In contrast, small batches are better suited to online low-latency applications because they minimize batch accumulation time. However, employing small batches on GPUs presents challenges. First, the query bubble issue in batch processing degrades both latency and GPU utilization. Second, current GPU search methods incur excessive sorting overhead and introduce additional TopK-merging overhead on the GPU. To address these challenges, we propose ALGAS, a low-latency GPU search system designed for small batches. ALGAS employs dynamic batching based on a persistent GPU kernel to mitigate query bubbles. Additionally, it employs beam extend to reduce sorting overhead, which is especially effective at high recall rates. It also eliminates TopK-merging overhead via GPU-CPU cooperation. Furthermore, it employs an adaptive GPU tuning scheme to optimize resource utilization. We compare ALGAS with state-of-the-art graph-based methods. ALGAS reduces latency by up to 21.9%-35.4% and increases throughput by up to 27.8%-55.2% across various real-world datasets.
External IDs:dblp:conf/ipps/ChenCYZW025