Abstract: Approximate Nearest Neighbor Search (ANNS) is a fundamental technique in modern intelligent applications, including recommendation systems and vector databases. With the advent of large language models (LLMs), ANNS plays a critical role in attention pruning mechanisms that exploit the sparsity of attention, such as top-K attention and retrieval attention. As a result, the efficiency of ANNS has become increasingly crucial. In this paper, we identify a key inefficiency in state-of-the-art ANNS methods based on product quantization: the redundant computation and accumulation of pairwise distances with the codebook. To address this, we propose JUNO++, a system that consists of i) an end-to-end ANNS search pipeline built on ray-tracing cores and a sparsity-aware algorithm, and ii) an integration of this ray-tracing-based ANNS pipeline into attention computation. For the ANNS search pipeline, evaluation on four datasets shows a 2.2x to 8.5x improvement in search throughput. For ANNS-powered sparse attention, JUNO++ reduces the latency of the q × k⊤ computation by 46% compared to the baseline with nearly identical accuracy; since this computation is not only a key component of retrieval-based sparse attention but also the dominant cost in long-context scenarios, this implies a considerable end-to-end improvement.
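To make the identified inefficiency concrete, the following is a minimal NumPy sketch of standard product-quantization asymmetric distance computation: a per-query lookup table against the codebook, followed by per-point accumulation of table entries. This is the generic PQ pattern whose accumulation step the abstract flags as redundant; all names, sizes, and data here are illustrative assumptions, not JUNO++'s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, dsub = 4, 16, 8      # subspaces, centroids per subspace, sub-dimension
D = M * dsub               # full vector dimension
N = 100                    # database size (toy)

codebook = rng.normal(size=(M, K, dsub))   # per-subspace centroids
codes = rng.integers(0, K, size=(N, M))    # quantized database vectors
query = rng.normal(size=D)

# Step 1: precompute a distance lookup table, query sub-vector vs. every
# centroid in every subspace -- O(M * K * dsub) work, done once per query.
q_sub = query.reshape(M, dsub)
table = ((q_sub[:, None, :] - codebook) ** 2).sum(axis=-1)   # shape (M, K)

# Step 2: approximate each database distance by accumulating M table
# lookups per point -- the pairwise distance accumulation that the paper
# identifies as a redundancy hotspot across many points.
approx_dist = table[np.arange(M), codes].sum(axis=1)         # shape (N,)

top5 = np.argsort(approx_dist)[:5]   # approximate nearest neighbors
```

A usage note: because many database points share identical sub-codes, Step 2 repeatedly re-reads and re-sums the same table entries, which is the kind of redundant accumulation a sparsity-aware pipeline can avoid.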
External IDs: doi:10.1145/3768585