JUNO++: Optimizing ANNS and Enabling Efficient Sparse Attention in LLM via Ray Tracing Core

Zihan Liu, Wentao Ni, Jingwen Leng, Yu Feng, Cong Guo, Quan Chen, Chao Li, Minyi Guo, Yufei Ma, Feng Zhang, Yun Liang

Published: 18 Sept 2025, Last Modified: 29 Nov 2025. ACM Transactions on Architecture and Code Optimization. License: CC BY-SA 4.0
Abstract: Approximate Nearest Neighbor Search (ANNS) is a fundamental technique in modern intelligent applications, including recommendation systems and vector databases. With the advent of large language models (LLMs), ANNS plays a critical role in enabling attention pruning mechanisms that exploit the sparsity of attention, such as top-K attention and retrieval attention. As a result, the efficiency of ANNS has become increasingly crucial. In this paper, we identify a key inefficiency in state-of-the-art ANNS methods based on product quantization: the redundant computation and accumulation of pairwise distances against the codebook. To address this, we propose JUNO++, a system consisting of i) an end-to-end ANNS search pipeline that runs on ray-tracing cores and leverages a sparsity-aware algorithm, and ii) an integration of this ray-tracing-based ANNS pipeline into the attention computation. For the ANNS search pipeline, evaluation on four datasets indicates a 2.2x to 8.5x improvement in search throughput. For ANNS-powered sparse attention, JUNO++ achieves a 46% reduction in the latency of the q × k⊤ calculation compared to the baseline, with almost identical accuracy. This calculation is not only a key component of retrieval-based sparse attention but also the dominant cost in long-context scenarios, implying a considerable end-to-end improvement.
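As background for the inefficiency the abstract names, the following is a minimal NumPy sketch of conventional product-quantization distance computation (asymmetric distance computation): each query builds per-subspace lookup tables of distances to the codebook centroids, then accumulates pairwise distances for every database vector. All sizes and names here are illustrative, not taken from the paper.

```python
import numpy as np

# Illustrative PQ parameters (not from the paper):
# M subspaces, K codebook entries per subspace, D_SUB dims per subspace.
M, K, D_SUB = 4, 256, 8
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, K, D_SUB))  # K centroids per subspace
codes = rng.integers(0, K, size=(1000, M))      # PQ codes for 1000 vectors

def pq_distances(query):
    """Asymmetric distance computation: build M lookup tables of squared
    query-to-centroid distances, then gather and accumulate them per
    database vector -- the accumulation step the paper targets."""
    q = query.reshape(M, D_SUB)
    # (M, K) table: squared distance from each query sub-vector to each centroid
    tables = ((codebooks - q[:, None, :]) ** 2).sum(axis=-1)
    # For every vector, sum its M table entries selected by its PQ codes
    return tables[np.arange(M), codes].sum(axis=1)

dists = pq_distances(rng.standard_normal(M * D_SUB))
nearest = int(np.argmin(dists))  # index of the (approximate) nearest vector
```

The per-vector gather-and-sum over the tables is where the redundant accumulation arises at scale; JUNO++ instead maps the search onto ray-tracing cores, per the abstract.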
External IDs:doi:10.1145/3768585