Q-VESA: Accelerating Quantization-Aware Vector Search for Fast Retrieval in Prompt Engineering

Seongjoon Cho, Junyoung Park, Donghyun Kang, Moohyeon Nam, Hongchan Roh, Moo-Kyoung Chung, Se-Hyun Yang, Seungkyu Choi

Published: 2026, Last Modified: 28 Feb 2026, IEEE Trans. Computers 2026, CC BY-SA 4.0

Abstract: Similarity search has drawn significant attention due to the growing demand for AI prompt engineering, which leverages retrieval-augmented generation (RAG) systems to maximize the efficiency of large language models (LLMs). Improving the performance of memory-intensive data retrieval using approximate nearest neighbor search (ANNS) algorithms has become increasingly crucial for delivering high-quality generative AI services. In this work, we propose Q-VESA, a software-hardware collaborative solution designed to accelerate low-precision graph-based vector search while preserving recall rates comparable to those of high-precision search. We perform a comprehensive analysis of hierarchical navigable small-world (HNSW) graphs, one of the most promising graph-based ANNS methods, using recent datasets tailored for RAG systems. Unlike standard vector database datasets, vector data used in LLMs present unique challenges for low-precision search. To address these challenges, we introduce software-oriented precision partitioning techniques that enable mixed-precision computations during graph traversal without compromising hardware performance. The Q-VESA architecture is developed with two key innovations: a database restructuring scheme with vector data segments, and a dedicated accelerator design to maximize throughput in distance computations. Experimental results demonstrate that Q-VESA on a CPU system achieves queries-per-second (QPS) speedups of up to 1.81$\times$ in the SIMD execution mode. Furthermore, with minimal area overhead, the ASIC implementation delivers additional speedups of up to 58.3$\times$, 16.5$\times$, and 1.5$\times$ compared to the CPU, the GPU, and the state-of-the-art ASIC-based accelerator, respectively.
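As a rough illustration of why low-precision distance computation can retain recall, the following minimal sketch ranks candidates with 8-bit quantized vectors and then re-ranks a shortlist at full precision. It is a generic two-stage search over a flat list, not the Q-VESA precision partitioning scheme and not an HNSW traversal; the function names, the shared symmetric scale, and the shortlist size are all assumptions made for this example.

```python
import random

def quantize(vec, scale):
    """Map floats to signed 8-bit integers using a shared symmetric scale (assumed scheme)."""
    return [max(-127, min(127, round(x / scale))) for x in vec]

def sq_dist(a, b):
    """Squared Euclidean distance; works for integer or float vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_stage_search(query, database, k=3, shortlist=10):
    """Hypothetical helper: coarse int8 ranking followed by full-precision re-ranking."""
    # Shared scale derived from the largest magnitude seen (an assumption of this sketch).
    max_abs = max(abs(x) for v in database + [query] for x in v) or 1.0
    scale = max_abs / 127.0
    q_query = quantize(query, scale)
    q_db = [quantize(v, scale) for v in database]
    # Stage 1: cheap low-precision distances select a shortlist of candidates.
    coarse = sorted(range(len(database)),
                    key=lambda i: sq_dist(q_query, q_db[i]))[:shortlist]
    # Stage 2: only the shortlist is re-ranked with the original float vectors.
    return sorted(coarse, key=lambda i: sq_dist(query, database[i]))[:k]

random.seed(0)
db = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(200)]
q = [random.uniform(-1, 1) for _ in range(16)]

approx = two_stage_search(q, db)
exact = sorted(range(len(db)), key=lambda i: sq_dist(q, db[i]))[:3]
# Fraction of the exact top-3 that the two-stage search recovered.
recall = len(set(approx) & set(exact)) / len(exact)
print(approx, exact, recall)
```

The shortlist size trades work for recall: a larger shortlist means more full-precision distance computations but fewer true neighbors lost to quantization error.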

External IDs: dblp:journals/tc/ChoPKNRCYC26