Abstract: Generative document retrieval (GDR) uses pre-trained Transformer-based large language models (LLMs) to extract contextual information and directly predict document identifier token sequences, outperforming traditional document retrieval methods. However, LLMs incur significant computational costs, hindering GDR’s practical application and making inference acceleration essential. Early exiting is a conditional computing technique that expedites LLM inference, but it faces challenges when integrated into GDR: because GDR’s identifiers are semantically and hierarchically structured, errors from premature exits are amplified. Moreover, although beam search expands the search space, the hierarchical structure of document identifiers restricts the diversity of initial tokens, leading to inefficiencies. In this work, we introduce Bi-Level Early Exiting for Generative Document Retrieval (BiLEE), comprising Layer Level Early Exiting (LLEE) and Token Level Early Exiting (TLEE). LLEE is designed for hierarchical document identifiers and dynamically exits from an intermediate layer of the Transformer computation based on a data-driven calibrated token threshold. TLEE exits from unpromising candidate sequences, thus discarding unpromising search beams and enhancing beam search efficiency. Both components dynamically balance the speed-accuracy trade-off for different token positions, doubling GDR’s inference speed and achieving a 13× reduction in FLOPs while maintaining the same level of accuracy. Source code: https://github.com/Rui-Fang/BiLEE.
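The two exit levels described in the abstract can be illustrated with a minimal sketch. This is not the BiLEE implementation: the function names (`layer_level_exit`, `prune_beams`), the softmax-confidence exit rule, the log-probability margin used for pruning, and all numeric values are assumptions chosen only to make the idea concrete. The abstract states only that LLEE uses a calibrated token threshold and that TLEE discards unpromising beams.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer_level_exit(
    layer_logits: Sequence[Sequence[float]], threshold: float
) -> Tuple[int, int]:
    """Hypothetical layer-level exit rule (an assumption, not the paper's).

    layer_logits[i] holds the vocabulary logits predicted from layer i + 1.
    Returns (token, layers_used): the first layer whose top probability
    reaches `threshold` wins; otherwise the final layer's prediction is used.
    """
    if not layer_logits:
        raise ValueError("need at least one layer")
    token, depth = 0, 0
    for depth, logits in enumerate(layer_logits, start=1):
        probs = softmax(logits)
        token = max(range(len(probs)), key=probs.__getitem__)
        if probs[token] >= threshold:
            break  # confident enough: skip the remaining layers
    return token, depth


def prune_beams(
    beams: Sequence[Tuple[List[int], float]], margin: float
) -> List[Tuple[List[int], float]]:
    """Hypothetical token-level exit rule (an assumption, not the paper's).

    Each beam is (identifier tokens so far, cumulative log-probability).
    Beams scoring more than `margin` below the best beam are dropped.
    """
    best = max(score for _, score in beams)
    return [(seq, score) for seq, score in beams if score >= best - margin]


# Illustrative inputs only: three layers, three-token vocabulary.
logits_per_layer = [[0.1, 0.2, 0.0], [0.0, 3.0, 0.0], [0.0, 5.0, 0.0]]
strict = layer_level_exit(logits_per_layer, threshold=0.95)
loose = layer_level_exit(logits_per_layer, threshold=0.90)
kept = prune_beams([([1, 2], -0.1), ([1, 3], -0.5), ([4, 0], -3.0)], margin=1.0)
```

In this sketch a higher threshold makes the decoder run more layers before committing to a token. Using a separate threshold per token position is one plausible way to express the position-dependent speed-accuracy trade-off the abstract mentions, since an early wrong token in a hierarchical identifier invalidates everything decoded after it.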