Keywords: Parallel Encoding; Efficient Inference; Long Context
Abstract: Many adaptive language model applications, such as retrieval-augmented generation (RAG) and in-context learning (ICL), require efficiently combining multiple external contexts to generate a response. In this work, we explore the potential of parallel encoding to speed up generation and extend the context window by pre-caching the KV states of each context separately, so they can be loaded directly and their positions reused during inference. However, applying parallel encoding directly degrades performance because its attention-weight distribution is misaligned with that of sequential encoding. To address this challenge, we propose APE, which introduces a shared prefix, an additional scaling factor, and a lower attention temperature to align the two distributions of attention weights. Extensive experiments show that APE improves performance by 7.8% over standard parallel encoding and by 2.9% over sequential encoding on long contexts, while maintaining 93% accuracy in few-shot learning. For the efficiency evaluation, APE achieves a 976$\times$ speedup for 512K-token context-augmented generation with a 256-token response.
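To make the mechanism concrete, below is a minimal sketch of one attention step over a shared-prefix cache plus independently pre-cached contexts, combining the three ingredients the abstract names: a shared prefix attended with standard logits, a lower softmax temperature on context tokens, and an extra scaling factor folded in as a log-space offset before joint renormalization. The function name, tensor layout, and default values (temperature=0.2, scale=0.9) are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch

def ape_attention(q, k_prefix, v_prefix, ctx_kvs, temperature=0.2, scale=0.9):
    """One attention step for a single query vector.

    q: (d,) query; k_prefix, v_prefix: (n_p, d) shared-prefix KV cache;
    ctx_kvs: list of (K_i, V_i) pairs, each (n_i, d), encoded separately so
    their KV states can be precomputed once and their positions reused.
    """
    d = q.shape[-1]
    # Shared prefix uses standard scaled dot-product logits.
    logits = [(k_prefix @ q) / math.sqrt(d)]
    values = [v_prefix]
    for K, V in ctx_kvs:
        # Lower temperature sharpens attention within each context; the
        # scaling factor enters the softmax as an additive log-space offset
        # (scale * exp(x) == exp(x + log(scale))).
        logits.append((K @ q) / (math.sqrt(d) * temperature) + math.log(scale))
        values.append(V)
    # Joint renormalization over prefix + all contexts approximates the
    # attention-weight distribution of sequential encoding.
    w = torch.softmax(torch.cat(logits), dim=0)
    return w @ torch.cat(values)
```

Since each `(K_i, V_i)` pair is encoded independently, the caches can be built offline and loaded on demand at inference time, which is the source of the reported speedup over re-encoding the full concatenated context.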
Submission Number: 83