Large Language Model Inference with Lexical Shortlisting

Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch

Published: 2023, Last Modified: 07 Dec 2023, CoRR 2023, Readers: Everyone

Abstract: Large language model (LLM) inference is computationally and memory intensive, so we adapt lexical shortlisting to it in the hope of improving both. While lexical shortlisting is well explored in tasks like machine translation, it requires modifications before it is suitable for LLMs, as the intended applications vary significantly. Our work studies two heuristics for shortlisting a sub-vocabulary at LLM inference time: Unicode-based script filtering and corpus-based selection. We explore different LLM families and sizes, and we find that lexical shortlisting can reduce the memory usage of some models by nearly 50% and has an upper bound of 25% improvement in generation speed. In this pilot study, we also identify the drawbacks of such vocabulary selection methods and propose avenues for future research.
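
To make the two heuristics named in the abstract concrete, here is a minimal sketch on a toy vocabulary. It is not the authors' implementation. All function names (`token_scripts`, `script_shortlist`, `corpus_shortlist`, `slice_rows`) are hypothetical, and the script of a character is approximated by the first word of its Unicode character name, because the Python standard library does not expose the Unicode Script property directly. The last helper assumes that the saving comes from keeping only the shortlisted rows of a vocabulary-sized matrix.

```python
import unicodedata
from collections import Counter


def token_scripts(token):
    """Approximate the set of scripts used by the letters of a token.

    Assumption: the first word of a character's Unicode name
    (e.g. LATIN, CYRILLIC, CJK) stands in for its script.
    """
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def script_shortlist(vocab, allowed_scripts):
    """Unicode-based filtering: keep ids of tokens whose letters all
    belong to the allowed scripts. Tokens without letters (digits,
    punctuation) have an empty script set and are always kept."""
    return [i for i, tok in enumerate(vocab)
            if token_scripts(tok) <= allowed_scripts]


def corpus_shortlist(vocab, tokenized_corpus, top_k):
    """Corpus-based selection: keep ids of the top_k most frequent
    tokens observed in a representative tokenized corpus."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    keep = [token_to_id[tok] for tok, _ in counts.most_common(top_k)
            if tok in token_to_id]
    return sorted(keep)


def slice_rows(matrix, keep_ids):
    """Keep only the shortlisted rows of a vocabulary-sized matrix
    (here a list of lists standing in for an output embedding)."""
    return [matrix[i] for i in keep_ids]


# Toy usage with a mixed-script vocabulary.
vocab = ["the", "cat", "привет", "мир", "中", "42", "!", "dog"]
latin_ids = script_shortlist(vocab, {"LATIN"})
corpus = [["the", "cat", "!"], ["the", "cat", "dog"], ["the"]]
frequent_ids = corpus_shortlist(vocab, corpus, top_k=2)
```

With a shortlist in hand, generation would score only the retained rows and map the chosen index back through the list of kept ids. Tokens outside the shortlist can never be produced, which is one source of the drawbacks the abstract mentions.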
