TL;DR: We explore the feasibility of using lexical shortlisting to improve inference speed in large language models without impacting quality.
Abstract: Deploying large language models (LLMs) is often challenging because of their intensive computational and memory requirements. We investigate lexical shortlisting as a way to improve efficiency and deployment readiness. While lexical shortlisting has proven effective in tasks such as machine translation, tailoring it to LLMs requires modifications to account for the diversity of their applications. We study two heuristics for shortlisting the sub-vocabulary at LLM inference time: Unicode-based script filtering and corpus-based selection. Experimenting with several LLM families and sizes, we observe that lexical shortlisting can reduce the memory usage of some models by nearly 50% and yields an upper bound of 25% improvement in generation speed. This preliminary study delineates the strengths of vocabulary selection, acknowledges the limitations of these methods, and proposes avenues for future refinement.
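The abstract names the two shortlisting heuristics without detailing them; the following is a minimal illustrative sketch in Python of what each heuristic could look like, not the paper's released code. The function names, the 0.99 coverage threshold, and the use of unicodedata.name() as a script proxy are all assumptions made here for illustration.

    import unicodedata
    from collections import Counter


    def is_script_neutral_or_latin(ch):
        # Digits, punctuation, symbols, and whitespace are script-neutral.
        if not ch.isalpha():
            return True
        # The standard library has no direct script lookup; the character
        # name is a workable proxy ("LATIN SMALL LETTER A", "CYRILLIC ...").
        try:
            return "LATIN" in unicodedata.name(ch)
        except ValueError:
            return False  # unnamed character: exclude conservatively


    def unicode_script_shortlist(vocab):
        """Heuristic 1 (Unicode-based script filtering): keep token IDs
        whose characters are Latin-script or script-neutral, dropping
        e.g. Cyrillic or CJK entries when only Latin-script output is
        expected."""
        return {
            i for i, token in enumerate(vocab)
            if all(is_script_neutral_or_latin(ch) for ch in token)
        }


    def corpus_based_shortlist(token_id_stream, coverage=0.99):
        """Heuristic 2 (corpus-based selection): keep the smallest set of
        token IDs covering a target fraction of a sample corpus."""
        counts = Counter(token_id_stream)
        total = sum(counts.values())
        shortlist, covered = set(), 0
        for token_id, freq in counts.most_common():
            shortlist.add(token_id)
            covered += freq
            if covered / total >= coverage:
                break
        return shortlist

At inference time, a shortlist like this would index a reduced output-embedding matrix (e.g. slicing the final projection to the selected rows), which is plausibly where the near-50% memory savings for large-vocabulary models and the generation-speed gains reported in the abstract come from.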
Paper Type: short
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches for low-compute settings/efficiency, Publicly available software and/or pre-trained models
Languages Studied: English, Bulgarian, Spanish, Chinese