Next-Word Prediction: A Perspective of Energy-Aware Distributed Inference

Published: 01 Jan 2024 · Last Modified: 16 May 2025 · IEEE Trans. Mob. Comput. 2024 · CC BY-SA 4.0
Abstract: The pursuit of high-quality AI-generated content (AIGC) with fast response times has driven the evolution of natural language processing (NLP) services, notably those enabled at the edge (i.e., edge NLP). For concreteness, we study distributed inference for next-word prediction, a prevalent edge NLP service for mobile keyboards on user devices. Accordingly, we optimize coupled metrics: maximizing the prediction click-through rate (CTR) for improved quality of service (QoS), minimizing user impatience for enhanced quality of experience (QoE), and keeping energy consumption within budget for sustainability. Moreover, we consider the real-world setting in which the prediction accuracies of heterogeneous NLP models are not known a priori. By integrating online learning and online control, we propose DONUT, a novel distributed inference algorithm for online next-word prediction with user impatience, which estimates the models' prediction accuracies and balances the trade-offs among the coupled metrics. Our theoretical analysis shows that DONUT achieves sub-linear regret (loss of CTR), ensures bounded user impatience, and keeps energy consumption within budget. Through numerical simulations, we not only establish DONUT's superior performance over baseline methods but also demonstrate its adaptability to various settings.
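To make the "online learning plus online control" idea concrete, the sketch below combines UCB-style accuracy estimation with a Lyapunov drift-plus-penalty rule for model selection under an average energy budget. This is a minimal illustrative toy, not the DONUT algorithm from the paper: all names (`donut_style_sketch`), parameters (`v`, `budget_per_step`), and the simulated Bernoulli click model are assumptions for illustration.

```python
import math
import random

def donut_style_sketch(accuracies, energies, budget_per_step,
                       horizon=5000, v=10.0, seed=0):
    """Toy bandit-style model selection under an average energy budget.

    Each round, pick the model maximizing a drift-plus-penalty score
    v * UCB(accuracy) - q * energy, where q is a virtual queue that
    grows whenever energy spending exceeds the per-step budget.
    True accuracies are unknown and estimated from observed clicks.
    """
    rng = random.Random(seed)
    n = len(accuracies)
    counts = [0] * n      # times each model was selected
    means = [0.0] * n     # empirical click (accuracy) estimates
    q = 0.0               # virtual energy-debt queue
    clicks = 0
    spent = 0.0
    for t in range(1, horizon + 1):
        # Optimistic accuracy estimates (UCB1-style exploration bonus).
        ucb = [means[i] + math.sqrt(2 * math.log(t) / counts[i])
               if counts[i] > 0 else float('inf') for i in range(n)]
        # Trade off estimated CTR against accumulated energy debt.
        i = max(range(n), key=lambda k: v * min(ucb[k], 1.0) - q * energies[k])
        # Simulate a click as a Bernoulli draw with the true accuracy.
        click = 1 if rng.random() < accuracies[i] else 0
        counts[i] += 1
        means[i] += (click - means[i]) / counts[i]
        clicks += click
        spent += energies[i]
        # Queue update: debt grows when spending exceeds the budget.
        q = max(q + energies[i] - budget_per_step, 0.0)
    return clicks / horizon, spent / horizon
```

Raising `v` weights CTR more heavily relative to energy debt, mirroring the quality-versus-sustainability trade-off described in the abstract; the virtual queue keeps average energy use near the budget without requiring prior knowledge of the models' accuracies.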