Abstract: Transformers have come to dominate natural language processing, owing to their capability to handle sequential input data. A surge of work on computational and networking optimizations aims to improve the training efficiency of Transformers. However, Transformer inference, a cornerstone of myriad AI services, remains relatively underexplored. Faced with variable-length inputs, conventional methods adopt padding schemes, resulting in computational waste. Moreover, work on Transformer inference often overlooks the integration of request scheduling and batching, both of which play pivotal roles in inference systems. To address these challenges, we introduce TCB, a comprehensive Transformer inference system that integrates a ConcatBatching scheme to reduce computational redundancy by concatenating requests. In addition, we present an online request batching algorithm designed to increase the throughput of scheduled requests. Considering the multi-server case, we further introduce a joint request assignment and batch scheduling policy to fully utilize server resources while ensuring the quality of service of inference. Extensive experiments demonstrate that our proposed methods significantly outperform existing works.
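The padding waste that motivates ConcatBatching can be illustrated with a small sketch. This is a hypothetical example, not the paper's implementation: it only counts the tokens processed under padded batching versus a concatenation-style scheme for a few variable-length requests.

```python
# Hypothetical illustration of the padding-waste problem (not the paper's code).
lengths = [12, 3, 7, 5]  # token counts of four variable-length inference requests

# Conventional padded batching: every request is padded up to the longest one,
# so the batch processes len(batch) * max(length) token slots.
padded_tokens = len(lengths) * max(lengths)

# A ConcatBatching-style scheme concatenates requests back to back,
# so only the real tokens are processed.
concat_tokens = sum(lengths)

wasted = padded_tokens - concat_tokens
print(padded_tokens, concat_tokens, wasted)  # 48 slots vs 27 real tokens, 21 wasted
```

The gap grows with the variance of request lengths, which is why integrating batching with request scheduling (choosing which requests to group) matters for throughput.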
External IDs: dblp:journals/tsusc/FuCLZ25