MUX-PLMs: Data Multiplexing for High-throughput Language Models

Vishvak Murahari; Ameet Deshpande; Carlos E Jimenez; Izhak Shafran; Mingqiu Wang; Yuan Cao; Karthik R Narasimhan

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Findings

Submission Type: Regular Long Paper

Submission Track: Machine Learning for NLP

Submission Track 2: Efficient Methods for NLP

Keywords: Efficient Inference, Multi-input Multi-output architectures, Data Multiplexing

TL;DR: We develop MUX-PLMs, a class of high-throughput, high-performance language models trained with multi-input multi-output (MIMO) architectures.

Abstract: The widespread adoption of large language models such as ChatGPT and Bard has led to unprecedented demand for these technologies. The burgeoning cost of inference for ever-increasing model sizes, coupled with hardware shortages, has limited affordable access and creates a pressing need for efficiency approaches geared towards high throughput and performance. Multi-input multi-output (MIMO) algorithms such as data multiplexing offer a promising solution, with a many-fold increase in throughput from performing inference for multiple inputs at the cost of a single input. Yet these approaches are not currently performant enough to be deployed in modern systems. We change that by developing MUX-PLMs, a class of high-throughput pre-trained language models (PLMs) trained with data multiplexing that can be fine-tuned for any downstream task to yield high throughput and high performance. Our novel multiplexing and demultiplexing modules proficiently entangle and disentangle inputs, enabling high-performance, high-throughput MUX-PLMs that are competitive with vanilla PLMs while achieving 2x/5x inference speedup with only a 1-4% drop in performance on a broad suite of tasks.
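
To make the idea of "multiple inputs at the cost of a single input" concrete, the following is a minimal sketch of a toy multiplexer. It is not the paper's multiplexing module: the key-and-average scheme, the function names (`make_keys`, `multiplex`, `forward_passes`), and the fixed random keys are all illustrative assumptions. It also omits demultiplexing, which in practice would require a learned module to recover per-input outputs. The pass count is an idealized bound that ignores any overhead from the multiplexing and demultiplexing steps.

```python
import math
import random


def make_keys(n_slots, dim, seed=0):
    """Hypothetical per-slot keys: one fixed random vector per input slot."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_slots)]


def multiplex(inputs, keys):
    """Combine several equal-length vectors into one vector of the same size.

    Assumed scheme: scale each input elementwise by its slot key, then
    average across inputs. The model would then process this single vector.
    """
    n = len(inputs)
    dim = len(inputs[0])
    return [
        sum(keys[i][d] * inputs[i][d] for i in range(n)) / n
        for d in range(dim)
    ]


def forward_passes(num_inputs, n_slots):
    """Idealized number of model forward passes when n_slots inputs share one pass."""
    return math.ceil(num_inputs / n_slots)


if __name__ == "__main__":
    dim, n_slots = 4, 5
    keys = make_keys(n_slots, dim)
    batch = [[float(i + d) for d in range(dim)] for i in range(n_slots)]
    mixed = multiplex(batch, keys)
    print(len(mixed))                    # same width as a single input
    print(forward_passes(10, n_slots))   # passes needed for 10 inputs
```

Under these assumptions, ten inputs multiplexed five at a time need two forward passes instead of ten, which is where the throughput gain comes from; the accuracy cost depends on how well the inputs can be disentangled afterwards.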

Submission Number: 4398
