DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission
TL;DR: DynaMo accelerates inference using dynamic multi-token sampling from the predicted joint probability distribution.
Abstract: Traditional language models operate autoregressively, i.e., they predict one token at a time. The rapid growth in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models dynamically predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. One of the models in our suite, DynaMo-7.3B-T3, generates text of the same quality as the baseline (Pythia-6.9B) while achieving a 2.57x speed-up with only 5.87% and 2.67% parameter and training time overheads, respectively.
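
A minimal sketch of the dynamic multi-token idea described in the abstract: each prediction head proposes one future token, and later tokens are kept only while the head is confident enough. This is an illustrative assumption, not the paper's exact method; the function name, the fixed `threshold` value, and the use of the sampled token's probability as the confidence measure are hypothetical stand-ins for the paper's adaptive thresholding and joint-distribution handling.

```python
import torch

def dynamic_multi_token_step(head_logits, threshold=0.5):
    """Illustrative single decoding step for dynamic multi-token sampling.

    head_logits: list of 1-D logit tensors, one per prediction head;
        head_logits[i] scores the (i+1)-th future token.
    threshold: hypothetical confidence cutoff (the paper's adaptive
        thresholding would set this dynamically).
    Returns the list of token ids accepted in this step (at least one).
    """
    accepted = []
    for logits in head_logits:
        probs = torch.softmax(logits, dim=-1)
        token = torch.multinomial(probs, num_samples=1).item()
        confidence = probs[token].item()
        # The first head's token is always emitted; tokens from later heads
        # are kept only while the sampled token remains confident enough.
        if accepted and confidence < threshold:
            break
        accepted.append(token)
    return accepted

# Toy usage: three heads over a vocabulary of 10 tokens.
if __name__ == "__main__":
    torch.manual_seed(0)
    fake_heads = [torch.randn(10) * 3 for _ in range(3)]
    print(dynamic_multi_token_step(fake_heads, threshold=0.4))
```

When no later head clears the threshold, the step reduces to ordinary one-token autoregressive decoding, so the speed-up comes only from steps where the model is confident about several upcoming tokens.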
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English