Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models

Published: 21 Sept 2023, Last Modified: 16 Jan 2024, NeurIPS 2023 poster
Keywords: Systems for Machine Learning, Inference efficiency, Transformer models, Text generation APIs, Capability-efficiency tradeoffs
TL;DR: We propose a new metric and methodology to compare inference efficiency across different autoregressive Transformer-based model APIs.
Abstract: Large language models (LLMs) are highly capable but also computationally expensive. Characterizing the _fundamental tradeoff_ between inference efficiency and model capabilities is thus important, but requires an efficiency metric that is comparable across models from different providers. Unfortunately, raw runtimes measured through black-box APIs do not satisfy this property: model providers can implement software and hardware optimizations orthogonal to the model, and shared infrastructure introduces performance contention. We propose a new metric for inference efficiency called _idealized runtime_, which puts models on equal footing as though they were served on uniform hardware and software without performance contention, along with a cost model to efficiently estimate this metric for autoregressive Transformer models. We also propose variants of the idealized runtime that incorporate the number and type of accelerators needed to serve the model. Using these metrics, we compare ten LLMs developed in 2022 to provide the first analysis of inference efficiency-capability tradeoffs; among other observations, we find that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model. Our code is open sourced at https://github.com/stanford-crfm/helm-efficiency.
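To make the idea of estimating runtime from model properties concrete, here is a minimal sketch of a roofline-style estimate for a dense decoder-only Transformer. It is not the paper's actual cost model (see the linked repository for that); the function name `estimate_idealized_runtime`, the hardware constants (nominal A100-class FLOP/s and memory bandwidth), and the 50% efficiency factor are all illustrative assumptions. It assumes prompt processing is compute-bound and each decoded token is bound by reading the model weights from accelerator memory.

```python
# Illustrative sketch (not the paper's exact cost model): a roofline-style
# estimate of generation runtime for a dense decoder-only Transformer.
# Hardware numbers are nominal A100-80GB-class specs, used only as an example.

def estimate_idealized_runtime(
    num_params: float,               # model parameters (e.g., 70e9)
    prompt_tokens: int,              # tokens in the input prompt
    output_tokens: int,              # tokens to generate
    peak_flops: float = 312e12,      # assumed accelerator peak FLOP/s (fp16)
    mem_bandwidth: float = 2.0e12,   # assumed memory bandwidth, bytes/s
    bytes_per_param: int = 2,        # fp16 weights
    efficiency: float = 0.5,         # assumed achievable fraction of peak
) -> float:
    """Return an estimated generation runtime in seconds on idealized hardware."""
    # Prompt (prefill) pass: ~2 FLOPs per parameter per token, compute-bound.
    prefill_flops = 2.0 * num_params * prompt_tokens
    prefill_time = prefill_flops / (peak_flops * efficiency)

    # Decoding: each new token reads all weights once, memory-bandwidth-bound.
    per_token_bytes = num_params * bytes_per_param
    decode_time = output_tokens * per_token_bytes / (mem_bandwidth * efficiency)

    return prefill_time + decode_time


if __name__ == "__main__":
    # Example: a hypothetical 70B-parameter model, 512-token prompt, 64 output tokens.
    t = estimate_idealized_runtime(70e9, prompt_tokens=512, output_tokens=64)
    print(f"Estimated idealized runtime: {t:.2f} s")
```

Because such an estimate depends only on model size, prompt length, output length, and fixed reference hardware, it is independent of provider-specific serving optimizations and contention, which is what makes comparisons across black-box APIs meaningful.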
Submission Number: 5135