TL;DR: Applying scaling laws to train inference-efficient models
Abstract: Scaling laws are powerful tools for predicting the performance of large language models. However, current scaling laws fall short of accounting for inference costs. In this work, we first show that model architecture affects inference latency: models of the same size can differ in latency by up to $3.5\times$. To tackle this challenge, we modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. Because models with similar training loss can exhibit gaps in downstream evaluation, we also propose a novel method to train inference-efficient models based on the revised scaling laws. We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training 63 models in total. Guided by our inference-efficient scaling law and model selection method, we release the Morph-1B model, which improves inference latency by $1.8\times$ while maintaining accuracy on downstream tasks compared to open-source models, advancing the accuracy-latency Pareto frontier.
Notably, our experiments reveal that wider and shallower models can yield efficiency gains while preserving accuracy.
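The sketch below illustrates the kind of shape selection the abstract describes: predict loss with a Chinchilla-style law, then pick the depth/width/token budget that minimizes latency subject to a quality target. The loss constants, latency proxy, parameter-count rule, and search grid are all illustrative assumptions, not the paper's fitted values or exact formulation.

```python
# A minimal sketch (not the paper's exact formulation) of selecting a model
# shape under an inference-aware scaling law. The Chinchilla-style loss form
# L(N, D) = E + A / N**alpha + B / D**beta and the latency proxy below are
# illustrative assumptions; all constants are placeholders.
import itertools

# Placeholder Chinchilla-style constants (assumed, not the paper's fitted values).
E, A, B, ALPHA, BETA = 1.7, 400.0, 1800.0, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Chinchilla-style loss as a function of parameter count N and tokens D."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def latency_proxy(depth: int, width: int) -> float:
    """Toy latency model: deeper models pay a per-layer overhead that wide,
    shallow models of similar size avoid (illustrative only)."""
    per_layer_overhead = 1.0
    flops_term = depth * (width / 1024) ** 2
    return depth * per_layer_overhead + flops_term

def param_count(depth: int, width: int) -> float:
    """Rough transformer parameter count: ~12 * depth * width^2."""
    return 12 * depth * width**2

# Hypothetical search space over shapes and token budgets.
depths = [8, 12, 16, 24, 32]
widths = [1024, 1536, 2048, 2560]
token_budgets = [10e9, 20e9, 30e9]
target_loss = 2.6  # assumed accuracy (training-loss) constraint

best = None
for d, w, tok in itertools.product(depths, widths, token_budgets):
    n = param_count(d, w)
    if predicted_loss(n, tok) > target_loss:
        continue  # shape/budget does not meet the quality target
    lat = latency_proxy(d, w)
    if best is None or lat < best[0]:
        best = (lat, d, w, n, tok)

if best is not None:
    lat, d, w, n, tok = best
    print(f"depth={d} width={w} params={n/1e9:.2f}B tokens={tok/1e9:.0f}B "
          f"latency_proxy={lat:.1f}")
```

Under these toy assumptions the search tends to favor wider, shallower shapes, mirroring the abstract's observation that such models can cut latency without giving up accuracy.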
Lay Summary: Computers need large AI models to solve complex tasks, but these models are often slow and expensive to use. We asked: can we build models that are both fast to run and accurate? To answer this, we studied how different design choices—such as the depth or width of a model—affect its efficiency. We trained 63 models of varying sizes with varying amounts of training data to discover patterns that link size, training, and speed. These insights enabled us to build a new model, Morph-1B, which is up to 1.8 times faster at making predictions while still performing well on real-world benchmarks. This shows that smarter model designs can help us get the best of both worlds: accuracy and speed. Our work helps AI developers create more sustainable and accessible systems without sacrificing performance.
Link To Code: https://github.com/Waterpine/open-lm-morph
Primary Area: Deep Learning->Large Language Models
Keywords: Scaling Laws, Inference, Language Models
Submission Number: 5235