NanoLM: An Affordable LLM Study Benchmark via Accurate Loss Prediction Across Scales

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Model, Scaling Law, Hyperparameter Transfer, Hyperparameter Tuning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: High computational cost, data collection, and the difficulty of distributed training are three significant barriers to pre-training large language models (LLMs) for many researchers. In this paper, we address the question: "Under constrained computational resources, what model design (e.g., model size, model architecture) should I train to achieve the best possible performance?" To answer this question, building on scaling laws for LLMs, we introduce nanoLM: an affordable LLM study benchmark via accurate loss prediction across scales. This benchmark unlocks a new LLM study paradigm that does not require direct training at the target scale. Within the loss basin area, training loss and model size can be accurately fitted by a power law, which allows us to extrapolate language models from small to large scale. For example, with just 13.1% and 14.2% of the total pretraining cost, we can accurately forecast the loss for models sized 26B and 52B, respectively. To ensure compatibility with mainstream Transformer architectures, nanoLM supports decoder-only structures (e.g., GPT), encoder-only structures (e.g., BERT), and encoder-decoder structures (e.g., T5). Since excessive model parameters might lead to GPU memory overflow, nanoLM also supports data parallelism strategies. Our goal with nanoLM is to empower researchers to make cheap and meaningful comparisons of varying model designs at large scales. We also aspire for our benchmark to serve as a bridge between the academic community and industry.
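The core idea of extrapolating loss from small to large scale via a power-law fit can be sketched as follows. This is a minimal illustration, not the paper's implementation: the constants and model sizes below are purely synthetic, chosen only to show the fit-then-extrapolate workflow.

```python
import numpy as np

# Hypothetical losses for a family of small models, generated from an
# assumed power law L(N) = a * N^(-b). The constants a_true and b_true
# are illustrative placeholders, not values from the paper.
a_true, b_true = 20.0, 0.08
small_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])   # parameter counts
small_losses = a_true * small_sizes ** (-b_true)

# A power law is a straight line in log-log space:
#   log L = log a - b * log N
# so a least-squares line fit recovers the exponent and prefactor.
slope, log_a_fit = np.polyfit(np.log(small_sizes), np.log(small_losses), 1)
a_fit = np.exp(log_a_fit)

# Extrapolate to a large model (here 26B parameters) without training it.
predicted_loss = a_fit * (26e9) ** slope
```

In practice the fit would use measured training losses from small-scale runs (with hyperparameters transferred across widths), and the predicted loss then lets different model designs be compared at the target scale for a fraction of the full pretraining cost.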
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4735