The Promises and Pitfalls of Language Models for Structured Numerical Data

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: language models, tokenization, transformers, inductive biases, quantum chemistry
Abstract: Autoregressive language models are increasingly capable of processing non-text data, such as images or audio. Are language models also a natural choice for numerical data, such as the 3D structure of molecules? In this work, we use quantum chemistry simulations as a case study in the challenges of applying language models to numerical data, building up a set of simple subproblems that can shed light on key design decisions. We show that language models lag behind domain-specific models on prediction tasks and provide evidence for and against different hypotheses that explain their failure. Many commonly identified pitfalls, such as difficulty with arithmetic operations and the choice of discrete vocabulary, fall short of explaining the behavior. In contrast, we show that capturing invariance properties exhibits a strong correlation with predictive performance. Finally, we compare language models trained from scratch on numerical data with models pretrained on text. We show that text pretraining often provides a surprisingly limited advantage on prediction tasks, and can even hurt performance, despite prior work suggesting that text pretraining offers benefits.
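To make the setup concrete, the sketch below shows one plausible way to serialize 3D molecular coordinates into digit-level tokens for an autoregressive model, alongside a rotation-invariant target that is unchanged by a rotation even though the token sequence is not. The function names, separator token, and tokenization scheme are illustrative assumptions for exposition only, not the paper's actual pipeline.

import numpy as np

# Hypothetical digit-level tokenizer: render each coordinate as a
# fixed-precision string and split it into sign/digit/point tokens.
def tokenize_coordinates(coords, decimals=3):
    """Turn an (N, 3) array of atomic positions into a flat token sequence."""
    tokens = []
    for x, y, z in coords:
        for value in (x, y, z):
            # e.g. -1.234 -> ['-', '1', '.', '2', '3', '4']
            tokens.extend(list(f"{value:.{decimals}f}"))
            tokens.append("<sep>")
    return tokens

# A rotation-invariant quantity (here: the sum of pairwise distances) stays
# the same when the molecule is rotated, but the token sequence above changes
# completely -- the kind of invariance mismatch the abstract refers to.
def invariant_property(coords):
    diffs = coords[:, None, :] - coords[None, :, :]
    return np.triu(np.linalg.norm(diffs, axis=-1)).sum()

coords = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])  # toy geometry
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
rotated = coords @ rot.T

print(invariant_property(coords), invariant_property(rotated))            # (nearly) equal
print(tokenize_coordinates(coords)[:8], tokenize_coordinates(rotated)[:8])  # differ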
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12424