xVal: A Continuous Numerical Tokenization for Scientific Language Models

TMLR Paper 3914 Authors

09 Jan 2025 (modified: 21 Mar 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Due in part to their discontinuous and discrete default encodings for numbers, Large Language Models (LLMs) have not yet been commonly used to process numerically dense scientific datasets. Rendering datasets as text, however, could help aggregate diverse and multi-modal scientific data into a single training corpus, thereby potentially facilitating the development of foundation models for science. In this work, we introduce xVal, a strategy for continuously tokenizing numbers within language models that results in a more appropriate inductive bias for scientific applications. By training specially modified language models from scratch on a variety of scientific datasets formatted as text, we find that xVal generally outperforms other common numerical tokenization strategies on metrics including out-of-distribution generalization and computational efficiency.
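The abstract describes xVal as a continuous tokenization of numbers, in contrast to the discrete digit- or sub-word-based encodings LLMs use by default. As a rough illustration of the general idea only, the sketch below assumes a scheme in which each numeric literal is replaced by a single placeholder token whose embedding is scaled multiplicatively by the (suitably normalized) value; the token name [NUM], the regular expression, and the class ContinuousNumberEmbedding are illustrative assumptions, not the authors' implementation.

```python
import re
import torch
import torch.nn as nn

NUM_TOKEN = "[NUM]"  # illustrative placeholder token name (assumption)

_NUMBER_RE = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")

def extract_numbers(text: str):
    """Replace each numeric literal with a placeholder token and return the
    cleaned text together with the extracted values, so the values can be
    injected continuously at embedding time rather than split into digit
    sub-tokens."""
    values = []
    def _repl(match):
        values.append(float(match.group()))
        return NUM_TOKEN
    return _NUMBER_RE.sub(_repl, text), values

class ContinuousNumberEmbedding(nn.Module):
    """Token embedding in which the placeholder token's vector is scaled
    multiplicatively by the associated numeric value, giving the model a
    continuous rather than discrete representation of numbers."""
    def __init__(self, vocab_size: int, d_model: int, num_token_id: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.num_token_id = num_token_id

    def forward(self, token_ids: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer token ids
        # values:    (batch, seq_len) floats; the numeric value at [NUM]
        #            positions and 1.0 everywhere else
        x = self.embed(token_ids)
        scale = torch.where(token_ids == self.num_token_id,
                            values, torch.ones_like(values))
        return x * scale.unsqueeze(-1)

# Example: the text is tokenized normally while the values ride alongside.
text, nums = extract_numbers("The redshift is 0.35 and the halo mass is 1.2e14.")
# text -> "The redshift is [NUM] and the halo mass is [NUM]."
# nums -> [0.35, 1.2e+14]
```

In practice one would expect the values to be rescaled to a bounded range before scaling the embedding, and a separate numerical output head would be needed to decode predicted values; those details are beyond this sketch and should be taken from the paper itself.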
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yoshitomo_Matsubara1
Submission Number: 3914