Keywords: Tokenization, Tokenizer, LLM, Patch Encoding, Scaling Behavior, Scientific Computing, Particle Physics
Abstract: Despite the revolutionary success of large language models (LLMs) in natural language processing and task agents, their application to real-world scientific problems, particularly in domains involving large-scale numerical data, remains challenging. This limitation stems primarily from the inefficiency of the Byte-Pair Encoding (BPE) method when handling numerical data. To address this gap, we propose a binary patch encoding method and integrate it into an LLM architecture named BigBang-Neutron, which can efficiently process mixed textual and large-scale numerical datasets. We demonstrate the efficacy of our method on Jet Origin Identification (JoI), a critical classification task in high-energy physics that distinguishes jets originating from different quarks or gluons and is among the most numerically intensive classification problems in particle physics. Experimental results show that BigBang-Neutron achieves performance comparable to state-of-the-art task-specific JoI models. Furthermore, we investigate the scaling behavior of BigBang-Neutron’s performance with increasing data volume; the results indicate its potential to serve as a foundation model for particle physics data analysis and suggest that the approach can be extended to a broad range of scientific computing applications. The project code will be made publicly available on GitHub upon acceptance of the manuscript.
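The abstract does not spell out how binary patch encoding works; the sketch below is only a minimal illustration of the general idea, assuming that numbers are serialized to their raw IEEE-754 bytes and grouped into fixed-size patches, each of which would then be mapped to a single embedding vector. The function name binary_patch_encode, the float32 serialization, and the 4-byte patch size are hypothetical choices for this sketch, not the paper's actual implementation.

import numpy as np

def binary_patch_encode(values, patch_size=4):
    """Encode a sequence of floats as fixed-size byte patches.

    Each float is serialized to its raw IEEE-754 float32 bytes, and the
    byte stream is grouped into patches of `patch_size` bytes. Each patch
    would be embedded as a single token, so an n-byte number costs
    n / patch_size tokens instead of the variable, value-dependent token
    count produced by BPE. (Illustrative assumption, not the paper's code.)
    """
    byte_stream = np.asarray(values, dtype=np.float32).tobytes()
    # Pad with zero bytes so the stream splits evenly into patches.
    pad = (-len(byte_stream)) % patch_size
    byte_stream += b"\x00" * pad
    return [byte_stream[i:i + patch_size]
            for i in range(0, len(byte_stream), patch_size)]

# Example: four detector readouts -> four 4-byte patches (one per float32).
print(binary_patch_encode([0.125, -3.75, 1e6, 2.5e-3]))

Under these assumptions, every number costs a fixed, predictable number of patches regardless of its magnitude or decimal representation, which is the efficiency argument the abstract makes against BPE tokenization of numerical data.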
Paper Type: Long
Research Area: Language Models
Research Area Keywords: pre-training, scaling, representation learning, word embeddings
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: Scientific data, English
Submission Number: 1821