Keywords: Tokenization, Tokenizer, LLM, Patch Encoding, Scaling Behavior, Scientific Computing, Particle Physics
Abstract: Despite the revolutionary success of large language models (LLMs) in natural language processing and task agents, their application to real-world scientific problems, particularly in domains involving large-scale numerical data, remains challenging. This limitation stems primarily from the inefficiency of the Byte-Pair Encoding (BPE) method when handling numerical data. To address this gap, we propose a binary patch encoding method and integrate it into an LLM architecture named BigBang-Neutron, which can efficiently process mixed textual and large-scale numerical datasets. We demonstrate the efficacy of our method on Jet Origin Identification (JoI), a critical classification task in high-energy physics that distinguishes jets originating from different quarks or gluons and is among the most numerically intensive classification problems in particle physics. Experimental results show that BigBang-Neutron achieves performance comparable to state-of-the-art task-specific JoI models. Furthermore, we investigate the scaling behavior of BigBang-Neutron’s performance with increasing data volume; the results indicate its potential to serve as a foundation model for particle physics data analysis and suggest that the approach can be extended to a broad range of scientific computing applications. The project code will be made publicly available on GitHub upon acceptance of the manuscript.
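The abstract does not spell out how binary patch encoding works; the sketch below is only a minimal illustration of the general idea, assuming that numbers are serialized to their raw IEEE-754 bytes and grouped into fixed-size patches, each of which would then be mapped to a single embedding vector. The function name binary_patch_encode, the float32 serialization, and the 4-byte patch size are hypothetical choices for this sketch, not the paper's actual implementation.

import numpy as np

def binary_patch_encode(values, patch_size=4):
    """Encode a sequence of floats as fixed-size byte patches.

    Each float is serialized to its raw IEEE-754 float32 bytes, and the
    byte stream is grouped into patches of `patch_size` bytes. Each patch
    would be embedded as a single token, so an n-byte number costs
    n / patch_size tokens instead of the variable, value-dependent token
    count produced by BPE. (Illustrative assumption, not the paper's code.)
    """
    byte_stream = np.asarray(values, dtype=np.float32).tobytes()
    # Pad with zero bytes so the stream splits evenly into patches.
    pad = (-len(byte_stream)) % patch_size
    byte_stream += b"\x00" * pad
    return [byte_stream[i:i + patch_size]
            for i in range(0, len(byte_stream), patch_size)]

# Example: four detector readouts -> four 4-byte patches (one per float32).
print(binary_patch_encode([0.125, -3.75, 1e6, 2.5e-3]))

Under these assumptions, every number costs a fixed, predictable number of patches regardless of its magnitude or decimal representation, which is the efficiency argument the abstract makes against BPE tokenization of numerical data.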
Paper Type: Long
Research Area: Language Models
Research Area Keywords: pre-training, scaling, representation learning, word embeddings
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: Scientific data, English
Submission Number: 1821