From Noise to Signal: Enabling Foundation-Model Pretraining on Noisy, Real-World Corpora via Quality-Aware Tokenization
Keywords: Tokenization, Adaptive Learning, Reinforcement Learning, Hyperparameter Optimization, Genomics, Quantitative Finance, Natural Language Processing, Foundation Models
TL;DR: QA-Token enables foundation-model pretraining on noisy corpora by learning quality-aware vocabularies.
Abstract: Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present *QA-Token (Quality-Aware Tokenization)*, which incorporates data reliability directly into vocabulary construction. Our framework makes three technical contributions: (i) a bilevel formulation that jointly optimizes vocabulary construction and downstream task performance (proven NP-hard), (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards, with convergence guarantees, and (iii) an adaptive parameter learning mechanism based on a Gumbel-Softmax relaxation that enables end-to-end optimization.
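The abstract does not spell out implementation details, so the following is a minimal illustrative sketch, not the authors' method: it shows how a quality-aware merge step might weight BPE-style pair counts by per-position quality scores and relax the discrete merge choice with Gumbel-Softmax. All names, the weighting exponent `alpha`, and the temperature `tau` are hypothetical.

```python
import math
import random
from collections import defaultdict

def quality_weighted_merge_scores(corpus, quality, alpha=1.0):
    """Score each adjacent token pair by frequency weighted by
    per-position quality, so merges in noisy regions are down-weighted.

    corpus  : list of token sequences (lists of str)
    quality : parallel list of per-token quality scores in [0, 1]
    alpha   : exponent controlling how strongly quality gates merges
              (hypothetical parameter, not from the paper)
    """
    scores = defaultdict(float)
    for seq, q in zip(corpus, quality):
        for i in range(len(seq) - 1):
            pair = (seq[i], seq[i + 1])
            # Geometric mean of the two positions' quality, raised to alpha.
            weight = (q[i] * q[i + 1]) ** (0.5 * alpha)
            scores[pair] += weight
    return scores

def sample_merge_gumbel_softmax(scores, tau=0.5):
    """Relax the discrete argmax merge choice: perturb log-scores with
    Gumbel noise and soften with temperature tau, so the selection could
    sit inside an end-to-end differentiable training loop.
    """
    pairs = list(scores)
    logits = [math.log(scores[p] + 1e-12) for p in pairs]
    gumbels = [-math.log(-math.log(random.random() + 1e-12)) for _ in pairs]
    perturbed = [(l + g) / tau for l, g in zip(logits, gumbels)]
    # Numerically stable softmax over the perturbed logits.
    z = max(perturbed)
    exps = [math.exp(v - z) for v in perturbed]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(pairs)), key=probs.__getitem__)
    return pairs[best], probs[best]

# Toy usage on sequencing-like data with per-base quality scores.
corpus = [["A", "C", "G", "T"], ["A", "C", "C", "G"]]
quality = [[0.9, 0.95, 0.4, 0.8], [0.99, 0.9, 0.85, 0.7]]
scores = quality_weighted_merge_scores(corpus, quality)
pair, prob = sample_merge_gumbel_softmax(scores, tau=0.5)
```

Lowering `tau` makes the sampled choice approach the hard argmax merge, while higher values keep the relaxation smooth for gradient-based parameter learning.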
We show that QA-Token achieves information-theoretic optimality under noisy conditions, with convergence guarantees for both policy and parameter learning. Experiments demonstrate consistent improvements in *genomics* (an 8.9% absolute F1 gain in variant calling, Hedges' *g* = 8.2) and *finance* (a 30% improvement in Sharpe ratio). At foundation scale, re-tokenizing METAGENE-1's 1.7 trillion base-pair corpus yields state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. A 1.2B-parameter financial model trained with QA-Token shows 12-27% improvements across forecasting tasks. These results demonstrate that quality-aware tokenization enables effective training on noisy corpora that standard methods cannot handle.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 21805