Multimodality extension to Universal Multilingual BPE Text Tokenizer

Published: 22 Sept 2025, Last Modified: 22 Sept 2025. WiML @ NeurIPS 2025. CC BY 4.0.
Keywords: Multimodality, Tokenizer
Abstract: This abstract proposes a multimodality extension to the paper "One Tokenizer To Rule Them All" [1]. The referenced paper [1] applies a bucketed weighting scheme to unseen/expanded languages grouped by (script, language) pairs, e.g. (Devanagari, Hindi) or (Latin, Polish), and trains a Byte-Pair Encoding (BPE) model on a diverse text corpus spanning ~69 languages (a combination of languages used in pretraining and many others included only for tokenizer coverage). It also provides byte fallback for edge cases outside the training data. This abstract describes concrete enhancements to that bucketed weighting scheme [1] so that it explicitly accounts for multiple modalities such as images, speech, OCR, and text. The enhancements below aim to achieve a Composite Multimodal Score (CMS) >= 0.95, at which point the resultant tokenizer is considered multimodal.

1. Modality-aware buckets [2]: Extend buckets to [script, modality] pairs, e.g. (Devanagari, OCR), (Arabic, ASR); assign higher sampling weights to underrepresented pairs and monitor per-bucket coverage in tokens/word and bytes/token. Keep the total vocabulary size the same and train BPE on the weighted samples (see the first sketch after this list). Measure: improvement on OCR/ASR-related tasks for the same scripts; per-bucket tokens/word. Success: at least a small positive lift (0.5–2 pts) on OCR-heavy tasks for those scripts vs. the baseline.

2. Confidence-weighted sampling: Use OCR/ASR confidence scores to downweight low-confidence examples or to preferentially sample medium-confidence ones for tokenizer training. Sample with probability ∝ (α + conf^β); precompute confidences, tune α (e.g. 0.05) and β (e.g. 1.0→2.0), and keep some low-confidence data included via α (sketched after this list). Measure: token noise (tokens seen only in low-confidence data), downstream VQA/DocVQA on OCR, stability of merges. Success: cleaner merges (fewer spurious tokens), a small downstream improvement, and a reduction in tokens seen primarily in noisy buckets.

3. Adaptive reweighting with feedback [3]: During tokenizer training, periodically evaluate downstream proxy tasks (small VQA/ASR validation slices) and reweight buckets that show poor downstream performance. Every N steps, compute per-bucket validation loss and increase the sampling weight of high-loss buckets up to a cap (sketched after this list). Measure: convergence speed on the proxies; stability of the vocabulary. Success: faster improvement on held-out proxies with a stable vocabulary.

4. Cross-modal coverage balancing [4]: Upweight text segments that are aligned to images or speech (e.g. an OCR region plus its image) so that merges capture visually grounded tokens: mark multimodally aligned text and multiply its sampling weight by γ (1.5–3.0). Measure: improvement in grounded retrieval / DocVQA EM and fewer mis-OCR tokens. Success: a noticeable lift (>= 1–3 pts) on grounding tasks.

5. Curriculum-based bucket scheduling [5]: Train in systematic phases: Phase A uses clean, high-confidence multimodal pairs; Phase B gradually adds noisy/augmented examples over N steps (sketched after this list). Measure: merge stability (fewer reversions); downstream robustness to noisy OCR. Success: better OCR robustness and fewer low-quality tokens.

6. Multimodal-aware validation metrics [6][7][8][9]: We want metrics that reflect multimodal performance, not just perplexity or compression. The Composite Multimodal Score (CMS) should be >= 0.95 AND no single primary metric should fall below its target:

CMS = Σ_i w_i · (M_i / T_i)

where M_i is the measured metric value (see the metric column of the accompanying table), T_i is the target threshold (so a ratio >= 1 is "good"), and w_i is the weight assigned to the modality (w_i values: core text = 20%, OCR = 25%, ASR = 25%, DocVQA = 15%, captions = 15%). A worked computation follows the sketches below.
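A minimal sketch of how the bucket, confidence, and alignment weights from items 1, 2, and 4 could be combined into a single sampling weight before BPE training. The example schema, field names, boost values, and the α/β/γ settings are illustrative assumptions, not values specified in [1].

```python
import random

def sampling_weight(example, bucket_boost, alpha=0.05, beta=1.5, gamma=2.0):
    """Combine the proposed weighting signals (all constants are assumptions)."""
    bucket = (example["script"], example["modality"])    # item 1: [script, modality] bucket
    weight = bucket_boost.get(bucket, 1.0)               # boost underrepresented pairs
    weight *= alpha + example["confidence"] ** beta      # item 2: prob ∝ (α + conf^β)
    if example.get("aligned", False):                    # item 4: cross-modal alignment
        weight *= gamma                                   # γ in the proposed 1.5–3.0 range
    return weight

def draw_training_sample(examples, bucket_boost, k):
    """Weighted draw of text segments to feed an off-the-shelf BPE trainer."""
    weights = [sampling_weight(ex, bucket_boost) for ex in examples]
    return random.choices(examples, weights=weights, k=k)

# Usage with toy data and an assumed boost for an underrepresented pair:
corpus = [
    {"text": "नमस्ते दुनिया", "script": "Devanagari", "modality": "OCR",  "confidence": 0.72, "aligned": True},
    {"text": "hello world",  "script": "Latin",      "modality": "text", "confidence": 1.00, "aligned": False},
]
boost = {("Devanagari", "OCR"): 2.0}
print(draw_training_sample(corpus, boost, k=2))
```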
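A minimal sketch of the feedback loop in item 3, assuming per-bucket validation losses from the small VQA/ASR proxy slices are already available; the step size and cap are assumptions.

```python
def adaptive_reweight(bucket_boost, bucket_val_loss, step_size=0.5, cap=4.0):
    """Every N training steps, raise the sampling weight of buckets whose proxy
    validation loss is above the mean, capped for vocabulary stability."""
    mean_loss = sum(bucket_val_loss.values()) / len(bucket_val_loss)
    for bucket, loss in bucket_val_loss.items():
        boost = bucket_boost.get(bucket, 1.0)
        if loss > mean_loss:
            # proportional increase for underperforming buckets, up to the cap
            boost = min(boost * (1.0 + step_size * (loss / mean_loss - 1.0)), cap)
        bucket_boost[bucket] = boost
    return bucket_boost

# Usage with toy losses: the high-loss (Arabic, ASR) bucket receives a larger weight.
print(adaptive_reweight({}, {("Arabic", "ASR"): 3.1, ("Latin", "text"): 1.9}))
```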
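A minimal sketch of the two-phase schedule in item 5, assuming per-example confidence scores are available; the warm-up length and confidence thresholds are assumptions.

```python
def curriculum_min_confidence(step, warmup_steps=50_000, start=0.9, floor=0.3):
    """Phase A (early steps) keeps only clean, high-confidence multimodal pairs;
    Phase B linearly lowers the threshold so noisier/augmented examples enter
    over warmup_steps."""
    progress = min(step / warmup_steps, 1.0)
    return start - (start - floor) * progress

def curriculum_filter(examples, step):
    """Return the subset of examples admitted at the current training step."""
    threshold = curriculum_min_confidence(step)
    return [ex for ex in examples if ex["confidence"] >= threshold]
```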
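A worked computation of the CMS defined in item 6. Only the modality weights come from this abstract; the target thresholds and measured values are placeholder numbers chosen for illustration, not reported results.

```python
# Weights from the abstract; targets and measured values are illustrative placeholders.
weights  = {"core_text": 0.20, "ocr": 0.25, "asr": 0.25, "docvqa": 0.15, "captions": 0.15}
targets  = {"core_text": 0.90, "ocr": 0.85, "asr": 0.85, "docvqa": 0.70, "captions": 0.75}
measured = {"core_text": 0.91, "ocr": 0.86, "asr": 0.83, "docvqa": 0.71, "captions": 0.78}

# CMS = Σ_i w_i · (M_i / T_i); the tokenizer qualifies as multimodal only if
# CMS >= 0.95 AND every primary metric also meets its own target.
cms = sum(weights[k] * measured[k] / targets[k] for k in weights)
qualifies = cms >= 0.95 and all(measured[k] >= targets[k] for k in measured)
print(f"CMS = {cms:.3f}, qualifies = {qualifies}")
# Here CMS ≈ 1.01 but ASR misses its target, so the AND condition fails.
```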
References:
[1] https://arxiv.org/pdf/2506.10766
[2] https://www.rohan-paul.com/p/multilingual-and-multimodal-llms
[3] https://aclanthology.org/D17-1158/
[4] https://arxiv.org/abs/1909.11740
[5] https://dl.acm.org/doi/10.1145/1553374.1553380
[6] https://arxiv.org/abs/2007.00398
[7] https://arxiv.org/abs/1904.08920
[8] https://www.openslr.org/12
[9] https://commonvoice.mozilla.org/en

Tools used: ChatGPT, for validating the approaches and finding the references cited in this abstract.
Submission Number: 56