Abstract: The 2023 TinyStories project showed that small language models (SLMs) with under 10 million parameters can generate coherent English stories when trained on carefully curated datasets. In this work, we extend that framework to Hindi, Marathi, and Bangla, using both machine-translated and LLM-generated datasets to train SLMs of up to approximately 150 million parameters. We find that SLMs can produce high-quality stories in Indian languages with far fewer parameters than large models require. We also offer a complementary framework that uses the LLM-as-judge concept for an "inference score-based evaluation" of tokenization strategies and linguistic attribute learning. Our analysis reveals that language-specific tokenizers outperform general-purpose ones for Indian languages. Hindi models perform the strongest overall, achieving high scores in grammar, fluency, and context, supported by lower tokenization entropy and better morphological alignment. Each language exhibits different scaling behavior: Hindi benefits from wider models, Bangla emphasizes creativity with balanced setups, and Marathi requires more capacity due to its higher morphological complexity. Evaluations with neural metrics such as COMET-DA and LaBSE reinforce these observations with regard to content fidelity and semantic similarity. Synthetic datasets outperform translated ones by 15–30%. Our results advance both the practical application of SLMs to underserved languages and the theoretical understanding of neural language development.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Natural Language Processing, Multilingualism, Cross-Lingual NLP, Indic NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Hindi, Marathi, Bangla
Submission Number: 6993