Abstract: The 2023 TinyStories project showed that small language models (SLMs) with under 10 million parameters can generate coherent English stories when trained on carefully curated datasets. In this work, we extend that framework to Hindi, Marathi, and Bangla, using both machine-translated and LLM-generated datasets to train SLMs of up to approximately 150 million parameters. We find that SLMs can produce high-quality stories in Indian languages with far fewer parameters than large models require. We also offer a complementary framework that uses the LLM-as-judge concept for an "inference score-based evaluation" of tokenization strategies and linguistic attribute learning. Our analysis reveals that language-specific tokenizers outperform general-purpose ones for Indian languages. Hindi models perform the strongest overall, achieving high scores in grammar, fluency, and context, supported by lower tokenization entropy and better morphological alignment. Each language exhibits different scaling behavior: Hindi benefits from wider models, Bangla emphasizes creativity with balanced setups, and Marathi requires more capacity due to its higher morphological complexity. Evaluations with neural metrics such as COMET-DA and LaBSE reinforce these observations with regard to content fidelity and semantic similarity. Synthetic datasets outperform translated ones by 15–30%. Our results advance both the practical application of SLMs to underserved languages and the theoretical understanding of neural language development.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Natural Language Processing, Multilingualism, Cross-Lingual NLP, Indic NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Hindi, Marathi, Bangla
Submission Number: 6993