Abstract: Large language models (LLMs) have achieved remarkable success in high-resource languages, yet progress for Tibetan remains severely constrained by the lack of large-scale, high-quality, and structured data. Existing Tibetan resources are fragmented, domain-limited, and insufficient to support modern LLM pipelines requiring pretraining, instruction tuning, safety alignment, and reasoning supervision. We introduce the \textbf{T}ibetan \textbf{F}oundation \textbf{D}ataset (\textbf{TFD}), the first comprehensive, large-scale, and expert-curated dataset explicitly designed for Tibetan large language modeling. \textit{TFD} comprises two complementary components: \textit{TIBSTC}, a unified corpus of over 11 billion tokens spanning literature, law, medicine, religion, and everyday communication, and \textit{TIBSTC-CoT}, the first large-scale Tibetan chain-of-thought dataset supporting explicit multi-step reasoning across diverse domains. Unlike prior Tibetan datasets, \textit{TFD} is structurally organized to support the full LLM development lifecycle, including pretraining, supervised fine-tuning, safety alignment, and preference optimization. We demonstrate its utility by training the \textit{Sun-Shine} family of Tibetan LLMs and evaluating them on understanding, safety, reasoning, and generation tasks. Results show consistent improvements over strong open-source and proprietary baselines, underscoring the importance of large-scale, structured data for low-resource language modeling. We release \textit{TFD} to facilitate reproducible research and the development of robust, culturally aligned Tibetan LLMs. Code and data are available at https://github.com/Vicentvankor/sun-shine.
External IDs: dblp:journals/corr/abs-2503-18288