Abstract: In this paper, we present an alternative approach to language model training that emphasizes data quality over sheer volume. Using the authoritative Encyclopaedia Britannica as our corpus, we first restrict training to a 10-million-word subset and then expand to the complete 38-million-word corpus to train a specialized model for encyclopedic content generation. Central to our approach is knowledge distillation, which allows us to train compact student models guided by larger teacher models, achieving high performance while significantly reducing model complexity. Building on the BabyLlama architecture, we find that high-quality, curated data combined with effective distillation techniques can facilitate efficient and effective learning. This work highlights promising directions for resource-constrained applications and specialized domain modeling. We will release our code and models if this paper is accepted.
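The abstract's teacher-student setup can be illustrated with a minimal knowledge-distillation loss. The sketch below is not the paper's reported configuration: the function name, the temperature, and the mixing weight `alpha` are illustrative assumptions, and it shows a single teacher rather than any ensemble the authors may have used.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-label KL term.

    `temperature` and `alpha` are placeholder hyperparameters, not
    values reported in the paper.
    """
    vocab_size = student_logits.size(-1)
    # Standard next-token cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits.view(-1, vocab_size),
                         labels.view(-1))
    # KL divergence between temperature-softened student and teacher
    # distributions, scaled by T^2 as in standard distillation practice.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kl
```

In this formulation the student is pushed toward both the ground-truth tokens and the teacher's softened output distribution, which is what lets a compact student approach the teacher's quality on the curated corpus.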
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: distillation, parameter-efficient-training
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 1870