SLlama: Parameter-Efficient Language Model Architecture for Enhanced Linguistic Competence Under Strict Data Constraints

ACL ARR 2025 May Submission 6239 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Scaling data and model size has driven recent advances in language modeling, but this strategy falters under strict data constraints, as in the BabyLM Challenge. Insights from Chinchilla highlight that smaller models trained on more data outperform larger counterparts trained inadequately, emphasizing the need for compact architectures. Furthermore, while embedding weight tying is a common parameter-saving technique, we find that it significantly diminishes linguistic competence in compact models. In response, we explore alternative architectural strategies that preserve the parameter efficiency of tied models without sacrificing the representational benefits of untied embeddings. We introduce SLlama, a Llama3 architecture variant that incorporates targeted modifications to compress transformer components: Repeated Reduced Hidden Size and Projection (RRHP), Permutated Weight Attention (PWA), Shared Projection Multi-Layer Perceptron (SPMLP), and Layer Weight Sharing. Without relying on distillation, SLlama achieves a 31.72% improvement in linguistic knowledge acquisition over the BabyLlama baseline, with a comparable GLUE score and a significantly lower parameter count. These results demonstrate that well-designed, compact models can rival larger ones under strict data constraints.
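
To make the weight-tying trade-off discussed in the abstract concrete, the sketch below shows a minimal PyTorch toy language-model head with tied versus untied input/output embeddings. This is a generic illustration of the standard technique under assumed settings (the TinyLM class, vocabulary and hidden sizes are hypothetical), not the authors' SLlama implementation.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Minimal LM skeleton illustrating tied vs. untied embeddings."""

    def __init__(self, vocab_size=16000, hidden_size=256, tie_embeddings=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        if tie_embeddings:
            # Tying reuses the input embedding matrix as the output projection,
            # saving vocab_size * hidden_size parameters.
            self.lm_head.weight = self.embed.weight

    def forward(self, input_ids):
        x = self.embed(input_ids)   # (batch, seq, hidden)
        # ... transformer blocks would go here ...
        return self.lm_head(x)      # (batch, seq, vocab)

tied = TinyLM(tie_embeddings=True)
untied = TinyLM(tie_embeddings=False)
count = lambda m: sum(p.numel() for p in m.parameters())
# The untied model carries an extra vocab_size * hidden_size weight matrix.
print(count(tied), count(untied))
```

In a compact model the embedding matrices dominate the parameter budget, which is why tying is attractive for parameter savings; the abstract's claim is that this saving comes at a cost to linguistic competence, motivating the alternative compression strategies named above.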
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: compact models, strict data constraint
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 6239