Language Models as Simulations of Early Language Acquisition: analysis of expressive vocabulary

ACL ARR 2024 June Submission5610 Authors

16 Jun 2024 (modified: 03 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Large language models (LLMs) have been shown to develop linguistic competence from mere exposure to language content, making them a promising avenue for investigating infants' language learning processes \citep{lavechin2023babyslm,chang2022word}. Nevertheless, LLMs typically require orders of magnitude more data than children, and language outcomes cannot be directly compared. Here, we introduce \textit{machine-CDI}, a metric based on the learner's output that enables a direct comparison of machines and infants on their expressive vocabulary as a function of input quantity. This metric adapts the Communicative Development Inventories \citep{fenson2007macarthur,frank2017wordbank}, a normalized inventory of words used to quantify child language development, to the evaluation of language models. We illustrate machine-CDI by comparing the expressive vocabulary of infants and character language models (LSTMs and Transformers) trained on English audiobooks. The results show that language models approximately match children's learning curves, although Transformers are delayed compared to LSTMs. A further analysis shows that the models are more impacted by word frequency than children, with models exhibiting a large delay in acquiring low-frequency words. This delay is linked to the more general phenomenon of long-tail truncation observed in language models, which leaves them unable to learn words from only a few observations. These results shed new light on the principles of language acquisition and highlight important divergences in how humans and modern algorithms learn to process natural language.
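The core idea of machine-CDI — checking which words from a CDI-style inventory appear in a learner's output — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `machine_cdi`, the whitespace tokenization, and the production threshold `min_count` are all assumptions.

```python
# Hypothetical sketch of a machine-CDI-style score: the fraction of words
# from a CDI-like inventory that a model has "produced" in generated text.
from collections import Counter

def machine_cdi(generated_text: str, cdi_words: set, min_count: int = 1) -> float:
    """Fraction of inventory words produced at least `min_count` times."""
    counts = Counter(generated_text.lower().split())
    produced = {w for w in cdi_words if counts[w] >= min_count}
    return len(produced) / len(cdi_words)

# Toy usage with a tiny invented inventory and generated sample.
inventory = {"dog", "ball", "milk", "mommy"}
sample = "the dog chased the ball and the dog barked"
score = machine_cdi(sample, inventory)  # 2 of 4 inventory words produced -> 0.5
```

Tracking this score while a model is trained on increasing amounts of input would yield a vocabulary growth curve comparable to CDI-based curves for children, which is the comparison the abstract describes.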
Paper Type: Long
Research Area: Generation
Research Area Keywords: cognitive modeling; computational psycholinguistics
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5610