Abstract: Large language models (LLMs) are applied in diverse contexts of our lives, including child education. Here, we evaluate the ability of an LLM to generate child-like language by comparing an LLM-based corpus to the Litkey Corpus, a collection of German children's writings based on picture stories. We generated a parallel LLM-based corpus using identical visual prompts and conducted a comparative analysis across word frequency distributions, lexical richness, and semantic representations. This study explores if and how children and LLMs differ in psycholinguistic aspects of text, in order to evaluate the potential influence of LLM-generated text on child development. The results show that, while the LLM-based texts are longer, their vocabulary is less rich, their words contain more letters, and they lack words from the medium- and low-frequency ranges (i.e., they rely primarily on frequently occurring words). Additionally, vector space analysis using semantic word embeddings reveals low semantic similarity, highlighting differences between the two corpora at the level of corpus semantics. These findings contribute to our understanding of LLM-generated language and its limitations in modeling child language, with implications for LLM usage in psycholinguistics and educational applications.
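The comparison measures named in the abstract (lexical richness, word length, and embedding-based semantic similarity) can be illustrated with a minimal sketch. This is not the paper's actual implementation; the toy token lists and the three-dimensional embedding vectors below are purely hypothetical stand-ins for the real corpora and word embeddings.

```python
import math

def type_token_ratio(tokens):
    # Lexical richness: number of unique word types divided by total tokens.
    return len(set(tokens)) / len(tokens)

def mean_word_length(tokens):
    # Average number of letters per token.
    return sum(len(t) for t in tokens) / len(tokens)

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors (e.g., mean corpus vectors).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical mini-samples standing in for a child corpus and an LLM corpus.
child_tokens = "der hund läuft und der hund bellt".split()
llm_tokens = "der aufgeweckte hund rennt fröhlich und der hund bellt laut".split()

print(type_token_ratio(child_tokens), type_token_ratio(llm_tokens))
print(mean_word_length(child_tokens), mean_word_length(llm_tokens))
# Hypothetical 3-dimensional corpus-level embedding vectors.
print(cosine_similarity([1.0, 0.0, 0.5], [0.9, 0.1, 0.4]))
```

In practice such comparisons would run over the full corpora with trained word embeddings (and normalization, tokenization, and frequency binning the sketch omits), but the arithmetic is the same.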
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: linguistic theories, cognitive modeling, computational psycholinguistics
Contribution Types: Model analysis & interpretability
Languages Studied: German
Submission Number: 3889