Automatic Classification of Parental Behaviors in Bilingual Datasets from In-Person and Telehealth Language Assessment
Abstract: Text-based behavioral coding is a labor-intensive process for clinicians, particularly when annotating complex bilingual data. This study evaluates the performance of four state-of-the-art (SOTA) large language models (LLMs) in automating the classification of parent behaviors within a bilingual dataset comprising 59 Mandarin-English child language assessment sessions (16 in-person and 43 telehealth). While the four LLMs (GPT-4, Llama-3, Qwen2, and DeepSeek-V3) achieved notable accuracy, they still fell short of the performance of bilingual human annotators. An error analysis further revealed that both the human annotators and the best-performing model overall, GPT-4, struggled to classify parental behaviors in categories involving complex task procedures, especially when analyzing bilingual code-mixed text. This study contributes to the understanding of how LLMs can be used to advance automated behavioral coding in bilingual child language assessments.
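To make the task concrete, the sketch below illustrates the general shape of such a pipeline: prompting an LLM to assign one behavior code per parent utterance, then scoring the model's labels against human annotations. This is a minimal illustration under stated assumptions, not the paper's actual method; the category names, prompt wording, and example utterance are hypothetical placeholders rather than the study's coding scheme.

```python
# Minimal sketch of an LLM-based behavioral-coding pipeline. The categories,
# prompt text, and utterances below are hypothetical illustrations, not the
# study's actual coding scheme or prompts.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical parent-behavior categories (placeholders, not the study's codes).
CATEGORIES = ["prompting", "modeling", "reinforcement", "off-task"]

def build_prompt(utterance: str) -> str:
    """Format one code-mixed parent utterance as a single-label classification prompt."""
    return (
        "Classify the parent's behavior in this Mandarin-English child "
        "language assessment utterance into exactly one category: "
        f"{', '.join(CATEGORIES)}.\n"
        f"Utterance: {utterance}\n"
        "Answer with the category name only."
    )

# Toy code-mixed example; in the study, prompts like this would be sent to
# GPT-4, Llama-3, Qwen2, or DeepSeek-V3.
print(build_prompt("你看 this one, 这是什么 animal?"))

# Toy labels standing in for human-annotator gold codes vs. model predictions.
gold = ["prompting", "modeling", "prompting", "reinforcement", "off-task"]
pred = ["prompting", "modeling", "off-task", "reinforcement", "off-task"]

print(f"Accuracy: {accuracy_score(gold, pred):.2f}")
print(f"Cohen's kappa vs. human annotator: {cohen_kappa_score(gold, pred):.2f}")
```

Chance-corrected agreement such as Cohen's kappa, rather than raw accuracy alone, is the standard way to compare model output against human behavioral coders, since some categories occur far more often than others.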
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingual benchmarks, multilingualism, multilingual evaluation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English, Mandarin
Submission Number: 6527