Constructing A Human-Centered Dataset for Second Language Acquisition via Large Language Models

Published: 01 Jan 2024 · Last Modified: 18 May 2025 · IEEE Big Data 2024 · CC BY-SA 4.0
Abstract: Large language models (LLMs) achieve advanced capabilities in natural language processing (NLP) through pre-training on massive data, suggesting that LLMs have great potential to enhance human language learning. In second language (L2) acquisition, complex sentences pose significant challenges for learners due to limited working memory, and mastering grammar is crucial for overcoming these difficulties. However, the syntactic parsing datasets used in NLP tasks are typically designed for computational purposes and are not ideal for pedagogical use. In this work, we propose a pipeline that leverages LLMs to construct a human-centered dataset for L2 acquisition, in which each complex sentence is visualized as a hierarchical tree diagram built from its decomposed short sentences (i.e., clauses) within an interactive interface. Specifically, we employ a Least-to-Most (LtM) prompting strategy with seven mainstream LLMs to convert complex sentences into clause hierarchies, achieving an F1 score of 0.8–0.9. Using this pipeline, we have constructed a "sentence-to-clausal-tree" dataset comprising 39,422 complex sentences from newswires and weblogs. Additionally, we offer a web-based application for visualization and user interaction. In a trial with ten L2 learners, our web application was highly acknowledged for enhancing comprehension and memory retention of complex English sentences.
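As a rough illustration of the two-stage decomposition the abstract describes, the sketch below shows how a Least-to-Most prompting loop might first elicit the clauses of a complex sentence and then arrange them into a parent/child tree. The `call_llm` stub, the prompt wording, and the edge format are hypothetical stand-ins, not the paper's actual prompts or API.

```python
# Hypothetical sketch of a Least-to-Most (LtM) clause-tree pipeline:
# stage 1 asks the model to list the clauses of a complex sentence,
# stage 2 asks it to arrange those clauses into a hierarchy.
# `call_llm` is a stub standing in for any chat-completion API call.

def call_llm(prompt: str) -> str:
    """Stub: a real pipeline would send `prompt` to an LLM here."""
    if "List the clauses" in prompt:
        return ("1. the storm knocked out power\n"
                "2. which lasted all night\n"
                "3. so schools closed the next day")
    # Stage 2 returns hierarchy edges as "parent -> child" index pairs.
    return "1 -> 2\n1 -> 3"

def decompose(sentence: str) -> list[str]:
    """Stage 1: split a complex sentence into numbered clauses."""
    prompt = f"List the clauses in this sentence:\n{sentence}"
    lines = call_llm(prompt).splitlines()
    return [line.split(". ", 1)[1] for line in lines]

def build_tree(clauses: list[str]) -> dict[int, list[int]]:
    """Stage 2: map each parent clause index to its child indices."""
    prompt = "Arrange these clauses into a hierarchy:\n" + "\n".join(clauses)
    tree: dict[int, list[int]] = {}
    for edge in call_llm(prompt).splitlines():
        parent, child = (int(x) for x in edge.split(" -> "))
        tree.setdefault(parent, []).append(child)
    return tree

sentence = ("The storm, which lasted all night, knocked out power, "
            "so schools closed the next day.")
clauses = decompose(sentence)
tree = build_tree(clauses)
```

In this sketch the root clause (1) parents both the relative clause (2) and the result clause (3), mirroring the kind of clausal tree the dataset visualizes; the paper's real pipeline additionally evaluates such trees against gold parses to obtain the reported F1 of 0.8–0.9.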