Limited Linguistic Diversity in Embodied AI Datasets

ACL ARR 2025 May Submission5166 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Language is an important component of Vision-Language-Action (VLA) models, but the linguistic quality of training and test data remains underexplored. We analyze language in several VLA datasets and find that it is highly repetitive and structurally simple. These findings highlight the need for more diverse and linguistically rich data to support robust language understanding in embodied settings.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: evaluation of datasets
Contribution Types: Data analysis, Position papers
Languages Studied: English
Submission Number: 5166