PictoEduca: Building a Dataset for Spanish Text-to-Pictogram Generation

Authors: ACL ARR 2026 January Submission4210 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, Readers: Everyone, License: CC BY 4.0

Keywords: text-to-pictogram generation, dataset building, ARASAAC

Abstract: We present PictoEduca, the first large-scale Spanish text-to-pictogram dataset for augmentative and alternative communication (AAC), derived from primary educational materials and grounded in the ARASAAC pictogram repository. The dataset is released with a reproducible pipeline that combines automatic annotation with targeted expert correction, supporting scalable and high-quality corpus construction. We benchmark a rule-based system (ARAWORD) and neural models (T5, LLaMA) under direct text-to-pictogram and two-stage text-to-concept-to-pictogram settings. Results show that the rule-based system remains a strong baseline, while neural models benefit from explicit semantic abstraction, with the two-stage approach improving semantic coherence and reducing ambiguity. We further explore data selection strategies, demonstrating that combining domain similarity with a quality signal yields higher-quality silver data, reduces annotation effort, and improves model performance in low-resource regimes. PictoEduca enables reproducible evaluation and advances Spanish text-to-pictogram research.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: corpus creation, benchmarking, language resources, automatic creation and evaluation of language resources

Contribution Types: Data resources

Languages Studied: Spanish

Submission Number: 4210
