A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus

ACL ARR 2026 January Submission 6795 Authors

05 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | CC BY 4.0

Keywords: semantic relation dataset generation, Turkish semantic relations corpus, low-resource language resources, LLM-augmented data synthesis, hybrid data creation protocol

Abstract: We present a hybrid methodology for generating large-scale semantic relation datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with agglomerative clustering to identify semantic clusters, (2) Gemini 2.5-Flash for automated semantic relationship classification, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relationship types (synonyms, antonyms, co-hyponyms), a 10× scale increase over existing resources at minimal cost ($65). We validate the dataset through two downstream tasks: an embedding model achieving 90% top-1 retrieval accuracy and a classification model attaining 90% macro-F1. Our scalable protocol addresses critical data scarcity in Turkish NLP and demonstrates applicability to other low-resource languages. We publicly release the dataset and models.
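As a rough illustration of phase (1), the sketch below clusters word vectors agglomeratively and emits the within-cluster word pairs that a later phase could classify. It is a minimal sketch under stated assumptions, not the authors' implementation: the three-dimensional vectors are hand-made stand-ins for FastText embeddings, and the average linkage, cosine distance, the `threshold` value, and the function names `agglomerative_clusters` and `candidate_pairs` are all hypothetical choices that the abstract does not specify.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering (an assumed linkage).

    Starts with one cluster per word and repeatedly merges the two
    closest clusters until the closest pair is farther than `threshold`.
    """
    clusters = [[word] for word in vectors]

    def linkage(a, b):
        # Mean pairwise cosine distance between members of a and b.
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def candidate_pairs(clusters):
    """List every unordered word pair that shares a cluster."""
    return [pair for c in clusters for pair in combinations(sorted(c), 2)]


# Toy stand-ins for FastText vectors (hypothetical values).
TOY_VECTORS = {
    "büyük": [0.90, 0.10, 0.00],  # "big"
    "iri":   [0.85, 0.15, 0.05],  # "large"
    "küçük": [0.80, 0.20, 0.10],  # "small"
    "kedi":  [0.00, 0.90, 0.30],  # "cat"
    "köpek": [0.05, 0.85, 0.35],  # "dog"
}

clusters = agglomerative_clusters(TOY_VECTORS, threshold=0.1)
pairs = candidate_pairs(clusters)
```

On this toy data the three size adjectives share one cluster and the two animals share another. Clustering alone cannot tell a synonym pair from an antonym pair inside the same cluster, which is presumably why the protocol follows it with a separate classification phase.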

Paper Type: Long

Research Area: Low-resource Methods for NLP

Research Area Keywords: language resources, corpus/dataset creation, lexicon creation, evaluation methodology, lexical relations, synonymy/antonymy/co-hyponymy

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models, Data resources, Data analysis

Languages Studied: Turkish

Submission Number: 6795
