A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus
Keywords: semantic relation dataset generation, Turkish semantic relations corpus, low-resource language resources, LLM-augmented data synthesis, hybrid data creation protocol
Abstract: We present a hybrid methodology for generating large-scale semantic relation datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach has three phases: (1) FastText embeddings with agglomerative clustering to identify semantic clusters, (2) Gemini 2.5 Flash for automated classification of semantic relations, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relation types (synonyms, antonyms, co-hyponyms), a 10× increase in scale over existing resources at a cost of only $65. We validate the dataset on two downstream tasks: an embedding model that achieves 90% top-1 retrieval accuracy and a classification model that attains 90% macro-F1. The protocol is scalable, addresses critical data scarcity in Turkish NLP, and is applicable to other low-resource languages. We publicly release the dataset and models.
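As a rough illustration of phase (1), the sketch below groups words by embedding similarity and lists the within-cluster pairs that a later phase could pass to a relation classifier. It is a minimal pure-Python stand-in, not the authors' implementation: the three-dimensional vectors are invented toy values in place of FastText embeddings, average linkage with cosine distance is an assumed configuration, and the `threshold` value and the function names are hypothetical.

```python
import math
from itertools import combinations

# Toy 3-d vectors standing in for FastText embeddings (invented values).
TOY_VECTORS = {
    "büyük":   [0.9, 0.1, 0.0],  # "big"
    "kocaman": [0.8, 0.2, 0.0],  # "huge"
    "küçük":   [0.7, 0.3, 0.1],  # "small"; antonyms often sit close in embedding space
    "kedi":    [0.0, 0.1, 0.9],  # "cat"
    "köpek":   [0.1, 0.0, 0.9],  # "dog"
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per word and repeatedly merges the two closest
    clusters until the closest pair is farther apart than `threshold`.
    """
    clusters = [[word] for word in vectors]

    def linkage(c1, c2):
        # Mean pairwise cosine distance between the members of two clusters.
        dists = [cosine_distance(vectors[a], vectors[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def candidate_pairs(clusters):
    """List every unordered word pair that shares a cluster."""
    return [pair for c in clusters for pair in combinations(sorted(c), 2)]


# threshold=0.1 is an assumed value chosen for the toy data.
clusters = agglomerative_clusters(TOY_VECTORS, threshold=0.1)
pairs = candidate_pairs(clusters)
print(clusters)
print(pairs)
```

Note that the cluster containing "büyük" and "kocaman" also captures "küçük": proximity in embedding space does not by itself separate synonyms from antonyms or co-hyponyms, which is why a separate classification phase over the candidate pairs is needed.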
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: language resources, corpus/dataset creation, lexicon creation, evaluation methodology, lexical relations, synonymy/antonymy/co-hyponymy
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Turkish
Submission Number: 6795