Abstract: Advances in AI and natural language processing have transformed how humans and machines interact through language, with question answering (QA) systems playing a pivotal role. Knowledge base question answering (KBQA), which draws on structured knowledge graphs (KGs), makes it possible to handle a wide range of knowledge-intensive questions. However, KBQA datasets remain scarce, especially for low-resource languages. Many existing pipelines for constructing these datasets are outdated and labor-intensive, and they do not use modern assistive tools such as large language models (LLMs) to reduce the workload. To address this, we designed and implemented a modern, semi-automated approach to dataset creation, covering KBQA, machine reading comprehension (MRC), and information retrieval (IR), and tailored specifically to low-resource environments. We executed this pipeline and introduce the PUGG dataset, the first Polish KBQA dataset, along with novel datasets for MRC and IR. Additionally, we provide a comprehensive implementation, insightful findings, detailed statistics, and an evaluation of baseline models.
Paper Type: long
Research Area: Special Theme (conference specific)
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Polish