Urdu Semantic Parsing: An Improved SEMPRE Framework for Conversion of Urdu Language Web Queries to Logical Forms

Nafees Ahmad, Muhammad Aslam, Sana Shams, Ana María Martínez Enríquez

Published: 2023, Last Modified: 20 Feb 2025, MCPR 2023, CC BY-SA 4.0

Abstract: This paper presents a semantic parser for Urdu-language web queries about a journal dataset. It is built on the SEMPRE framework and trained on natural language paraphrases of logical queries from the Tehqeeqat domain. The parser uses linguistic information, including part-of-speech (POS) tags, skip-bigrams, word alignment, and paraphrases. SEMPRE can be applied to a new domain with zero training examples. First, we use domain-specific and domain-general grammars to generate canonical utterances paired with logical forms. The canonical utterances are meant to capture the meaning of the logical forms, and the logical forms cover the meaning of the set of compositional operators. These canonical utterances are then carefully paraphrased, preserving the semantics of the original query. The resulting dataset, containing utterances, paraphrases, and logical forms, is used to train the semantic parser. Our semantic parser performs well on domain-specific queries and can be easily trained on a new domain.
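The generation step described in the abstract, where a grammar yields canonical utterances paired with logical forms, can be illustrated with a minimal Python sketch. This is not the authors' grammar or SEMPRE's grammar format: the lexicon entries, the predicate names (`type.article`, `author`, `publication_year`), the English placeholder phrases, and the `generate_pairs` helper are all assumptions made for illustration. A real system would use Urdu phrases and the Tehqeeqat schema.

```python
from itertools import product

# Hypothetical domain-specific lexicon: surface phrase -> logical-form symbol.
ENTITY_TYPES = {"article": "type.article", "journal": "type.journal"}
PROPERTIES = {"author": "author", "publication year": "publication_year"}
# Hypothetical property values: property phrase -> {value phrase: symbol}.
VALUES = {
    "author": {"alice": "en.person.alice"},
    "publication year": {"2020": "(number 2020)"},
}


def generate_pairs():
    """Apply one domain-general template, '<type> whose <property> is <value>',
    to every lexicon combination and return (utterance, logical form) pairs."""
    pairs = []
    for (type_phrase, type_lf), (prop_phrase, prop_lf) in product(
        ENTITY_TYPES.items(), PROPERTIES.items()
    ):
        for value_phrase, value_lf in VALUES[prop_phrase].items():
            utterance = f"{type_phrase} whose {prop_phrase} is {value_phrase}"
            logical_form = f"(and {type_lf} ({prop_lf} {value_lf}))"
            pairs.append((utterance, logical_form))
    return pairs


if __name__ == "__main__":
    for utterance, logical_form in generate_pairs():
        print(f"{utterance}\t{logical_form}")
```

Each generated utterance would then be paraphrased into a natural query, and the paraphrase would inherit the logical form of its canonical utterance as the training label.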