Deriving de/het gender classification for Dutch nouns for rule-based MT generation tasks

Bogdan Babych, Jonathan Geiger, Mireia Ginestí Rosell, Kurt Eberle

16 Feb 2022 (modified: 16 Feb 2022)OpenReview Archive Direct UploadReaders: Everyone

Abstract: Linguistic resources available in the public domain, such as lemmatisers, part-ofspeech taggers and parsers can be used for the development of MT systems: as separate processing modules or as annotation tools for the training corpus. For SMT this annotation is used for training factored models, and for the rule-based systems linguistically annotated corpus is the basis for creating analysis, generation and transfer dictionaries from corpora. However, the annotation in many cases is insufficient for rule-based MT, especially for the generation tasks. In this paper we analyze a specific case when the part-ofspeech tagger does not provide information about de/het gender of Dutch nouns that is needed for our rule-based MT systems translating into Dutch. We show that this information can be derived from large annotated monolingual corpora using a set of context-checking rules on the basis of co-occurrence of nouns and determiners in certain morphosyntactic configurations. As not all contexts are sufficient for disambiguation, we evaluate the coverage and the accuracy of our method for different frequency thresholds in the news corpora. Further we discuss possible generalization of our method, and using it to automatically derive other types of linguistic information needed for rule-based MT: syntactic subcategorization frames, feature agreement rules and contextually appropriate collocates.

0 Replies