Preprocessing Requirements Documents for Automatic UML Modelling

Published: 01 Jan 2022 · Last Modified: 15 Jun 2024 · NLDB 2022 · CC BY-SA 4.0
Abstract: Current approaches to natural language processing of requirements documents restrict their input to documents that are relevant only to specific types of models, such as domain- or process-focused models. Such input texts do not reflect real-world requirements documents. To address this issue, we propose a pipeline for preprocessing such requirements documents at the conceptual level, for subsequent automatic generation of class, activity, and use case models in the Unified Modelling Language (UML) downstream. Our pipeline consists of three steps. First, we implement entity-based extractive summarization of the raw text so that the parts of the requirements relevant to the modelling goal can be highlighted. Second, we develop a rule-based bucketing method that assigns sentences to a range of ‘buckets’ for transformation into their corresponding UML models. Finally, to demonstrate the effectiveness of supervised machine learning models on requirements texts, a sequence labelling model is applied to the text specific to class modelling, distinguishing classes from attributes in the running text. To enable this step of our pipeline, we address the lack of available annotated data by labelling the widely used PURE requirements dataset at the word level, tagging classes and attributes within the texts. We validate our findings using this extended dataset.
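The abstract's second step, rule-based bucketing, can be illustrated with a minimal sketch. The paper's actual rules are not given in the abstract, so the keyword patterns and bucket names below are purely hypothetical assumptions, showing only the general idea of routing sentences to model-specific buckets:

```python
import re

# Hypothetical keyword heuristics; the paper's real bucketing rules are not
# stated in the abstract, so these patterns are illustrative assumptions only.
BUCKET_PATTERNS = {
    "class": re.compile(r"\b(has|contains|consists of|attribute)\b", re.I),
    "activity": re.compile(r"\b(then|after|before|first|next|finally)\b", re.I),
    "use_case": re.compile(r"\b(user|actor|shall be able to)\b", re.I),
}

def bucket_sentences(sentences):
    """Assign each requirement sentence to zero or more UML-model buckets."""
    buckets = {name: [] for name in BUCKET_PATTERNS}
    for sentence in sentences:
        for name, pattern in BUCKET_PATTERNS.items():
            if pattern.search(sentence):
                buckets[name].append(sentence)
    return buckets

reqs = [
    "A customer has a name and an address.",
    "The user shall be able to cancel an order.",
    "After payment is confirmed, the system then ships the order.",
]
print(bucket_sentences(reqs))
```

A sentence may land in several buckets, which matches the intuition that one requirement can contribute to both a class model and an activity model; the downstream transformation would then consume each bucket independently.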