Multilingual Program Code Classification Using $n$-Layered Bi-LSTM Model With Optimized Hyperparameters
Abstract: Programmers may solve the same problem in multiple programming languages, resulting in the accumulation of a huge number of multilingual solution codes. Consequently, identifying relevant code in this vast multilingual archive is a challenging and non-trivial task. Given the complexity of program code compared to natural language, conventional language models have had limited success. Deep neural network models have achieved state-of-the-art performance in programming-related tasks; however, multilingual code classification by problem name or algorithm remains an open problem. This paper presents a novel multilingual program code classification model that classifies code by algorithm and problem name. First, a layered bidirectional long short-term memory (Bi-LSTM) model is designed to better capture the complex context of code. Second, preprocessing, tokenization, and encoding are performed on real-life datasets. Next, clean, trainable formatted data are prepared. Finally, experiments are conducted on real-life datasets (e.g., sorting, searching, graphs and trees, numerical computations, basic data structures, and their combinations) with optimized hyperparameter settings. The results show that the proposed model improves code classification accuracy over baseline models.
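The $n$-layered Bi-LSTM the abstract names can be sketched as a stacked bidirectional forward pass. This is a minimal illustration only, not the authors' implementation: the gate layout, random initialization, and the `n_layer_bilstm` helper are assumptions, and a real model would add an embedding layer, a softmax classifier head, and trained weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell update: input, forget, candidate, and output gates."""
    H = h.shape[0]
    z = W @ x + U @ h + b                         # stacked gate pre-activations, shape (4H,)
    i, f = sigmoid(z[:H]), sigmoid(z[H:2*H])      # input / forget gates
    g, o = np.tanh(z[2*H:3*H]), sigmoid(z[3*H:])  # candidate cell / output gate
    c = f * c + i * g                             # new cell state
    return o * np.tanh(c), c                      # new hidden state, cell state

def init_params(rng, in_dim, hidden):
    """Illustrative random initialization (assumption, not the paper's)."""
    return (rng.normal(0, 0.1, (4 * hidden, in_dim)),
            rng.normal(0, 0.1, (4 * hidden, hidden)),
            np.zeros(4 * hidden))

def bilstm_layer(xs, pf, pb, hidden):
    """One bidirectional layer: run left-to-right and right-to-left,
    then concatenate the two hidden states at each time step."""
    def run(seq, params):
        h, c, out = np.zeros(hidden), np.zeros(hidden), []
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
            out.append(h)
        return out
    fwd = run(xs, pf)
    bwd = run(xs[::-1], pb)[::-1]
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

def n_layer_bilstm(xs, n_layers, hidden, seed=0):
    """Stack n bidirectional layers; each layer consumes the previous
    layer's concatenated (2*hidden)-dimensional outputs."""
    rng = np.random.default_rng(seed)
    in_dim = xs[0].shape[0]
    for _ in range(n_layers):
        pf = init_params(rng, in_dim, hidden)
        pb = init_params(rng, in_dim, hidden)
        xs = bilstm_layer(xs, pf, pb, hidden)
        in_dim = 2 * hidden
    return xs[-1]  # final-step features, which would feed a classifier head

# Toy run: a 5-token sequence of 8-dimensional embeddings through 2 layers.
seq = [np.ones(8) for _ in range(5)]
features = n_layer_bilstm(seq, n_layers=2, hidden=16)  # shape (32,)
```

Stacking layers this way lets upper layers operate on context-enriched representations, which is the usual motivation for deep Bi-LSTMs over a single layer.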
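The preprocessing, tokenization, and encoding steps mentioned above can be sketched as follows. The tokenizer regex, the comment-stripping rule, the vocabulary scheme, and `max_len` are illustrative assumptions, not the paper's actual pipeline.

```python
import re

def tokenize(code):
    """Strip line comments, then split code into identifiers, numbers,
    and single-character operator/punctuation tokens."""
    code = re.sub(r"//.*|#.*", "", code)            # drop C-style and Python line comments
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def build_vocab(token_lists, pad="<pad>", unk="<unk>"):
    """Map each distinct token to an integer ID; 0 = padding, 1 = unknown."""
    vocab = {pad: 0, unk: 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, max_len=16):
    """Convert tokens to IDs, truncating or padding to a fixed length
    so every sequence has the same shape for the model."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

# Toy multilingual inputs: the same problem solved in C and in Python.
codes = ["int add(int a, int b) { return a + b; }",
         "def add(a, b): return a + b"]
token_lists = [tokenize(c) for c in codes]
vocab = build_vocab(token_lists)
encoded = [encode(t, vocab) for t in token_lists]
```

Because the vocabulary is built over tokens from every language in the corpus, syntactically shared tokens (identifiers, operators, literals) map to the same IDs across languages, which is one simple way to feed multilingual code to a single model.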