Attention-Driven Clustering of Programming Problem Difficulty Using Statistical and Machine Learning Techniques

Md. Shahajada Mia, Yutaka Watanobe, Md. Mostafizer Rahman, Md Faizul Ibne Amin, Daniel M. Muepu

Published: 01 Jan 2025, Last Modified: 25 May 2026, Venue: IEEE Access, License: CC BY-SA 4.0

Abstract: Programming is an important skill for advancement in computer science and information and communication technology (ICT). Online judge (OJ) systems are increasingly popular in programming education because they let users build programming skills through practice. However, most OJ systems list problems in a simple format without clearly identifying their difficulty levels. This research aims to cluster problems by difficulty effectively and to analyze user performance. We extract features from Aizu OJ (AOJ) submission log data and estimate user ability and problem difficulty scores using logistic regression (LR). To categorize the problems, we propose an approach that combines an attention mechanism (AM) with K-means applied to features selected statistically via pairwise correlation. We compare it against standard K-means and Fuzzy C-means (FCM), and their AM-integrated variants, on the full feature set and on feature sets reduced by principal component analysis (PCA), autoencoder (AE), and variance inflation factor (VIF); the proposed method demonstrates the best clustering performance. It achieves the highest silhouette score (SS) of 0.507, Calinski-Harabasz index (CHI) of 6581.088, and a comparatively good Davies-Bouldin index (DBI) of 0.625. We found a strong correlation between user ability and problem-solving rates; beginner and intermediate users consistently struggle to solve more complex problems. To address this, we developed web-based clustering and group-based recommendations that visualize problem difficulty clusters, enabling users to understand difficulty levels and select problems that match their ability level. Moreover, the proposed approach is promising for effective clustering of more complex educational datasets and for improving learning outcomes in programming education.
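
Illustrative sketch (not the authors' implementation): the abstract describes selecting features by pairwise correlation, clustering with K-means, and scoring the result with the silhouette score. The standard-library Python below shows one way those three steps could fit together on a tiny synthetic dataset. The attention mechanism is deliberately left out because the abstract does not specify its form. The 0.9 correlation threshold, the three toy features, the farthest-first initialization, and k = 2 are all assumptions made for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def prune_correlated(rows, threshold=0.9):
    """Greedily keep a feature only if |r| <= threshold against every kept one."""
    cols = [list(col) for col in zip(*rows)]
    kept = []
    for j, col in enumerate(cols):
        if all(abs(pearson(col, cols[k])) <= threshold for k in kept):
            kept.append(j)
    return kept

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two non-empty clusters."""
    total = 0.0
    for i, p in enumerate(points):
        mean_dist = {}
        for c in set(labels):
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        if labels[i] not in mean_dist:  # singleton cluster scores 0
            continue
        a = mean_dist[labels[i]]
        b = min(v for c, v in mean_dist.items() if c != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)

def demo_rows():
    """Synthetic problems: [solve_rate, near-duplicate of it, unrelated feature]."""
    rows = []
    for centre in (0.2, 0.8):  # hypothetical "hard" and "easy" groups
        for i in range(30):
            solve_rate = centre + 0.01 * ((i % 7) - 3)
            scaled_rate = 2 * solve_rate + 0.001 * ((i % 5) - 2)
            other = 0.5 + 0.01 * ((i * 3) % 11 - 5)
            rows.append([solve_rate, scaled_rate, other])
    return rows

rows = demo_rows()
kept = prune_correlated(rows, threshold=0.9)   # drops the near-duplicate column
points = [[r[j] for j in kept] for r in rows]
labels, centers = kmeans(points, k=2)
score = silhouette(points, labels)
print("kept feature indices:", kept)
print("silhouette score:", round(score, 3))
```

On real submission-log features, a library implementation such as scikit-learn would be the more practical choice, and the Calinski-Harabasz and Davies-Bouldin indices could be computed alongside the silhouette score to mirror the three metrics reported in the abstract.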

External IDs: doi:10.1109/access.2025.3606125