Keywords: Open Source Software, Issue Classification, Random Forest
Abstract: Open Source Software (OSS) and its accompanying communities are a valuable environment for university students to gain real-world collaborative development experience. However, OSS tasks are often too complex or poorly scoped, which makes it hard to identify tasks suitable for students looking to contribute to these communities. Our goal is to automate the discovery of OSS issues that are both educationally valuable and technically feasible for university-level coursework. We compare the performance of a supervised Machine Learning (ML) classifier and a Large Language Model (LLM) baseline, both trained on a labeled dataset of GitHub issues, for flagging issues as "Candidates for University Projects" and identifying their characteristics. Despite extreme class imbalance (1.6\% positive rate), the Random Forest classifier recovered $\sim$45\% (5 out of 11) of the JabRef issues tagged as "Candidate University Projects" in the repository within its 30 recommended issues. The project maintainer then reviewed these top-30 issues and confirmed another 13 candidates, bringing the total to 18 of 30 (60\% precision at k=30). This demonstrates the model's practical utility in supporting human triage. The LLM baseline achieved low recall and precision and was less effective than the supervised learning approach. Our study provides insights for educators, students, and OSS maintainers seeking to streamline the identification of academic project tasks. It also suggests that lightweight models can uncover valuable tasks even in noisy, under-annotated repositories, pointing toward a scalable triage process.
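The percentages reported in the abstract can be reproduced from the stated counts. The sketch below is only an illustration of that arithmetic: the function names `recall` and `precision_at_k` are hypothetical helpers introduced here, not part of the authors' pipeline, and the only inputs are the counts given above (5 of 11 tagged issues found, 18 of 30 recommendations confirmed).

```python
# Illustrative arithmetic only; helper names are hypothetical,
# and the counts are taken directly from the abstract.

def recall(found: int, total_relevant: int) -> float:
    """Fraction of all known relevant items that were retrieved."""
    return found / total_relevant


def precision_at_k(relevant_in_top_k: int, k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    return relevant_in_top_k / k


K = 30                # number of issues recommended by the classifier
TAGGED_TOTAL = 11     # issues already tagged in the repository
TAGGED_FOUND = 5      # tagged issues that appear in the top 30
EXTRA_CONFIRMED = 13  # additional candidates confirmed by the maintainer

tagged_recall = recall(TAGGED_FOUND, TAGGED_TOTAL)
precision_tagged_only = precision_at_k(TAGGED_FOUND, K)
precision_after_review = precision_at_k(TAGGED_FOUND + EXTRA_CONFIRMED, K)

print(f"recall on tagged issues:      {tagged_recall:.1%}")
print(f"precision@30 (tagged only):   {precision_tagged_only:.1%}")
print(f"precision@30 (after review):  {precision_after_review:.1%}")
```

The gap between the two precision figures reflects the under-annotation the abstract mentions: counting only pre-existing tags understates how many recommended issues the maintainer considered suitable.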
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Full-length papers (i.e. case studies, theoretical, applied research papers). 8 pages
Reroute: true
Submission Number: 60