Repo4QA: Answering Complex Coding Questions via Dense Retrieval on GitHub Repositories

Anonymous

16 Nov 2021 (modified: 05 May 2023) | ACL ARR 2021 November Blind Submission | Readers: Everyone
Abstract: Open-source platforms such as GitHub and Stack Overflow play important roles in the software ecosystem. Programmers often post specific programming questions on coding forums such as Stack Overflow, where the answers guide them to actual solutions in GitHub repositories; this process is crucial but time-consuming. We aim to accelerate it, and we find that traditional Information Retrieval based methods fail to handle the long and complex questions in coding forums and thus cannot find suitable code repositories. To bridge the semantic gap between repositories and real-world coding questions effectively and efficiently, we introduce a specialized dataset named Repo4QA, which includes over 12,000 question-repository pairs constructed from Stack Overflow and GitHub. Furthermore, we propose QuReCL, a contrastive learning model based on CodeBERT, to jointly learn the representations of both questions and repositories. Experimental results demonstrate that our model captures the semantic features of both questions and repositories through joint embedding and outperforms existing state-of-the-art methods.
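The abstract does not give the loss function or encoder configuration used by QuReCL, so the following is only a generic sketch of how in-batch contrastive learning is commonly set up for dense retrieval, not the authors' implementation. It assumes each question and repository has already been encoded into a fixed-length vector (for example by a CodeBERT-style encoder, which is not included here); the function names, the toy vectors, and the temperature value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(question_vecs, repo_vecs, temperature=0.05):
    """InfoNCE-style loss over a batch of paired embeddings.

    Assumption: question i is paired with repository i, and every other
    repository in the batch is treated as a negative for that question.
    """
    total = 0.0
    for i, q in enumerate(question_vecs):
        logits = [cosine(q, r) / temperature for r in repo_vecs]
        # Numerically stable log-sum-exp over all repositories in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the paired repository.
        total += log_denom - logits[i]
    return total / len(question_vecs)


def rank_repositories(question_vec, repo_vecs):
    """Return repository indices sorted by descending similarity to the question."""
    scores = [cosine(question_vec, r) for r in repo_vecs]
    return sorted(range(len(repo_vecs)), key=lambda j: scores[j], reverse=True)


# Toy embeddings standing in for encoder outputs (hypothetical values).
questions = [[1.0, 0.0], [0.0, 1.0]]
repos = [[1.0, 0.1], [0.1, 1.0]]

loss = in_batch_contrastive_loss(questions, repos)
ranking = rank_repositories(questions[0], repos)
```

In this setup the loss is small when each question embedding lies closest to its paired repository embedding, and retrieval at query time reduces to ranking repositories by similarity in the shared embedding space.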