Scotch: A Semantic Code Search Engine for IDEs

04 Mar 2022, 03:33 (modified: 20 Apr 2022, 18:30) | DL4C 2022 | Readers: Everyone
Keywords: Code Search, Machine Learning on Code, CodeBERT
TL;DR: A contextual and semantic code search tool that operates within an IDE.
Abstract: Code search is the task of finding relevant code snippets given a natural language query. To facilitate real-time code search, we introduce Scotch, a semantic code search tool that runs within an IDE. Its semantic nature lets Scotch exploit the meaning of code through learned vector representations, while running inside the IDE improves developer productivity by removing the need to switch to a web browser to search for code. A code search query is often ambiguous without the surrounding context of the search. Unlike traditional search engines, which take a single line of input, Scotch can automatically infer the code context at search time and use it to produce search results. We therefore propose the task of "contextual code search" and analyze how code context can improve the relevance of search results. Since no existing dataset fits this task, we collect and contribute a dataset of about 19M functions from GitHub repositories with permissive licenses, the first large-scale openly available dataset for contextual code search. We also present a manually curated test set for assessing code ranking quality in four programming languages. We fine-tune the CodeBERT model (Feng et al., 2020) to perform code search given a natural language query, with and without surrounding code context. Results from automated as well as human evaluation suggest that including code context in search significantly improves retrieval of the correct code snippet but slightly impairs ranking quality among code snippets. Our work provides motivation and resources for future research into contextual code search. Our code and models are available at
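The idea of ranking snippets by vector similarity, with and without code context, can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the fine-tuned CodeBERT encoder is replaced by a toy bag-of-words vector, context is combined with the query by plain concatenation, and the names `embed`, `cosine`, and `rank` are hypothetical. It is not the Scotch implementation.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a learned encoder (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, candidates, context=""):
    """Rank candidate snippets by similarity to the query.

    When `context` is given, it is concatenated with the query before
    embedding (an assumed, simplistic way of using code context).
    """
    q = embed((context + " " + query).strip())
    scored = [(cosine(q, embed(c)), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda x: x[0], reverse=True)]


candidates = [
    "def read_json(path): return json.load(open(path))",
    "def read_csv(path): return list(csv.reader(open(path)))",
]

without_context = rank("read file", candidates)
with_context = rank("read file", candidates, context="import csv")
```

In this toy setup the query "read file" alone does not separate the two candidates in any meaningful way, whereas adding the surrounding line `import csv` moves the CSV reader to the top, which is the kind of disambiguation the abstract attributes to code context.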