Abstract: Benchmark frameworks and datasets allow us to analyze and understand the knowledge that NLP models capture about the world they were trained on. Several Transformer models have recently been adapted to code-related tasks such as code search, in which the goal is to find the most semantically relevant code given a query written in natural language. To achieve satisfactory performance, retrieval models rely heavily on the quality of the query. In this paper, we introduce the Natural Language Code Search Robustness Benchmark (COBE), which provides a more holistic evaluation of state-of-the-art models, considering several aspects of retrieval models: (i) retrieval capabilities measured with multiple ranking metrics; (ii) robustness to a wide range of input perturbations; (iii) efficiency in terms of training and retrieval times; and (iv) stability across fine-tuning runs. We shed light on important questions, showing that computing performance-based retrieval metrics alone does not suffice to evaluate this kind of model. The proposed benchmark introduces novel metrics and measurement strategies that allow a rigorous quantitative analysis of input-query robustness while providing an understanding of model generalization behavior. We perform an extensive set of experiments using state-of-the-art models such as CodeBERT, GraphCodeBERT, and CodeT5. These models are fine-tuned across many different scenarios in six programming languages. Several models trained in this study outperform their state-of-the-art counterparts, providing evidence that the standard fine-tuning approach used in related code search work is suboptimal. The proposed benchmark is a powerful tool for evaluating code search models, offering insights into how they behave during fine-tuning and how they interpret input queries.
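For context, code search systems are commonly scored with ranking metrics such as Mean Reciprocal Rank (MRR). The sketch below is a minimal, illustrative implementation (not code from the paper) of how such a metric can be computed from ranked candidate lists; the function name and data layout are assumptions for illustration.

```python
from typing import List

def mean_reciprocal_rank(ranked_results: List[List[str]], relevant_ids: List[str]) -> float:
    """Compute MRR given, for each query, a ranked list of candidate code IDs
    and the single relevant (ground-truth) code ID."""
    reciprocal_ranks = []
    for candidates, gold in zip(ranked_results, relevant_ids):
        if gold in candidates:
            rank = candidates.index(gold) + 1   # ranks are 1-based
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)        # relevant snippet not retrieved
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example: two queries; the gold snippet is ranked 1st and 3rd, so MRR = (1 + 1/3) / 2
print(mean_reciprocal_rank([["a", "b"], ["c", "d", "a"]], ["a", "a"]))  # ~0.667
```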