Track: long paper (up to 9 pages)
Keywords: Code Search, Self-Supervised Learning
TL;DR: This paper explores the feasibility of finding algorithm implementations in source code, using pseudo-code as the query.
Abstract: Identifying algorithm implementations in source code is crucial for code comprehension, reference retrieval, and program synthesis. This paper presents PC2SC, a novel framework for mapping pseudo-code to source code without manual annotations. We introduce p-language, a structured representation that encodes the control flow, mathematical expressions, and natural language descriptions of algorithms. A static analyzer extracts these features and converts pseudo-code into p-code, which is then embedded in a shared vector space with source code using self-supervised learning for retrieval.
Given pseudo-code as input, PC2SC returns a ranked list of matching code snippets. Evaluations on the Stony Brook Algorithm Repository and GitHub projects demonstrate that PC2SC outperforms state-of-the-art code search tools in both C and Java. It retrieves a correct implementation within the top 25, top 10, and top 1 ranked results for 98.5%, 93.8%, and 66.2% of queries, respectively. On GitHub projects, it identified algorithm implementations for 74 of 87 queries.
PC2SC bridges the gap between algorithmic descriptions and executable implementations, offering a scalable, language-independent solution for algorithm retrieval and paving the way for future advancements in cross-language code search and automated synthesis.
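The retrieval and evaluation steps described in the abstract can be illustrated with a minimal sketch. It assumes that queries and snippets have already been embedded in the shared vector space, and it uses cosine similarity for ranking, which is an assumption for illustration rather than PC2SC's documented scoring function. The function names, file names, and toy vectors below are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_snippets(query_vec, snippet_vecs):
    """Return snippet ids ordered by decreasing similarity to the query."""
    return sorted(
        snippet_vecs,
        key=lambda sid: cosine(query_vec, snippet_vecs[sid]),
        reverse=True,
    )

def top_k_hit_rate(rankings, relevant, k):
    """Fraction of queries whose correct snippet appears in the top k results."""
    hits = sum(1 for q, ranked in rankings.items() if relevant[q] in ranked[:k])
    return hits / len(rankings)

# Hypothetical toy embeddings (2-D for readability; not real model output).
snippets = {"bfs.c": [1.0, 0.0], "dfs.c": [0.0, 1.0], "sort.c": [0.7, 0.7]}
queries = {"bfs": [0.9, 0.1], "dfs": [0.6, 0.8]}
relevant = {"bfs": "bfs.c", "dfs": "dfs.c"}

rankings = {q: rank_snippets(vec, snippets) for q, vec in queries.items()}
for k in (1, 2):
    print(f"top-{k} hit rate: {top_k_hit_rate(rankings, relevant, k):.2f}")
```

The top-25, top-10, and top-1 figures reported in the abstract correspond to this kind of hit rate computed at k = 25, 10, and 1 over the full query set.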
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Adithya_Kulkarni1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 57