From Pseudo-Code to Source Code: A Self-Supervised Search Approach

Adithya Kulkarni; Mohna Chakraborty; Yonas Afewerki Sium; Sai Charishma Valluri; Wei Le; Qi Li

Published: 06 Mar 2025, Last Modified: 19 Apr 2025, Venue: DL4C @ ICLR 2025, Readers: Everyone, License: CC BY 4.0

Track: long paper (up to 9 pages)

Keywords: Code Search, Self-Supervised Learning

TL;DR: This paper explores the feasibility of finding algorithm implementations in source code, using pseudo-code as the query.

Abstract: Identifying algorithm implementations in source code is crucial for code comprehension, reference retrieval, and program synthesis. This paper presents PC2SC, a novel framework for mapping pseudo-code to source code without manual annotations. We introduce p-language, a structured representation that encodes control flow, mathematical expressions, and natural language descriptions of algorithms. A static analyzer extracts these features and converts pseudo-code into p-code, which is then embedded into a shared vector space with source code using self-supervised learning for retrieval. Given pseudo-code as input, PC2SC returns a ranked list of matching code snippets. Evaluations on the Stony Brook Algorithm Repository and GitHub projects demonstrate that PC2SC outperforms state-of-the-art code search tools in both C and Java. It successfully retrieves correct implementations within the top 25, 10, and 1 ranked results for 98.5%, 93.8%, and 66.2% of queries, respectively. In GitHub projects, it identified 74 algorithm implementations out of 87 queries. PC2SC bridges the gap between algorithmic descriptions and executable implementations, offering a scalable, language-independent solution for algorithm retrieval and paving the way for future advancements in cross-language code search and automated synthesis.
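
Illustration: the retrieval and top-k evaluation described in the abstract can be sketched roughly as follows. This is a minimal sketch and not the PC2SC implementation. It assumes that queries and snippets have already been embedded into a shared vector space, and it swaps in plain cosine similarity for ranking, which the abstract does not confirm. The vectors, file names, and function names (`cosine`, `rank_snippets`, `top_k_accuracy`) are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_snippets(query_vec, snippet_vecs):
    """Return snippet ids ordered from most to least similar to the query."""
    return sorted(
        snippet_vecs,
        key=lambda sid: cosine(query_vec, snippet_vecs[sid]),
        reverse=True,
    )


def top_k_accuracy(rankings, gold, k):
    """Fraction of queries whose correct snippet appears in the first k results."""
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return hits / len(rankings)


# Hand-made toy vectors standing in for learned embeddings (hypothetical data).
snippets = {
    "binary_search.c": [0.9, 0.1, 0.0],
    "bubble_sort.c": [0.1, 0.9, 0.1],
    "dfs.c": [0.0, 0.2, 0.9],
}
queries = {
    "q_binary_search": [0.8, 0.2, 0.1],
    "q_dfs": [0.1, 0.6, 0.5],
}
# The correct implementation for each pseudo-code query.
gold = {"q_binary_search": "binary_search.c", "q_dfs": "dfs.c"}

rankings = {qid: rank_snippets(vec, snippets) for qid, vec in queries.items()}

for k in (1, 2):
    print(f"top-{k} accuracy: {top_k_accuracy(rankings, gold, k):.2f}")
```

In this toy data the second query ranks its correct snippet second, so the top-1 score is lower than the top-2 score. The same effect explains why the reported percentages grow from the top 1 to the top 25 results.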

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.

Submission Number: 57
