pydra: Probing Code Representations With Synthetic Clones and Bugs

Published: 22 Sept 2025, Last Modified: 25 Nov 2025
Venue: DL4C @ NeurIPS 2025 Poster
License: CC BY 4.0
Keywords: Data Sets or Data Repositories, Embedding Approach
TL;DR: A dataset of synthetically generated clones and bugs for Python code, plus an empirical analysis of code embedding models.
Abstract: We introduce \texttt{pydra}: an open-source dataset of $\sim$9k Python examples with synthetic clones and buggy variants for each. Our augmentation pipeline generates both semantics-preserving and bug-injecting code variants via AST transforms and stores rich metadata for analysis. Using \texttt{pydra}, we probe state-of-the-art code embedding models and find a stark limitation in their ability to rank correct variants above incorrect ones. Our analysis suggests that embeddings remain dominated by token overlap and code length rather than true program semantics. We hope that \texttt{pydra} serves the research community by filling several gaps in the Python code dataset ecosystem and by providing a general tool for training and evaluating code embedding models.
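To illustrate the kind of AST-level augmentation the abstract describes, the sketch below shows one semantics-preserving transform (identifier renaming) and one bug-injecting transform (comparison-operator flip) built on Python's standard `ast` module. The transform names and the sample function are hypothetical examples for illustration, not the actual pydra pipeline.

```python
import ast


class RenameLocals(ast.NodeTransformer):
    """Semantics-preserving transform: rename identifiers per a fixed mapping.

    Illustrative sketch only; not the actual pydra transform set.
    """

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_arg(self, node):
        # Rename function parameters.
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node

    def visit_Name(self, node):
        # Rename every use of the mapped identifiers.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node


class FlipComparison(ast.NodeTransformer):
    """Bug-injecting transform: replace `<` with `<=`, creating a boundary bug."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node


src = "def count_below(xs, t):\n    return sum(1 for x in xs if x < t)\n"

clone = ast.unparse(ast.fix_missing_locations(
    RenameLocals({"xs": "values", "t": "threshold"}).visit(ast.parse(src))))
buggy = ast.unparse(ast.fix_missing_locations(
    FlipComparison().visit(ast.parse(src))))

print(clone)  # same behavior, different identifiers (a "clone")
print(buggy)  # subtle off-by-one bug: `<` became `<=`
```

A probing setup in this spirit would embed the original, the clone, and the buggy variant, then check whether the clone is ranked closer to the original than the buggy variant is.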
Submission Number: 78