MetaSim: A Search Engine for Finding Similar GitHub Repositories

Published: 01 Jan 2024, Last Modified: 20 May 2025, ICSME 2024, CC BY-SA 4.0
Abstract: How can we find other repositories on GitHub that are functionally similar to a specific repository? While GitHub offers keyword-based search, no existing tool supports query by example to search for and compare functionally similar repositories. To address this gap, we present MetaSim: a search engine that finds similar GitHub repositories based on repository metadata features. MetaSim employs a customized technique to represent repository metadata in the embedding space for efficient indexing and searching. We construct a curated dataset of 267.6K public GitHub repositories to support our search engine. We evaluate our tool through a manual assessment of 202 query-by-example repositories and their corresponding matches. Experimental results demonstrate that the Readme alone achieves a high similarity precision (90.1%), a metric we define later. The combined use of Description, Topics, and Readme yields the best overall performance, with a similarity precision of 97.8%. To foster both research and practical applications, we open-source our research artifacts through the MetaSim platform at https://metasim-app.github.io. The demonstration video of MetaSim is available at https://youtu.be/HnFnN3JclQw.
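To make the query-by-example idea concrete, the sketch below indexes a few repositories by their Description, Topics, and Readme and retrieves the nearest neighbor of a query repository. It is only an illustration under stated assumptions: the bag-of-words vectors and cosine similarity stand in for MetaSim's customized embedding technique, which the abstract does not specify, and the repository names, metadata, and function names (`embed`, `cosine`, `most_similar`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(description, topics, readme):
    """Toy embedding (an assumption, not MetaSim's method):
    a bag-of-words count vector over the combined metadata fields."""
    text = " ".join([description, " ".join(topics), readme]).lower()
    return Counter(re.findall(r"[a-z0-9]+", text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query_name, index, k=3):
    """Return the names of the k indexed repositories closest to the query."""
    query_vec = index[query_name]
    scored = [
        (cosine(query_vec, vec), name)
        for name, vec in index.items()
        if name != query_name
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Hypothetical repositories: (description, topics, readme).
repos = {
    "alice/fastjson": (
        "Fast JSON parser",
        ["json", "parser"],
        "A fast JSON parser and serializer for Python.",
    ),
    "bob/quickjson": (
        "Quick JSON parsing library",
        ["json", "parsing"],
        "Parse JSON documents quickly in Python.",
    ),
    "carol/imgresize": (
        "Image resizing tool",
        ["image", "resize"],
        "Resize and crop images from the command line.",
    ),
}

# Build the index once, then query it by example.
index = {name: embed(*fields) for name, fields in repos.items()}
results = most_similar("alice/fastjson", index, k=1)
print(results)
```

A production system would likely replace the count vectors with learned text embeddings and the linear scan with an approximate nearest-neighbor index, but the query-by-example flow (embed metadata, index, rank by similarity) stays the same.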