Track: long paper (up to 9 pages)
Keywords: Benchmark, Type Inference, Untyped Python Repo, Large Language Models, Long-Context, Repo-Level, Software Engineering
TL;DR: We introduce TypyBench with two new metrics, TypeSim and TypeCheck, revealing that LLMs can predict types for untyped repositories but struggle with repo-level type consistency.
Abstract: Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce `TypyBench`, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. `TypyBench` features two novel metrics: `TypeSim`, which captures nuanced semantic relationships between predicted and ground truth types, and `TypeCheck`, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent `TypeSim` scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. `TypyBench` provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts.
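To make the two metrics concrete, below is a minimal, hypothetical sketch of a TypeSim-style structural similarity between two Python type annotation strings. This is not the paper's actual metric; the recursive matching scheme and the 0.5 partial-credit weight are illustrative assumptions, and it assumes Python 3.9+ annotation syntax.

```python
# Hypothetical sketch of a TypeSim-style similarity (NOT the paper's metric).
# Idea: parse annotations into ASTs and give partial credit for shared structure,
# so 'list[str]' vs 'list[int]' scores higher than 'str' vs 'int'.
import ast


def _parse(annotation: str) -> ast.expr:
    """Parse a type annotation string (e.g. 'dict[str, list[int]]') into an AST expression."""
    return ast.parse(annotation, mode="eval").body


def _sim(a: ast.expr, b: ast.expr) -> float:
    """Recursively compare two annotation ASTs, crediting matching constructors and arguments."""
    # Plain names: exact match or nothing.
    if isinstance(a, ast.Name) and isinstance(b, ast.Name):
        return 1.0 if a.id == b.id else 0.0
    # Subscripted generics like list[int] or dict[str, int] (Python 3.9+ AST shape).
    if isinstance(a, ast.Subscript) and isinstance(b, ast.Subscript):
        outer = _sim(a.value, b.value)
        args_a = a.slice.elts if isinstance(a.slice, ast.Tuple) else [a.slice]
        args_b = b.slice.elts if isinstance(b.slice, ast.Tuple) else [b.slice]
        n = max(len(args_a), len(args_b))
        inner = sum(_sim(x, y) for x, y in zip(args_a, args_b)) / n if n else 1.0
        # Assumed weighting: half credit for the outer constructor, half for its arguments.
        return 0.5 * outer + 0.5 * inner
    # Mismatched shapes (e.g. 'int' vs 'list[int]') still get credit for a shared base name.
    if isinstance(a, ast.Subscript):
        return 0.5 * _sim(a.value, b)
    if isinstance(b, ast.Subscript):
        return 0.5 * _sim(a, b.value)
    return 0.0


def type_sim(predicted: str, ground_truth: str) -> float:
    """Score a predicted annotation against the ground truth in [0, 1]."""
    return _sim(_parse(predicted), _parse(ground_truth))


if __name__ == "__main__":
    print(type_sim("list[int]", "list[int]"))            # 1.0
    print(type_sim("list[str]", "list[int]"))            # 0.5 (outer constructor matches)
    print(type_sim("dict[str, str]", "dict[str, int]"))  # 0.75 (one type argument differs)
```

A TypeCheck-style consistency score, by contrast, would presumably run a static type checker over the repository with the predicted annotations inserted and count resulting errors; that step is not sketched here.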
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 42