TL;DR: We introduce TypyBench to evaluate LLMs' Python type inference, with new metrics for type similarity and consistency. While LLMs perform well on individual types, most struggle with complex nested types and cross-file consistency.
Abstract: Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce `TypyBench`, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. `TypyBench` features two novel metrics: `TypeSim`, which captures nuanced semantic relationships between predicted and ground truth types, and `TypeCheck`, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent `TypeSim` scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. `TypyBench` provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts. Our code and data are available at https://github.com/typybench/typybench.
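To make the similarity idea concrete, below is a minimal sketch of how one might score a predicted type annotation against a ground-truth annotation by comparing their parsed structure, giving partial credit when the outer constructor matches but the nested arguments differ. This is only an illustration: the function names, the 50/50 weighting, and the parsing strategy are assumptions, not the actual `TypeSim` definition from the paper or repository.

```python
# Hypothetical illustration of a TypeSim-style structural comparison.
# The scoring scheme (names, weights, parsing) is an assumption, not the
# metric defined by TypyBench.
import ast


def _parse(annotation: str) -> ast.expr:
    """Parse an annotation string like 'dict[str, list[int]]' into an AST."""
    return ast.parse(annotation, mode="eval").body


def _similarity(a: ast.expr, b: ast.expr) -> float:
    """Recursively compare two annotation ASTs, giving partial credit when
    the outer constructors match but the inner arguments differ."""
    # Plain names: exact match or nothing.
    if isinstance(a, ast.Name) and isinstance(b, ast.Name):
        return 1.0 if a.id == b.id else 0.0
    # Subscripted generics such as list[int] or dict[str, int].
    if isinstance(a, ast.Subscript) and isinstance(b, ast.Subscript):
        outer = _similarity(a.value, b.value)
        a_args = a.slice.elts if isinstance(a.slice, ast.Tuple) else [a.slice]
        b_args = b.slice.elts if isinstance(b.slice, ast.Tuple) else [b.slice]
        n = max(len(a_args), len(b_args))
        inner = sum(_similarity(x, y) for x, y in zip(a_args, b_args)) / n
        return 0.5 * outer + 0.5 * inner  # arbitrary equal weighting
    return 0.0


def toy_type_sim(predicted: str, ground_truth: str) -> float:
    """Score a predicted annotation against the ground truth in [0, 1]."""
    return _similarity(_parse(predicted), _parse(ground_truth))


if __name__ == "__main__":
    print(toy_type_sim("list[int]", "list[int]"))       # 1.0: exact match
    print(toy_type_sim("list[str]", "list[int]"))       # 0.5: outer matches only
    print(toy_type_sim("dict[str, int]", "list[int]"))  # 0.0: different constructors
```

A graded score of this kind is what distinguishes a similarity metric from exact-match accuracy: a prediction of `list[str]` for a ground truth of `list[int]` is rewarded more than an unrelated type such as `dict[str, int]`.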
Lay Summary: Figuring out the specific data types used in flexible programming languages like Python can be a real headache for software developers. While the powerful AI models known as LLMs are good at understanding code, we didn't know how well they could handle this specific task on a large scale.
To find out, we created TypyBench, a new test to see how accurately these AIs can predict data types across entire software projects. We developed two new ways to measure their performance: one that checks if the predicted type is close in meaning to the correct one, and another that verifies if the AI's predictions are consistent throughout the code.
Our tests on 50 high-quality Python projects revealed that while the AIs are pretty good at guessing the general meaning of types, they often make mistakes with more complicated ones and create inconsistencies within the same project. This shows that future efforts should focus on making AI predictions more consistent, and TypyBench provides a tool to guide this research.
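One plausible way to operationalize the consistency check described above is to write the predicted annotations back into the repository and run an off-the-shelf type checker over it. The sketch below uses mypy's Python API and counts reported errors; the exact checker invocation and error counting are assumptions, not necessarily how `TypeCheck` is computed in TypyBench.

```python
# Hypothetical illustration of a TypeCheck-style consistency measure:
# run mypy over a repository whose annotations were filled in by a model
# and count the diagnostics it reports.
from mypy import api


def count_type_errors(repo_path: str) -> int:
    """Run mypy over an annotated repository and count reported errors."""
    stdout, _stderr, _exit_status = api.run([
        "--ignore-missing-imports",  # third-party stubs may be unavailable
        "--no-error-summary",        # keep only per-line diagnostics
        repo_path,
    ])
    return sum(1 for line in stdout.splitlines() if ": error:" in line)


if __name__ == "__main__":
    # Example placeholder path: a repository after inserting model-predicted types.
    print(count_type_errors("path/to/annotated_repo"))
```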
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/typybench/typybench
Primary Area: General Machine Learning->Evaluation
Keywords: Benchmark, Type Inference, Untyped Python Repo, Large Language Models, Long-Context, Repo-Level, Software Engineering
Submission Number: 14734