Keywords: LLM, multi-agent systems, documentation generation, code benchmarking, programming language processing, doc-to-code, code-to-doc
TL;DR: We propose a benchmark that measures the quality of auto-generated code documentation by using the generated docstrings as input to a downstream doc-to-code generation task, scored with the pass@1 metric.
Abstract: The rise of automated documentation generation tools has created a critical need for objective evaluation methods. Current approaches to assessing documentation quality include lexical similarity metrics (e.g., BLEU) that compare generated text to human-written references, as well as more recent multi-faceted metrics that evaluate structural completeness or helpfulness. However, these methods do not directly measure whether documentation enables accurate code synthesis, a key aspect of its practical utility. This paper introduces a novel benchmark that addresses this limitation by automatically evaluating generated documentation at the repository level through a downstream doc-to-code task: code is regenerated from the documentation and checked against the associated unit tests. To validate the proposed benchmark, we evaluate documentation produced by popular open-source agent systems. Our results show that the benchmark evaluates documentation quality differently from standard reference-based and reference-free metrics. The benchmark code is available at https://doi.org/10.5281/zenodo.19253895
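As an illustration of the pass@1 scoring mentioned in the TL;DR, the following is a minimal sketch assuming the unbiased pass@k estimator commonly used in code-generation evaluation; the benchmark's exact protocol is not specified here. The function names and the shape of the `results` mapping are hypothetical, chosen only for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which pass all unit tests.

    Uses the unbiased estimator 1 - C(n - c, k) / C(n, k).
    For k = 1 this reduces to c / n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average pass@1 over tasks.

    `results` (hypothetical format) maps a task id to one boolean per
    regenerated sample: True if that sample passed the task's unit tests.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [pass_at_k(len(r), sum(r), 1) for r in results.values()]
    return sum(scores) / len(scores)
```

Under this sketch, a documentation generator whose docstrings let the code model regenerate passing code for more tasks receives a higher mean pass@1.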
Submission Number: 52