Beyond Lexical Similarity: A Benchmark for Evaluating Code Documentation Agents

Andrey Getmanov; Timofey Karyagin; Ulyanova Ekaterina Alekseevna; Adamenko Pavel; Ilya Sokolov; Pavel Zadorozhny; Rodion Levichev; Nikolay Nikitin; Dmitrii Babaev

Published: 16 Jun 2026, Last Modified: 16 Jun 2026. Venue: ICML 2026 Workshop DL4C. License: CC BY-NC 4.0

Keywords: LLM, multi-agent systems, documentation generation, code benchmarking, programming language processing, doc-to-code, code-to-doc

TL;DR: We propose a benchmark that measures the quality of auto-generated code documentation by using the generated docstrings as input to a downstream doc-to-code generation task, scored with the pass@1 metric.

Abstract: The rise of automated documentation generation tools has created a critical need for objective evaluation methods. Current approaches to assessing documentation quality include lexical similarity metrics (e.g., BLEU) that compare generated text to human-written references, as well as more recent multi-faceted metrics that evaluate structural completeness or helpfulness. However, these methods do not directly measure whether documentation enables accurate code synthesis, a key aspect of its practical utility. This paper introduces a benchmark that addresses this limitation: it automatically evaluates generated documentation at the repository level by measuring how well the code can be regenerated from that documentation, using the associated unit tests to check the regenerated code. To validate the proposed benchmark, we evaluate documentation produced by popular open-source agent systems. Our results show that the benchmark evaluates documentation quality differently from standard reference-based and reference-free metrics. The benchmark code is available at https://doi.org/10.5281/zenodo.19253895
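
For readers unfamiliar with the pass@1 metric named in the TL;DR, the sketch below shows one common way such a score is computed from unit-test outcomes. It is an illustration only: the abstract does not say how the benchmark aggregates results, so the widely used unbiased pass@k estimator is assumed here, and the names `pass_at_k`, `benchmark_score`, and the `results` layout are hypothetical rather than taken from the benchmark's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of code samples regenerated from the documentation
    c: number of those samples that pass all associated unit tests
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: dict[str, list[bool]], k: int = 1) -> float:
    """Mean pass@k over tasks (assumed aggregation, not the paper's).

    results maps a hypothetical task id to per-sample unit-test outcomes.
    """
    scores = [pass_at_k(len(r), sum(r), k) for r in results.values()]
    return sum(scores) / len(scores)
```

With k = 1 the estimator reduces to the fraction of regenerated samples that pass the tests, so a single sample per task yields a score of either 0 or 1 for that task.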

Submission Number: 52
