LaTeXBench: Judge-Only Evaluation of LaTeX Generation, Minimal-Edit Compliance, and Blind Contrast Errors

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
CC BY 4.0
Keywords: LaTeX, benchmarking, LLM-as-judge, structured generation, minimal-edit compliance, fault detection, reproducibility, automatic grading, Wilson interval, document tooling
Abstract: Large language models are increasingly used to author scientific documents in LaTeX, where success depends on structural validity, precise constraint adherence, and fault awareness beyond surface fluency. We present LaTeXBench, a compact, judge-only benchmark targeting three structure-aware abilities: (1) Generation: produce syntactically valid LaTeX that satisfies explicit structural requirements; (2) Edit-Compliance: apply only the requested edits while preserving unrelated content byte-for-byte; and (3) Blind Contrast: detect and classify a single seeded fault from a closed taxonomy. A single deterministic workbook provides 150 items (50 per family). Scoring is fully automatic via strict JSON outputs from an LLM judge, with Wilson binomial intervals to quantify small-n uncertainty. We release prompts, runners, seeds, and plotting scripts to support transparent replication. Evaluations across production models show high contrast detection and specificity but notably lower compliance on minimal-edit tasks, underscoring structure-preserving editing as a key bottleneck. LaTeXBench offers an inexpensive, auditable base layer for measuring code-like behaviors in document tooling and for guiding future models specialized for LaTeX structure and edits.
Submission Number: 238
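Note on the reported intervals: the abstract states that per-family scores are accompanied by Wilson binomial intervals to quantify small-n uncertainty. The sketch below illustrates how such an interval is computed for a score out of 50 items; the helper name and the 42/50 example count are illustrative assumptions, not values taken from the paper.

import math

def wilson_interval(successes, n, z=1.96):
    # Two-sided Wilson score interval for a binomial proportion
    # (z = 1.96 gives an approximate 95% interval).
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half

# Hypothetical example: 42 of the 50 items in one task family judged correct.
lo, hi = wilson_interval(42, 50)
print(f"accuracy = {42/50:.2f}, 95% Wilson CI = [{lo:.3f}, {hi:.3f}]")

With only 50 items per family, the interval remains fairly wide (roughly 0.71 to 0.91 in this hypothetical case), which is the small-n uncertainty the benchmark reports alongside point accuracies.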