DiscoX: Benchmarking Discourse-Level Translation in Expert Domains

ICLR 2026 Conference Submission12880 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: translation, discourse-level, expert-level, benchmark, LLM, automatic evaluation system
TL;DR: We introduce DiscoX, a benchmark for the evaluation of LLMs on discourse- and expert-level translation tasks. We also propose Metric-S, an automatic evaluation system for translation tasks.
Abstract: The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While such translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level, expert-level Chinese–English translation. It comprises 200 professionally curated texts from 7 domains, with an average length exceeding 1,700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments of accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a substantial performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advances in LLM-based translation. Our data and code are available at https://anonymous.4open.science/r/DiscoX-5F18.
Primary Area: datasets and benchmarks
Submission Number: 12880