LLMs as Judges for Domain-Specific Text: Evidence from Drilling Reports

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster)
License: CC BY 4.0
Keywords: Large Language Models, LLM-as-Judge, Domain-Specific Text Generation, Data-to-Text, Multi-Criteria Evaluation, Daily Drilling Reports, Reliability
TL;DR: LLMs can judge domain-specific text, but scale and evaluation design both matter. Large models with structured rubrics align best with experts, while smaller ones amplify errors.
Abstract: Large language models are now judged by other models in many workflows. This approach scales, but it is risky in domains where facts, numbers, and terminology matter. We study this in an industrial data-to-text setting: short, structured reports generated from time-series sensor data. The task is Daily Drilling Report (DDR) sentence generation, but the lessons apply to any domain-grounded pipeline. We evaluate LLMs used as judges under three protocols: a minimal single score, a weighted multi-criteria score, and a multi-criteria scheme with external aggregation. We compare model sizes and prompt designs using agreement metrics with human experts. Larger judges improve consistency, yet prompt and aggregation choices still cause large shifts in reliability and calibration. Smaller judges fail to track numeric and terminology constraints even with a structured rubric. The takeaways are practical: good evaluation needs domain knowledge in the rubric, transparent aggregation, and stress tests that expose the failure modes of LLM-as-judge setups. Our study offers a blueprint for building such evaluations in data-to-text applications and a caution against treating general-purpose judges as drop-in replacements for expert assessment.
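The sketch below is an illustrative example (not code from the paper) of the third protocol: per-criterion judge scores are aggregated outside the model with fixed weights, then compared against expert ratings using a rank correlation. Criterion names, weights, and all numbers are hypothetical placeholders.

# Illustrative sketch: multi-criteria judging with external aggregation.
# Criteria, weights, and scores below are hypothetical, not the paper's rubric.
from scipy.stats import spearmanr

# Hypothetical rubric criteria and weights; the weighted sum is computed
# outside the judge LLM ("external aggregation").
WEIGHTS = {"factual_accuracy": 0.4, "numeric_fidelity": 0.3,
           "terminology": 0.2, "fluency": 0.1}

def aggregate(criterion_scores: dict) -> float:
    """Weighted sum of per-criterion judge scores (computed outside the LLM)."""
    return sum(WEIGHTS[c] * s for c, s in criterion_scores.items())

# Toy per-criterion judge scores (1-5) for three generated DDR sentences,
# alongside hypothetical expert ratings on the same scale.
judge_scores = [
    {"factual_accuracy": 5, "numeric_fidelity": 4, "terminology": 5, "fluency": 4},
    {"factual_accuracy": 3, "numeric_fidelity": 2, "terminology": 4, "fluency": 5},
    {"factual_accuracy": 4, "numeric_fidelity": 4, "terminology": 3, "fluency": 4},
]
expert_scores = [4.5, 2.5, 3.5]

aggregated = [aggregate(s) for s in judge_scores]
rho, p = spearmanr(aggregated, expert_scores)  # one possible agreement metric
print(f"aggregated judge scores: {aggregated}")
print(f"Spearman rho vs. experts: {rho:.2f} (p={p:.2f})")

Spearman correlation is shown here only as one example of an agreement metric; the paper reports agreement with human experts without the abstract specifying which statistics are used.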
Submission Number: 221