Family Matters: Cross-Provider LLM-Judge Committees and Iterative Prompts for Turkish PropBank Argument-Frame Prediction
Keywords: PropBank, LLM-as-Judge, semi-automated annotation, Turkish NLP
Abstract: Large language models are increasingly used not only to label data but
also to judge whether another model's labels are correct. Using Turkish
PropBank argument-frame prediction as a testbed, we study two questions.
First, does agreement between LLM judges reflect shared capability or
shared provider family? Across four judges from two provider families
(two Gemini, two OpenAI), within-family agreement consistently exceeds
cross-family agreement. However, a two-judge committee combining one
model from each family improves exact-match precision by 11.9 points
over the unfiltered baseline, more than doubling the gain of the best
single-judge filter, while adding a third same-family judge yields
little benefit. Second, how do targeted prompt revisions affect other
error types? A prompt edit for Turkish intransitive activity verbs
improves the intended ARG0-only class by 21.7 points but also increases
ARG2 omissions by 30.9 points; a follow-up clarification partially
recovers overall performance. These results suggest that LLM-judge
committees benefit more from family diversity than from larger committee
size, and that prompt revisions for closed semantic taxonomies should
be evaluated by error class rather than by aggregate metrics alone.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, language resources, automatic creation and evaluation of language resources, NLP datasets, automatic evaluation of datasets, evaluation methodologies
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Turkish
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 14711