Family Matters: Cross-Provider LLM-Judge Committees and Iterative Prompts for Turkish PropBank Argument-Frame Prediction

ACL ARR 2026 May Submission14711 Authors

26 May 2026 (modified: 13 Jun 2026) | ACL ARR 2026 May Submission | Readers: Everyone | License: CC BY 4.0

Keywords: PropBank, LLM-as-Judge, semi-automated annotation, Turkish NLP

Abstract: Large language models are increasingly used not only to label data but also to judge whether another model's labels are correct. Using Turkish PropBank argument-frame prediction as a testbed, we study two questions. First, does agreement between LLM judges reflect shared capability or shared provider family? Across four judges from two provider families (two Gemini, two OpenAI), within-family agreement consistently exceeds cross-family agreement. However, a two-judge committee combining one model from each family improves exact-match precision by 11.9 points over the unfiltered baseline, more than doubling the gain of the best single-judge filter, while adding a third, same-family judge yields little benefit. Second, how do targeted prompt revisions affect other error types? A prompt edit for Turkish intransitive activity verbs improves the intended ARG0-only class by 21.7 points but also increases ARG2 omissions by 30.9 points; a follow-up clarification partially recovers overall performance. These results suggest that LLM-judge committees benefit more from family diversity than from larger committee size, and that prompt revisions for closed semantic taxonomies should be evaluated by error class rather than by aggregate metrics alone.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: benchmarking, language resources, automatic creation and evaluation of language resources, NLP datasets, automatic evaluation of datasets, evaluation methodologies

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings

Languages Studied: Turkish

EMNLP 2026 AI Reviewing Experiment: yes

Submission Number: 14711
