Abstract: This study introduces ``12-Angry LLMs,'' a novel annotation and classification model that leverages annotator disagreement to improve complex stance detection. Departing from traditional methods that average out divergence, we deploy a diverse panel of 12 LLMs that engage in a two-stage process: independent voting (Round A) followed by collective deliberation (Round B) when disagreement occurs. We demonstrate that the rationales generated during deliberation serve as critical signals for fine-tuning a Judge model. On the RUStance-2023 dataset, this Judge model achieves higher performance (F1 $\approx$ 0.81) than single-model baselines and standard aggregation methods. The approach also proves highly transferable, achieving an F1 score of 0.94 on the out-of-domain PStance dataset using few-shot prompting with jury rationales. We contribute a new dataset containing expert labels alongside full jury deliberation traces, establishing a paradigm in which model divergence is used as a diagnostic tool for uncertainty and interpretability rather than treated as noise.
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=uwHCduswba&noteId=uwHCduswba
Changes Since Last Submission: Dear Editor-in-Chief,
Transactions on Machine Learning Research (TMLR)
Thank you for carefully reading our paper and for the constructive feedback. We appreciate the reviewer’s concerns regarding the experimental evaluation, statistical testing, failure analysis, and comparative study. In response to these comments, we have substantially revised the paper and added new experiments and analyses. Below, we address each concern.
1. Limited experiments on only one dataset
In the revised version of the paper, we expanded the experiments to include three additional stance detection datasets: PStance, SemEval-2016 Task 6, and GWStance. All datasets were processed using the same jury-deliberation pipeline and evaluation setup to ensure fair comparison. The results show that the proposed jury-based model generalizes across multiple datasets and domains, not only RUStance-2023.
2. Lack of statistical significance testing
In the revised paper, we added statistical significance testing using paired bootstrap resampling on Macro-F1 scores across different model configurations. The results show that the improvements from few-shot learning with jury rationales over zero-shot prompting are statistically significant, and the fine-tuned Judge model significantly outperforms the few-shot configuration. This confirms that the performance improvements are not due to random variation.
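For readers unfamiliar with the test, the following is a minimal sketch of paired bootstrap resampling on Macro-F1. It is an illustration under assumptions, not the exact procedure used in the paper: the function names `macro_f1` and `paired_bootstrap`, the number of resamples, and the one-sided p-value definition are all hypothetical choices made here.

```python
import random


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap(gold, pred_a, pred_b, n_resamples=1000, seed=0):
    """Estimate how often system B fails to beat system A on resampled test sets.

    The same resampled indices are applied to both systems (hence "paired").
    Returns the observed Macro-F1 difference (B minus A) and a one-sided
    p-value: the fraction of resamples in which B is not better than A.
    """
    rng = random.Random(seed)
    n = len(gold)
    observed = macro_f1(gold, pred_b) - macro_f1(gold, pred_a)
    not_better = 0
    for _ in range(n_resamples):
        # Sample n test items with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if macro_f1(g, b) - macro_f1(g, a) <= 0:
            not_better += 1
    return observed, not_better / n_resamples
```

Here `pred_a` and `pred_b` would be the predictions of two configurations (for example, zero-shot versus few-shot with jury rationales) on the same test items; a small p-value suggests the observed gain is unlikely to be an artifact of the particular test sample.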
3. Limited failure analysis
We have added a new Failure Analysis section and a detailed qualitative analysis in the appendix. We analyzed misclassified examples and grouped errors into common categories, including implicit stance, sarcasm and irony, mixed stance (multiple targets), target confusion, ambiguity boundary cases, and context-dependent interpretation. This analysis indicates that many errors are caused by genuine ambiguity and subjective interpretation rather than simple model errors, which supports the paper's main idea that disagreement between models reflects uncertainty in subjective tasks.
4. Missing comparison with other multi-LLM methods
To address this issue, we added a self-consistency baseline, a widely used method that samples multiple outputs from the same model and aggregates them by majority voting. We tested self-consistency across different sample sizes and compared it with our multi-LLM jury approach. The results indicate that self-consistency improves performance over the zero-shot, text-only LLM baseline, but the multi-LLM jury performs better overall. This comparison separates sampling diversity (self-consistency) from model diversity (multi-LLM jury) and demonstrates the advantage of using multiple different models rather than multiple samples from the same model.
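The distinction between the two aggregation schemes can be sketched as follows. This is a simplified illustration, assuming each model is a plain callable that maps a text to a stance label; the names `majority_vote`, `self_consistency`, and `jury_vote` are hypothetical, and the sketch covers only the voting step, not the deliberation round.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def self_consistency(model, text, n_samples):
    """Sampling diversity: query ONE model n_samples times and aggregate."""
    return majority_vote([model(text) for _ in range(n_samples)])


def jury_vote(models, text):
    """Model diversity: query SEVERAL different models once each and aggregate."""
    return majority_vote([model(text) for model in models])
```

With a stochastic model (temperature above zero), `self_consistency` varies only the sampling noise of a single model, whereas `jury_vote` varies the model itself, which is the contrast the comparison above is meant to isolate.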
In summary, the revised paper now includes the following:
- Experiments on four datasets instead of one.
- Statistical significance testing.
- A new section for failure analysis and qualitative error analysis.
- A self-consistency baseline for comparison with our multi-LLM jury.
We thank the reviewer again for the valuable feedback. The suggested revisions significantly improved the quality of the paper, and we hope the revised version addresses the concerns and can be reconsidered for publication.
Assigned Action Editor: ~Branislav_Kveton1
Submission Number: 9365