Enabling Weak LLMs to Judge Response Reliability via Meta Ranking

25 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: LLM Response Judgement, LLM Reliability, Weak Language Model, Model Cascading, Data Selection
TL;DR: We develop Meta Ranking, a practically effective and efficient few-shot method that enables weak language models to judge the reliability of LLM responses.
Abstract: Despite the strong performance of large language models (LLMs) across a wide range of tasks, they still have reliability issues. Previous studies indicate that strong LLMs like GPT-4-turbo excel at evaluating the reliability of responses from LLMs, but face efficiency and local deployment issues. Thus, to enable weak LLMs to effectively assess the reliability of LLM responses, we propose a novel cross-query-comparison-based method called $\textit{Meta Ranking}$ (MR). Unlike previous few-shot methods that rely solely on the in-context learning capabilities of LLMs, MR assesses reliability by pairwise ranking the target query-response pair against multiple reference query-response pairs. We find that MR is highly effective in error detection for LLM responses: even with weaker LLMs that have lower task performance, MR achieves higher judgement precision than baselines using the same or even stronger models. Moreover, the method requires as few as five reference samples, significantly improving efficiency. We further demonstrate that MR can enhance strong LLMs' performance in two practical applications: model cascading and instruction tuning. In model cascading, we combine open- and closed-source LLMs to achieve performance comparable to GPT-4-turbo at lower cost. In instruction tuning, we use MR for iterative training data filtering, significantly reducing data processing time and enabling LLaMA-7B and Phi-2 to surpass 13B models with fewer training tokens. These results underscore the high potential of MR in both efficiency and effectiveness.
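
To make the cross-query comparison concrete, below is a minimal Python sketch of the idea as the abstract describes it. The names (`QAPair`, `meta_rank`, `toy_compare`), the ±1 encoding of the pairwise ranking, and the majority-style aggregation are illustrative assumptions rather than the paper's exact procedure; in practice, `compare` would prompt a weak LLM with both query-response pairs and parse which one it ranks as more reliable.

```python
# Hypothetical sketch of Meta Ranking (MR): judge a target response by
# pairwise ranking it against a few labelled reference query-response pairs.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QAPair:
    query: str
    response: str
    is_reliable: Optional[bool] = None  # known label for reference pairs


def meta_rank(
    target: QAPair,
    references: List[QAPair],                  # e.g. as few as five labelled pairs
    compare: Callable[[QAPair, QAPair], int],  # +1: target ranked higher, -1: lower, 0: tie
    threshold: float = 0.0,
) -> bool:
    """Judge the target's reliability via pairwise ranking against references.

    Each comparison is combined with the reference's reliability label:
    being ranked above a reliable reference (or below an unreliable one)
    counts as evidence that the target response is reliable, and vice versa.
    """
    score = 0.0
    for ref in references:
        outcome = compare(target, ref)           # weak LLM's pairwise judgement
        evidence = 1.0 if ref.is_reliable else -1.0
        score += outcome * evidence
    return score / len(references) > threshold   # simple threshold aggregation


def toy_compare(target: QAPair, ref: QAPair) -> int:
    # Stand-in for a weak-LLM ranking call; a real implementation would prompt
    # the model with both pairs and parse which response it judges more reliable.
    return 1 if len(target.response) >= len(ref.response) else -1


if __name__ == "__main__":
    refs = [
        QAPair("What is the capital of France?", "Paris.", is_reliable=True),
        QAPair("What is 2 + 2?", "5", is_reliable=False),
    ]
    tgt = QAPair("What is the capital of Japan?", "Tokyo is the capital of Japan.")
    print(meta_rank(tgt, refs, toy_compare))  # True -> judged reliable
```

The same judgement could plausibly drive the model-cascading application: if `meta_rank` deems a cheap open-source model's response unreliable, the query is escalated to a stronger closed-source model, which is one way to approach GPT-4-turbo-level quality at lower cost.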
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4851