Abstract: The swift advancement in the scales and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. In addition to the pursuit of better performance and the avoidance of violent feedback on a certain prompt, to ensure the responsibility of the LLM, much attention is drawn to the robustness of LLMs. However, existing evaluation methods mostly rely on traditional question answering datasets with predefined supervised labels, potentially ignoring the superior generation capabilities of contemporary LLMs. To investigate the robustness of LLMs while using their generation ability, we propose a novel rational evaluation pipeline that leverages reward models as diagnostic tools to evaluate the long conversation generated from more challenging open questions by LLMs, which we refer to as the Reward Model for Reasonable Robustness Evaluation (TREvaL). Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions, a capability not entirely encompassed by individual words or letters.Our extensive empirical experiments demonstrate that TREvaL provides an identification for the lack of robustness of nowadays LLMs.Notably, we are surprised to discover that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted, calling for more attention on the robustness during alignment process.
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=BMKJEGNMcZ&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)
Changes Since Last Submission: 1. We filter the evaluation attacked prompts with GPT-4 to make sure the semantic do not change after different level and type perturbations.
2. We introduce a new reward model : ArmoR-Llama 3-8B-v0.1 which is aligned with human preference.
3. We introduce a new evaluation set: Alpagasus-9k.
4. We give a small human study to identify if the reward model's scores are aligned with human judgements.
5. We recorrect some typos.
Assigned Action Editor: ~Xuming_He3
Submission Number: 3228
Loading