MultiTrust: Enhancing Safety and Trustworthiness of Large Language Models from Multiple Perspectives
Keywords: Large Language Models, Safety, Trustworthiness, Robustness, Fairness, Truthfulness
TL;DR: We introduce MultiTrust, a novel framework that improves the safety and trustworthiness of large language models across multiple trustworthiness perspectives by leveraging challenging data generation, safe model learning, and safe model augmentation.
Abstract: Large Language Models (LLMs) have shown impressive performance across various tasks, yet they still face significant safety and trustworthiness challenges, such as robustness, fairness, and truthfulness. Addressing these challenges is critical for the reliable deployment of LLMs. Directly fine-tuning LLMs to enhance safety can degrade their performance and is challenging to balance across multiple safety perspectives due to the forgetting phenomenon. In this paper, we propose MultiTrust, a novel and scalable framework designed to enhance LLM safety from multiple safety perspectives. In particular, MultiTrust first generates challenging training data through adversarial optimization, focusing on LLM trustworthiness perspectives such as robustness, fairness, and safety. MultiTrust then trains a separate safety auxiliary model for each perspective using supervised fine-tuning and Direct Preference Optimization (DPO). MultiTrust augments a base model with these safety auxiliary models on the fly through dynamic routing and logit ensembling, significantly boosting the base model's performance across different trustworthiness metrics while preserving its helpfulness. Notably, MultiTrust introduces an effective perplexity-based inference-time router that seamlessly integrates these safety auxiliary models by averaging the logit outputs of the selected safety auxiliary model and the base model, which enhances the stability of the final performance. Moreover, MultiTrust's flexible design allows augmentation with new safety auxiliary models for different perspectives without necessitating additional training or adaptation. Extensive experimental results show that MultiTrust, which trains a series of 7B safety auxiliary models, significantly improves the trustworthiness of base LLMs of different sizes (7B and 13B). For instance, MultiTrust increases the average performance of Llama2-13B from 35.54% to 51.14% and of Vicuna-13B from 29.91% to 52.82%, outperforming models of similar and even larger sizes across different perspectives. These results underscore the effectiveness and scalability of MultiTrust in enhancing the safety and reliability of LLMs.
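To make the inference-time augmentation described above concrete, below is a minimal sketch of perplexity-based routing combined with logit ensembling. It is not the authors' released implementation: the "lowest prompt perplexity wins" routing rule, the equal-weight logit average, greedy decoding, and the assumption that all models are Hugging Face-style causal LMs exposing `.logits` are illustrative choices inferred from the abstract.

```python
# Illustrative sketch of MultiTrust-style inference-time augmentation (assumptions noted above):
# a perplexity-based router picks one safety auxiliary model per prompt, and the next-token
# logits of that auxiliary model are averaged with the base model's logits during decoding.
import torch
import torch.nn.functional as F


@torch.no_grad()
def prompt_perplexity(model, input_ids: torch.Tensor) -> float:
    """Perplexity of the prompt under `model` (standard next-token negative log-likelihood)."""
    logits = model(input_ids).logits                      # [1, T, V]
    shift_logits = logits[:, :-1, :]                      # predictions for tokens 2..T
    shift_labels = input_ids[:, 1:]                       # targets 2..T
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    return torch.exp(nll).item()


@torch.no_grad()
def generate_with_multitrust(base_model, aux_models: dict, input_ids,
                             max_new_tokens: int = 64, eos_id: int | None = None):
    """Route to the auxiliary model with the lowest prompt perplexity (assumed routing rule),
    then decode greedily from the averaged logits of the base and selected auxiliary model."""
    # 1) Perplexity-based routing: the auxiliary model that finds the prompt most
    #    "in-distribution" is assumed to cover the relevant trustworthiness perspective.
    ppl = {name: prompt_perplexity(m, input_ids) for name, m in aux_models.items()}
    selected = min(ppl, key=ppl.get)
    aux = aux_models[selected]

    # 2) Logit ensembling: average base and selected-auxiliary logits at every decoding step.
    for _ in range(max_new_tokens):
        base_logits = base_model(input_ids).logits[:, -1, :]
        aux_logits = aux(input_ids).logits[:, -1, :]
        next_token = torch.argmax(0.5 * (base_logits + aux_logits), dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if eos_id is not None and next_token.item() == eos_id:
            break
    return input_ids, selected
```

Because routing happens once per prompt and ensembling only mixes output logits, new safety auxiliary models can be added to `aux_models` without retraining the base model, which is consistent with the plug-and-play design claimed in the abstract.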
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9173