Safer Large Language Models via Hierarchical Meta-Learning Optimization

ACL ARR 2025 February Submission 6959 Authors

16 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: The performance of large language models (LLMs) depends heavily on how they handle input data; improper handling can lead to undesirable outcomes, including amplified biases and unsafe behaviors. Current optimization techniques often neglect a model's underlying pre-training knowledge and treat each input independently, missing the potential for more efficient and safer learning. In this work, we present Learning to Safe Prompt (L2P), a novel approach that integrates hierarchical meta-learning with optimization strategies to enhance the safety and reliability of LLMs. L2P trains a model to adapt its responses through a meta-learning framework that jointly optimizes for performance and risk mitigation, ensuring that the model behaves safely across a wide range of inputs. Our extensive evaluation shows that L2P outperforms existing methods, significantly improving the safety and effectiveness of LLM responses while maintaining strong task performance.
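The abstract does not specify L2P's update rule, so the following toy PyTorch sketch is only one plausible reading of a meta-learning objective that "prioritizes both performance and risk mitigation": a MAML-style inner adaptation step followed by an outer update on a weighted sum of a task loss and a safety penalty. All names and losses here (task_loss, safety_loss, lambda_safety, inner_lr) are illustrative placeholders, not the authors' method or API.

```python
import torch

torch.manual_seed(0)

# Stand-in "prompt parameters" being meta-learned (e.g., soft-prompt
# embeddings); the paper's actual parameterization is not given.
prompt = torch.randn(8, requires_grad=True)
meta_opt = torch.optim.Adam([prompt], lr=1e-2)  # outer (meta) learning rate
inner_lr = 0.1          # inner adaptation step size (assumed)
lambda_safety = 0.5     # weight on the risk-mitigation term (assumed)

def task_loss(p, task):
    # Toy per-task performance objective: match a task-specific target.
    return (p - task).pow(2).mean()

def safety_loss(p):
    # Toy risk proxy: penalize large-magnitude prompt parameters.
    return p.pow(2).mean()

tasks = [torch.randn(8) for _ in range(4)]  # simulated input distribution

for step in range(100):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for task in tasks:
        # Inner loop: one gradient step of task adaptation (MAML-style),
        # keeping the graph so the outer update differentiates through it.
        inner = task_loss(prompt, task)
        grad, = torch.autograd.grad(inner, prompt, create_graph=True)
        adapted = prompt - inner_lr * grad
        # Outer objective on the adapted parameters: performance plus a
        # weighted safety penalty.
        meta_loss = meta_loss + task_loss(adapted, task) \
                    + lambda_safety * safety_loss(adapted)
    (meta_loss / len(tasks)).backward()
    meta_opt.step()
```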
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation, model bias/unfairness mitigation, ethical considerations in NLP applications
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 6959