Keywords: Large Language Models, alignment, human values, safety standards, robustness, model evaluation
TL;DR: This paper introduces MEAL, a framework for systematically evaluating and comparing alignment techniques for Large Language Models along four dimensions: alignment detection, alignment quality, computational efficiency, and robustness.
Abstract: As Large Language Models (LLMs) become increasingly integrated into real-world applications, ensuring that their outputs align with human values, organizational norms, and safety standards has become a central pursuit in machine learning. The field has developed diverse alignment approaches, including traditional fine-tuning methods (e.g., RLHF, instruction tuning), post-hoc correction systems, and inference-time interventions, each with distinct advantages and limitations. However, the lack of a unified evaluation framework makes it difficult to systematically compare these techniques and thereby guide implementation and deployment decisions. This paper introduces MEAL, a Multi-dimensional Evaluation of ALignment Techniques for LLMs: a comprehensive framework that enables systematic comparison across these major alignment techniques. The framework assesses methods along four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. To demonstrate its utility, we conduct a series of experiments across diverse base models and alignment techniques. We describe these experiments and their results, identify the strengths and limitations of current state-of-the-art models, and provide insights into the trade-offs among these alignment techniques.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13138