Keywords: Large language models, evaluation, alignment, safety standards, robustness, human values
Abstract: As Large Language Models (LLMs) become increasingly integrated into real-world applications, ensuring that their outputs align with human values, organizational norms, and safety standards has become a central pursuit in machine learning. The field has developed diverse alignment approaches, including traditional fine-tuning methods (e.g., RLHF, instruction tuning), post-hoc correction systems, and inference-time interventions, each with distinct advantages and limitations. However, the lack of unified evaluation frameworks makes it difficult to systematically compare these techniques and to guide implementation and deployment decisions. This paper introduces MEAL: A Multi-dimensional Evaluation of ALignment Techniques for LLMs, a comprehensive evaluation framework that enables systematic comparison across major alignment techniques. The framework assesses methods along four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. To demonstrate its utility, we run a series of experiments across diverse base models and alignment techniques. We describe these experiments and their results, identify the strengths and limitations of current state-of-the-art models, and provide insights into the trade-offs among these alignment techniques.
Submission Number: 114