Abstract: Evaluating Large Language Models (LLMs) requires effective methods to assess how well they maintain semantic consistency across multiple transformations. Traditional approaches such as self-consistency often fail to capture the subtle semantic errors that emerge during multi-step tasks. To address this, we introduce ConsistencyChecker, a benchmark-free framework for evaluating LLMs' ability to preserve semantic consistency throughout multi-step processes. Our approach builds on the concept of a self-consistency tree, in which each node represents a state of the text after a transformation (e.g., translation, code modification, paraphrasing) and each edge represents the transformation itself. By constructing self-consistency trees, we measure how accurately the model maintains the original meaning across these changes. ConsistencyChecker quantifies an LLM's reliability in retaining critical information by analyzing semantic preservation between nodes at different tree depths, providing insight into model generalization without requiring extensive resources. Experiments show that ConsistencyChecker accurately measures the generalization ability of models of various sizes on translation and coding tasks without the need to build dedicated benchmarks. By identifying scenarios where models maintain or lose semantic fidelity, ConsistencyChecker offers a practical tool for understanding LLM robustness in real-world applications.
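To make the self-consistency tree described above concrete, the following is a minimal Python sketch (not the authors' released implementation): `build_tree`, `consistency_score`, the `transforms` list, and the `similarity` function are hypothetical placeholders standing in for LLM-driven transformations (e.g., translation round-trips) and an embedding-based semantic similarity measure.

```python
# Minimal sketch of a self-consistency tree: each node holds a text state,
# each edge applies a transformation such as translation or paraphrasing.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    depth: int
    children: List["Node"] = field(default_factory=list)


def build_tree(root_text: str,
               transforms: List[Callable[[str], str]],
               max_depth: int) -> Node:
    """Expand every node with every transformation up to max_depth."""
    root = Node(text=root_text, depth=0)
    frontier = [root]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for node in frontier:
            for transform in transforms:
                child = Node(text=transform(node.text), depth=depth)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root


def consistency_score(root: Node,
                      similarity: Callable[[str, str], float]) -> float:
    """Average semantic similarity between the root state and all descendants."""
    stack, scores = [root], []
    while stack:
        node = stack.pop()
        for child in node.children:
            scores.append(similarity(root.text, child.text))
            stack.append(child)
    return sum(scores) / len(scores) if scores else 1.0
```

Under these assumptions, a score near 1.0 indicates that meaning is preserved across transformation chains, while lower scores expose the depth at which semantic fidelity degrades.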
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Interpretability and Analysis of Models for NLP, Resources and Evaluation, Machine Learning for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English, German, Spanish, French
Submission Number: 8078