Abstract: Evaluating Large Language Models (LLMs) requires effective methods to assess how well they maintain semantic consistency across multiple transformations. Traditional approaches such as self-consistency often fail to capture the subtle semantic errors that emerge during multi-step tasks. To address this, we introduce ConsistencyChecker, a benchmark-free framework for evaluating LLMs' ability to preserve semantic consistency throughout multi-step processes. Our approach builds on the concept of a self-consistency tree, in which each node represents a state of the text after a transformation (e.g., translation, code modification, paraphrasing) and each edge represents the transformation itself. By constructing self-consistency trees, we measure how accurately the model maintains the original meaning across these changes. ConsistencyChecker quantifies an LLM's reliability in retaining critical information by analyzing semantic preservation between nodes at different tree depths, providing insight into model generalization without requiring extensive resources. Experiments show that ConsistencyChecker accurately measures the generalization ability of models of various sizes on translation and coding tasks without the need to build dedicated benchmarks. By identifying scenarios where models maintain or lose semantic fidelity, ConsistencyChecker offers a practical tool for understanding LLM robustness in real-world applications.
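To make the self-consistency tree described above concrete, the following is a minimal Python sketch (not the authors' released implementation): `build_tree`, `consistency_score`, the `transforms` list, and the `similarity` function are hypothetical placeholders standing in for LLM-driven transformations (e.g., translation round-trips) and an embedding-based semantic similarity measure.

```python
# Minimal sketch of a self-consistency tree: each node holds a text state,
# each edge applies a transformation such as translation or paraphrasing.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    depth: int
    children: List["Node"] = field(default_factory=list)


def build_tree(root_text: str,
               transforms: List[Callable[[str], str]],
               max_depth: int) -> Node:
    """Expand every node with every transformation up to max_depth."""
    root = Node(text=root_text, depth=0)
    frontier = [root]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for node in frontier:
            for transform in transforms:
                child = Node(text=transform(node.text), depth=depth)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root


def consistency_score(root: Node,
                      similarity: Callable[[str, str], float]) -> float:
    """Average semantic similarity between the root state and all descendants."""
    stack, scores = [root], []
    while stack:
        node = stack.pop()
        for child in node.children:
            scores.append(similarity(root.text, child.text))
            stack.append(child)
    return sum(scores) / len(scores) if scores else 1.0
```

Under these assumptions, a score near 1.0 indicates that meaning is preserved across transformation chains, while lower scores expose the depth at which semantic fidelity degrades.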
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Interpretability and Analysis of Models for NLP, Resources and Evaluation, Machine Learning for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English, German, Spanish, French
Submission Number: 8078