Keywords: Foundation Models, Information Theory, Alignment, AI Safety, Trustworthy ML
TL;DR: We implement an information elicitation mechanism via an LLM, establish truthfulness guarantees, and show promising preliminary results on machine translation and peer-review tasks.
Abstract: As language models become increasingly sophisticated, ensuring that their outputs are faithful to the input and that their reasoning is consistent across outputs is a critical challenge. To address the scalability issues in overseeing these properties, we propose a novel approach based on information-theoretic measures for detecting manipulated or unfaithful reasoning. Specifically, we introduce a Difference of Entropies (DoE) estimator that quantifies the difference in mutual information between outputs, providing a principled way to identify low-quality or inconsistent content. We analyze the DoE estimator theoretically, proving its incentive-compatibility properties and deriving scaling laws for f-mutual information as a function of sample size. Motivated by the theory, we implement the estimator with an LLM on a simple machine translation task and on a dataset of peer reviews from ICLR 2023, considering various manipulation types. Across these scenarios, the DoE estimator consistently assigns higher scores to unmodified reviews than to manipulated ones and correlates with BLEU, demonstrating its effectiveness in ensuring the reliability of language model reasoning. These results highlight the potential of information-theoretic approaches for scalable oversight of advanced AI systems.
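The abstract describes the DoE estimator only at a high level. Below is a minimal, hypothetical sketch of how a difference-of-entropies style score could be computed from LLM log-probabilities, using the identity I(X; Y) = H(Y) - H(Y|X); the `lm_logprob` callable and the reference-context averaging are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a Difference-of-Entropies (DoE) style score.
# Assumes access to a language model that returns the total
# log-probability of a completion `y` given a prompt `x` (e.g. via
# per-token logprobs); `lm_logprob` is a placeholder interface,
# not an API from the paper.

from typing import Callable, Sequence


def doe_score(
    x: str,
    y: str,
    references: Sequence[str],
    lm_logprob: Callable[[str, str], float],
) -> float:
    """Estimate I(X; Y) ~= H(Y) - H(Y | X) from LLM log-probabilities.

    The conditional entropy term uses the surprisal of y given the
    actual context x; the marginal entropy term is approximated by the
    average surprisal of y under unrelated reference contexts.
    Higher scores suggest y carries more information about x.
    """
    # Conditional term: -log p(y | x).
    h_y_given_x = -lm_logprob(x, y)

    # Marginal term: average surprisal of y under reference contexts,
    # a crude stand-in for -log p(y). `references` must be non-empty.
    h_y = -sum(lm_logprob(r, y) for r in references) / len(references)

    return h_y - h_y_given_x
```

Under this reading, a manipulated review or translation y should share less mutual information with its source x, and so receive a lower DoE score than the unmodified output.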
Submission Number: 36