Algorithmic Oversight for Deceptive Reasoning

Published: 09 Oct 2024 · Last Modified: 03 Jan 2025 · Red Teaming GenAI Workshop @ NeurIPS'24 Poster · CC BY 4.0
Keywords: oversight, deceptive reasoning, LLMs, alignment
TL;DR: This paper studies the oversight problem for LLMs on mathematical reasoning tasks.
Abstract: This paper investigates the oversight problem, in which a large language model (LLM) produces output that may contain deliberate adversarial errors and an oversight LLM/agent aims to detect them. We study this question in the context of mathematical reasoning. Our study proceeds in two steps. First, we develop attack strategies aimed at inducing deliberate reasoning errors that could deceive the oversight agent; here we find that even strong models can be deceived, highlighting the need for defense mechanisms. Second, we propose a set of defense mechanisms that augment oversight capabilities to protect against these attacks; we find that structured prompting, fine-tuning, and greybox access noticeably improve detection accuracy. In particular, we introduce ProbShift, a novel algorithm that uses the token probabilities of the generated text for detection. ProbShift can outperform GPT-3.5 and can be boosted further with LLM-based oversight. Overall, this work demonstrates the feasibility and importance of developing algorithmic oversight mechanisms for LLMs, with an emphasis on complex tasks requiring logical and mathematical reasoning.
Serve As Reviewer: egetaga@umich.edu
Submission Number: 18
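
The abstract describes ProbShift only at a high level (detection based on the token probabilities of the generated text), so the sketch below is a hypothetical illustration of that general idea rather than the paper's algorithm. It flags sliding windows whose average log-probability drops sharply below the sequence-level baseline, on the assumption that a deliberately inserted reasoning error is locally less likely under the generating model. The function name flag_probability_shift, the window size, and the z-score threshold are all illustrative assumptions.

    from statistics import mean, stdev

    def flag_probability_shift(token_logprobs, window=8, z_threshold=-2.0):
        """Flag token windows whose average log-probability drops sharply
        below the sequence baseline (hypothetical ProbShift-style sketch)."""
        if len(token_logprobs) < window + 1:
            return []  # too short to estimate a baseline
        # Sliding-window means over per-token log-probabilities.
        window_means = [
            mean(token_logprobs[i:i + window])
            for i in range(len(token_logprobs) - window + 1)
        ]
        mu, sigma = mean(window_means), stdev(window_means)
        if sigma == 0:
            return []  # perfectly flat sequence, nothing stands out
        # A window is suspicious if its mean sits far below the baseline.
        return [
            (i, i + window)
            for i, m in enumerate(window_means)
            if (m - mu) / sigma < z_threshold
        ]

    # Toy usage: a confidently generated solution with one unlikely span,
    # as might result from a deliberately inserted reasoning error.
    logprobs = [-0.1] * 30 + [-3.5] * 8 + [-0.1] * 30
    print(flag_probability_shift(logprobs))
    # Flags the overlapping windows covering the low-probability span.

In practice, flagged token spans would be mapped back to the reasoning steps they cover and passed to an LLM overseer for closer inspection, in line with the abstract's observation that ProbShift can be boosted with LLM-based oversight.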