Content-free Logical Modification of Large Language Models by Disentangling and Modifying Logic Representations
Abstract: Despite extensive training on diverse datasets and alignment with human values, large language models (LLMs) can still generate fallacious outputs. Moreover, the logical validity of LLMs' outputs varies significantly with the content they reason over, so it is crucial to ensure their logical consistency across different contexts. Drawing inspiration from studies in cognitive psychology, we propose a Logic Control Framework (LCF) that disentangles LLMs' hidden representations into separate content and logic spaces. Within the logic space, we use logically valid and invalid samples to construct distinct regions through contrastive learning. By moving logic representations into the logically valid region and fusing them with the unchanged content representations, we substantially reduce logical fallacies in LLM outputs while maintaining content coherence. We demonstrate the effectiveness of LCF on conclusion-generation and fallacy-identification tasks, showing a significant improvement in logical validity and a reduction in fallacious outputs.
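To make the disentangle-edit-fuse idea concrete, the sketch below shows one possible reading of the abstract in PyTorch. It is a minimal illustration, not the paper's actual architecture: the class name `LogicControlSketch`, the linear projection heads, the margin-based contrastive loss, the interpolation weight `alpha`, and the use of a mean `valid_centroid` as the target of the logic edit are all assumptions introduced here for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LogicControlSketch(nn.Module):
    """Illustrative disentangle-and-edit module (hypothetical, not the paper's exact design)."""

    def __init__(self, hidden_dim: int, logic_dim: int = 64):
        super().__init__()
        # Hypothetical projections splitting a hidden state into content and logic parts.
        self.to_content = nn.Linear(hidden_dim, hidden_dim)
        self.to_logic = nn.Linear(hidden_dim, logic_dim)
        # Hypothetical fusion layer mapping (content, edited logic) back to the hidden space.
        self.fuse = nn.Linear(hidden_dim + logic_dim, hidden_dim)

    def disentangle(self, h: torch.Tensor):
        return self.to_content(h), self.to_logic(h)

    def contrastive_logic_loss(self, z_valid, z_invalid, margin: float = 1.0):
        # Pull logically valid samples toward each other, push invalid ones away (margin-based).
        pos = F.pairwise_distance(z_valid, z_valid.roll(1, dims=0)).mean()
        neg = F.pairwise_distance(z_valid, z_invalid).mean()
        return F.relu(pos - neg + margin)

    def edit_and_fuse(self, h: torch.Tensor, valid_centroid: torch.Tensor, alpha: float = 0.5):
        content, logic = self.disentangle(h)
        # Move the logic representation toward the valid region; leave content unchanged.
        logic_edited = (1 - alpha) * logic + alpha * valid_centroid
        return self.fuse(torch.cat([content, logic_edited], dim=-1))


if __name__ == "__main__":
    torch.manual_seed(0)
    lcf = LogicControlSketch(hidden_dim=768)
    h_valid, h_invalid = torch.randn(8, 768), torch.randn(8, 768)
    z_valid, z_invalid = lcf.to_logic(h_valid), lcf.to_logic(h_invalid)
    loss = lcf.contrastive_logic_loss(z_valid, z_invalid)      # shapes the logic space
    edited = lcf.edit_and_fuse(h_valid, valid_centroid=z_valid.mean(0))
    print(loss.item(), edited.shape)
```

In this reading, the contrastive loss carves the logic space into valid and invalid regions during training, while at inference time only the logic component of a hidden state is shifted toward the valid region before being fused back with the untouched content component.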