Keywords: Model editing
Abstract: Large language models (LLMs) exhibit exceptional performance across various domains, and model editing holds significant potential to improve LLM safety and mitigate issues such as hallucinations.
Existing model editing methods either modify the original hidden states directly to integrate new knowledge, which can lead to the accumulation of conflicts,
or add extra parameters to achieve stable knowledge updates, which significantly increases computational cost.
To address these challenges, we propose AGRADE, a method that Adaptively guides the GRADient to compute Editing weights in alignment with the desired direction. By leveraging gradient differences across modules, AGRADE effectively reduces redundant-information interference between adjacent modules while controlling computational overhead and improving editing precision.
We theoretically prove the effectiveness of AGRADE and conduct extensive experiments across multiple LLMs and datasets. The results show an average improvement of over 4% on three metrics, with an overall score increase of 11.98%. We will release the code publicly.
Submission Number: 16