Abstract: Recent advances in AI alignment techniques have significantly improved the alignment of large language models (LLMs) with static human preferences. However, human preferences evolve, which can render some prior training data outdated or even erroneous and ultimately cause LLMs to deviate from contemporary human preferences and societal norms. Existing remedies, curating new data for continual alignment or manually correcting outdated data for re-alignment, demand costly human effort. To address this, we propose a novel approach, LLM BehAvior Correction with INfluence FunCtion REcall and Post-Training (LANCET), which requires no human involvement. LANCET consists of two phases: (1) a new method, LinFAC, efficiently identifies the training data that most strongly influence undesirable model outputs, and (2) a novel Influence-driven Bregman Optimization (IBO) technique adjusts the model's outputs based on these influence distributions. Our experiments show that LANCET effectively and efficiently corrects inappropriate LLM behaviors while preserving model utility. Furthermore, LANCET generalizes better than all baselines to out-of-distribution harmful prompts, and offers stronger interpretability and compatibility with real-world applications of LLMs.
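To make the first phase concrete, the sketch below illustrates a generic first-order influence-scoring step (TracIn-style gradient alignment), which is one common way to trace an undesirable output back to training examples. It is only an illustrative approximation, not the paper's LinFAC, whose details are not given in this abstract; the toy model, data, and loss are placeholder assumptions.

```python
# Generic sketch of first-order influence scoring (gradient alignment),
# NOT the paper's LinFAC; model, data, and loss are illustrative placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny regression model standing in for an LLM (placeholder assumption).
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()

def flat_grad(loss):
    """Flatten gradients of `loss` w.r.t. all model parameters into one vector."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Hypothetical "undesirable output" example whose behavior we want to trace back.
x_bad, y_bad = torch.randn(1, 4), torch.randn(1, 1)
g_bad = flat_grad(loss_fn(model(x_bad), y_bad))

# Candidate training examples; score each by gradient alignment with the bad example.
train_set = [(torch.randn(1, 4), torch.randn(1, 1)) for _ in range(8)]
scores = []
for i, (x, y) in enumerate(train_set):
    g_train = flat_grad(loss_fn(model(x), y))
    # A large positive dot product suggests this training example pushed the
    # model toward the undesirable output.
    scores.append((i, torch.dot(g_bad, g_train).item()))

# Highest-scoring examples are the most "responsible" candidates for correction.
print(sorted(scores, key=lambda s: -s[1])[:3])
```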