Abstract: Large language models (LLMs) demonstrate significant generative capabilities but often face challenges in ethical alignment and robustness. Conventional alignment methods rely on extensive human-annotated data and require retraining, leading to high computational costs and resource demands. To address this, we propose a novel approach, Decoding Probability Correction (DPC), that aligns frozen LLMs without additional training or annotated data. DPC dynamically adjusts the probability distribution during inference, ensuring that the generated content aligns with human values in real time. Additionally, DPC incorporates a discriminator-based backtracking mechanism, further enhancing content safety by re-evaluating and refining generation choices. Experimental results on the HH and AdvBench datasets show that DPC significantly reduces harmful outputs while maintaining high levels of informativeness and helpfulness. The proposed method offers a cost-effective and efficient solution for enhancing the ethical alignment of LLMs in real-world applications.
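To make the two mechanisms named in the abstract concrete, the following is a minimal, self-contained sketch of inference-time probability correction with discriminator-based backtracking. It is not the paper's implementation: the toy vocabulary, the `base_next_token_probs` and `safety_score` stand-ins, and the parameters `alpha` and `threshold` are all hypothetical placeholders for the frozen LLM's next-token distribution and the safety discriminator, chosen only to illustrate the decode-rescale-backtrack loop.

```python
import math
import random

# Hypothetical toy vocabulary standing in for the LLM's token set.
VOCAB = ["hello", "world", "harm", "help", "<eos>"]

def base_next_token_probs(prefix):
    """Placeholder for the frozen LLM's next-token distribution."""
    logits = [1.0 + 0.1 * (hash((tuple(prefix), t)) % 7) for t in VOCAB]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

def safety_score(prefix):
    """Placeholder discriminator: 1.0 = safe, 0.0 = unsafe."""
    return 0.05 if "harm" in prefix else 0.95

def dpc_decode(max_len=10, alpha=2.0, threshold=0.5, seed=0):
    rng = random.Random(seed)
    prefix = []
    while len(prefix) < max_len:
        probs = base_next_token_probs(prefix)
        # Probability correction: rescale each token's probability by the
        # discriminator's safety score of the extended prefix, renormalize.
        corrected = [p * safety_score(prefix + [t]) ** alpha
                     for p, t in zip(probs, VOCAB)]
        z = sum(corrected)
        corrected = [c / z for c in corrected]
        token = rng.choices(VOCAB, weights=corrected)[0]
        prefix.append(token)
        # Backtracking: if the discriminator judges the new prefix unsafe,
        # drop the offending token and resample at the previous step.
        if safety_score(prefix) < threshold:
            prefix.pop()
            continue
        if token == "<eos>":
            break
    return " ".join(prefix)

print(dpc_decode())
```

In this sketch the base model stays frozen throughout; only the sampling distribution is reshaped at each step, which is what allows the approach to avoid retraining and annotated data, as the abstract claims.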