Enhancing RL Safety with Counterfactual LLM Reasoning

Published: 2024 · Last Modified: 10 Jan 2025 · CoRR 2024 · License: CC BY-SA 4.0
Abstract: Reinforcement learning (RL) policies can exhibit unsafe behavior and are hard to explain. We use counterfactual large language model (LLM) reasoning to enhance the safety of RL policies post-training. We show that our approach improves RL policy safety and helps to explain it.
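One plausible reading of "counterfactual LLM reasoning post-training" is a runtime filter that, before executing the trained policy's action, asks an LLM a counterfactual question ("if the agent took this action in this state, would an unsafe outcome follow?") and substitutes a safe fallback when the answer is unsafe. The sketch below is a minimal illustration under that assumption, not the paper's actual method; `query_llm`, `CounterfactualSafetyFilter`, and the toy policy are all hypothetical placeholders.

```python
# Hypothetical sketch: a counterfactual LLM safety filter wrapped around a
# trained RL policy. All names here are illustrative, not from the paper.

from typing import Callable


def query_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; here a trivial keyword heuristic."""
    return "unsafe" if "collision" in prompt else "safe"


class CounterfactualSafetyFilter:
    """Screens a policy's action by posing a counterfactual question to an LLM."""

    def __init__(self, policy: Callable[[str], str], safe_fallback: str):
        self.policy = policy
        self.safe_fallback = safe_fallback

    def act(self, state: str) -> str:
        action = self.policy(state)
        # Counterfactual query: what would happen *if* this action were taken?
        prompt = (
            f"If the agent in state '{state}' took action '{action}', "
            f"would that lead to an unsafe outcome? Answer 'safe' or 'unsafe'."
        )
        verdict = query_llm(prompt)
        # The LLM's verdict both gates the action and doubles as a
        # natural-language rationale, which is one way the approach could
        # help explain policy safety.
        if verdict == "unsafe":
            return self.safe_fallback
        return action


# Usage: wrap a dummy policy that would steer toward an obstacle.
policy = lambda state: "accelerate toward collision"
agent = CounterfactualSafetyFilter(policy, safe_fallback="brake")
print(agent.act("approaching intersection"))  # -> "brake"
```

Because the filter acts only at decision time, it leaves the trained policy's weights untouched, which matches the abstract's framing of enhancing safety post-training.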