Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner, improving policy learning in continuous control tasks especially with suboptimal data.
Abstract: Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to “kick-start” training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstrations and data collected online during training. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average 1.62× performance improvement over the state-of-the-art value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data.
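To make the two ideas in the abstract concrete, below is a minimal illustrative sketch (not the authors' implementation) of coarse-to-fine action discretization combined with auto-regressive per-dimension advantage heads. All names and hyperparameters here (e.g., `AutoRegressiveCoarseToFinePolicy`, `n_bins`, `n_levels`) are assumptions made for this example; the paper's actual architecture and training objective may differ.

```python
# Illustrative sketch only: coarse-to-fine bins per action dimension, with each
# dimension's advantage head conditioned on the partially constructed action.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AutoRegressiveCoarseToFinePolicy(nn.Module):
    """Select a continuous action dimension-by-dimension, level-by-level.

    At each hierarchy level, the current interval of every action dimension is
    split into `n_bins` sub-intervals; an advantage head scores the bins
    conditioned on the state and on all previously chosen (coarser and
    earlier-dimension) bins, and the chosen bin becomes the interval that is
    refined at the next level.
    """

    def __init__(self, state_dim, action_dim, n_bins=5, n_levels=2, hidden=256):
        super().__init__()
        self.action_dim, self.n_bins, self.n_levels = action_dim, n_bins, n_levels
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # One advantage head per (level, dimension); each head also sees the
        # partially constructed action, which makes the prediction auto-regressive.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden + action_dim, n_bins) for _ in range(n_levels * action_dim)]
        )

    def forward(self, state, temperature=1.0):
        h = self.state_enc(state)                                  # (B, hidden)
        batch = state.size(0)
        low = -torch.ones(batch, self.action_dim)                  # action range [-1, 1]
        high = torch.ones(batch, self.action_dim)
        partial = torch.zeros(batch, self.action_dim)              # action built so far
        for level in range(self.n_levels):
            for d in range(self.action_dim):
                head = self.heads[level * self.action_dim + d]
                adv = head(torch.cat([h, partial], dim=-1))        # (B, n_bins)
                # Soft (Boltzmann) bin selection, in the spirit of soft Q-learning.
                probs = F.softmax(adv / temperature, dim=-1)
                bin_idx = torch.multinomial(probs, 1).squeeze(-1)  # (B,)
                # Refine this dimension's interval to the chosen bin.
                width = (high[:, d] - low[:, d]) / self.n_bins
                low[:, d] = low[:, d] + bin_idx * width
                high[:, d] = low[:, d] + width
                partial[:, d] = (low[:, d] + high[:, d]) / 2       # bin center
        return partial


# Usage: sample actions for a batch of states.
policy = AutoRegressiveCoarseToFinePolicy(state_dim=17, action_dim=6)
actions = policy(torch.randn(4, 17))   # shape (4, 6), values in [-1, 1]
```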
Lay Summary: Teaching robots new skills is most effective when they can learn both from trial and error and from watching humans. However, people don’t always demonstrate tasks perfectly, which makes it hard for robots to know the best way to act, slowing down their learning. To tackle this, we created a new method that helps robots learn effectively even from less-than-perfect examples. Our approach lets robots choose their actions in stages, starting with simple decisions and gradually making them more precise, much like learning a dance step by step. We also designed our method so robots can learn each part of a movement in order, such as moving each joint one at a time, which helps them better understand and use messy demonstrations. When we tested this approach on popular robot learning tasks, it outperformed existing methods, especially when the training data was flawed. This means robots can pick up useful skills faster and with fewer mistakes, even if the examples they learn from aren’t perfect.
Link To Code: https://sites.google.com/view/ar-soft-q
Primary Area: Reinforcement Learning->Deep RL
Keywords: Value-based Reinforcement Learning, Continuous Control, Auto-Regressive, Suboptimal Demonstration
Submission Number: 9509