Keywords: Soft Actor-Critic (SAC), Transformer-based Critic, Sequence Chunking, N-step Returns, Critic Alignment, Double Q-Learning, Deep Reinforcement Learning
TL;DR: T-SAC trains a transformer-based SAC critic on windowed sequence chunks with N-step TD returns and critic alignment, improving temporal credit assignment, stability, and sample efficiency on long-horizon, multi-phase, and sparse-reward tasks.
Abstract: We introduce a sequence-conditioned critic for Soft Actor--Critic (SAC) that models trajectory context with a lightweight Transformer and trains on aggregated $N$-step targets. Unlike prior approaches that (i) score state--action pairs in isolation or (ii) rely on actor-side action chunking to handle long horizons, our method strengthens the critic itself by conditioning on short trajectory segments and integrating multi-step returns---without importance sampling (IS). The resulting sequence-aware value estimates capture temporal structure critical for extended-horizon and sparse-reward problems. On locomotion benchmarks, we further show that freezing critic parameters for several steps makes our update compatible with CrossQ's core idea, enabling stable training without a target network. Despite its simplicity---a 2-layer Transformer with 128--256 hidden units and a maximum update-to-data (UTD) ratio of $1$---the approach consistently outperforms standard SAC and strong off-policy baselines, with particularly large gains on long-trajectory control. These results highlight the value of sequence modeling and $N$-step bootstrapping on the critic side for long-horizon reinforcement learning.
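To make the two core ingredients of the abstract concrete, here is a minimal sketch (not the authors' implementation) of a sequence-conditioned Q-critic over a short window of state--action pairs and an aggregated $N$-step TD target computed without importance sampling. It assumes PyTorch; the class names, window length `W`, and tensor shapes are illustrative assumptions only.

```python
"""Minimal sketch of a sequence-conditioned critic with N-step TD targets.
Hypothetical names and shapes; not the paper's reference code."""
import torch
import torch.nn as nn


class SequenceCritic(nn.Module):
    """2-layer Transformer encoder over a window of (state, action) pairs."""

    def __init__(self, state_dim, action_dim, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=2 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.q_head = nn.Linear(d_model, 1)

    def forward(self, states, actions):
        # states: [B, W, state_dim], actions: [B, W, action_dim] (W = chunk length)
        x = self.embed(torch.cat([states, actions], dim=-1))
        h = self.encoder(x)                        # causal masking omitted for brevity
        return self.q_head(h[:, -1]).squeeze(-1)   # Q-value at the last step of the chunk


def n_step_target(rewards, dones, bootstrap_q, gamma=0.99):
    """Aggregated N-step return: r_t + gamma*r_{t+1} + ... + gamma^N * Q(s_{t+N}, a_{t+N}).

    rewards, dones: [B, N]; bootstrap_q: [B] (e.g., the entropy-regularized
    Q estimate at the end of the chunk). No importance-sampling correction.
    """
    target = bootstrap_q
    for k in reversed(range(rewards.shape[1])):
        target = rewards[:, k] + gamma * (1.0 - dones[:, k]) * target
    return target
```

In this reading, the critic's TD loss would regress `SequenceCritic(states, actions)` toward `n_step_target(...)`; how the chunks are windowed, how the bootstrap value is formed, and how the CrossQ-style target-free update is scheduled are design choices specified in the paper, not here.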
Primary Area: reinforcement learning
Submission Number: 526