Keywords: Offline reinforcement learning, latent actions, identifiability, action-free data, demonstrator diversity
Abstract: A central bottleneck in transitioning from offline representation learning to online decision-making is the “action gap”: passive datasets (e.g., online videos) often lack the action labels required to ground latent representations in environment dynamics. We propose to bridge this gap by exploiting demonstrator diversity. Even when actions are unobserved, systematic variation in transitions across demonstrators can help disentangle latent action choice from environment stochasticity. We formalize this as a statewise column-stochastic non-negative matrix factorization (NMF) problem, where demonstrator-specific policies act as mixtures over shared latent transition kernels. Under “sufficiently scattered” policy diversity and rank conditions, we prove that latent actions and dynamics are identifiable up to a permutation. We extend these results to continuous observation spaces via a Gram-determinant minimum-volume criterion and prove that spatial continuity ensures a globally consistent action labeling. Our framework shows how heterogeneity in passive data can substitute for missing action labels, leaving limited interaction to resolve only the final action-label alignment.
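The statewise factorization can be made concrete with a small toy instance. The sketch below is hypothetical: the sizes, the numbers, and every function name (`matmul`, `gram_volume`, `inv2`, etc.) are illustrative assumptions, not the paper's construction. At one fixed state s, each latent action's next-state distribution T(s' | s, a) is a column of a column-stochastic matrix T, and each demonstrator's policy π_d(a | s) is a column of Π, so the action-free observations are P = TΠ, i.e. P(s' | s, d) = Σ_a π_d(a | s) T(s' | s, a). The code checks two points the abstract alludes to: relabeling the latent actions leaves P unchanged (the permutation ambiguity), and mixing the kernels shrinks the Gram-determinant volume det(TᵀT) but forces the compensating "policies" off the simplex when some demonstrators are pure (an extreme case of scattered diversity). It is only a discrete, two-action illustration; the paper's continuous-space criterion may be formulated differently.

```python
"""Toy sketch of a statewise column-stochastic NMF at a single state s.

Hypothetical illustration: names and numbers are made up; only the model
structure (observations = kernels x policies) follows the abstract.
"""


def matmul(A, B):
    """Plain-list matrix product A @ B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(M):
    return [list(col) for col in zip(*M)]


def is_column_stochastic(M, tol=1e-9):
    """Non-negative entries and every column summing to 1."""
    nonneg = all(x >= -tol for row in M for x in row)
    sums_to_one = all(abs(sum(row[j] for row in M) - 1.0) < tol
                      for j in range(len(M[0])))
    return nonneg and sums_to_one


def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def inv2(M):
    """Inverse of an invertible 2x2 matrix."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d, M[0][0] / d]]


def gram_volume(W):
    """det(W^T W) for a two-column W: squared volume spanned by its columns."""
    return det2(matmul(transpose(W), W))


# T[s'][a] = P(s' | s, a): column a is latent action a's transition kernel.
T = [[0.7, 0.1],
     [0.2, 0.3],
     [0.1, 0.6]]
# Pi[a][d] = pi_d(a | s): column d is demonstrator d's policy at s.
# Demonstrators 0 and 1 are "pure"; demonstrator 2 mixes evenly.
Pi = [[1.0, 0.0, 0.5],
      [0.0, 1.0, 0.5]]
# Action-free observations: P[s'][d] = sum_a T[s'][a] * Pi[a][d].
P = matmul(T, Pi)

# 1) Permutation ambiguity: relabel the latent actions consistently.
perm = [1, 0]
T_perm = [[row[k] for k in perm] for row in T]
Pi_perm = [Pi[k] for k in perm]
P_perm = matmul(T_perm, Pi_perm)

# 2) Mixing kernels with a non-permutation column-stochastic Q shrinks the
#    Gram volume by det(Q)^2, but Q^{-1} Pi then has negative entries.
Q = [[0.8, 0.2],
     [0.2, 0.8]]
T_mixed = matmul(T, Q)
Pi_mixed = matmul(inv2(Q), Pi)
P_mixed = matmul(T_mixed, Pi_mixed)

if __name__ == "__main__":
    print("P =", P)
    print("vol(T) =", gram_volume(T))
    print("vol(T_perm) =", gram_volume(T_perm))
    print("vol(T_mixed) =", gram_volume(T_mixed))
    print("min entry of Pi_mixed =", min(min(r) for r in Pi_mixed))
```

In this toy case the mixed pair (T_mixed, Pi_mixed) reproduces P exactly and has smaller volume, yet it is ruled out because Pi_mixed is not a valid set of policies; only relabelings of T survive both constraints.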
Submission Number: 92