Unsupervised Behavioral Tokenization and Action Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components

ICLR 2026 Conference Submission 14016 Authors

18 Sept 2025 (modified: 02 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Online unsupervised reinforcement learning, Maximum Occupancy Principle, Behavioral Tokenization, Action Quantization
TL;DR: Learn a concise, unsupervised, task-agnostic discrete vocabulary of short-lived, state-dependent behavioral tokens for continuous control so they can be composed to solve downstream tasks.
Abstract: A fundamental problem in reinforcement learning is how to learn a concise discrete set of behaviors that can be easily composed to solve any downstream task. An effective "tokenization" of behavior requires tokens to be transferable, learnable in an unsupervised manner, and short-lived to facilitate composability, akin to tokens in large language models. Previous work has studied this problem using action quantization, sub-policies, options, and skills, but no solution provides an unsupervised, task-agnostic method to discover short-lived, action-spanning behavioral tokens. We introduce an online unsupervised framework that autonomously discovers behavioral tokens via a joint entropy objective: it maximizes the cumulative entropy of an overall mixture policy to ensure rich, exploratory behavior under the maximum occupancy principle, while minimizing the entropy of each component policy to enforce diversity and high specialization. We prove convergence of our policy iteration algorithm in the tabular setting, then extend to continuous control by fixing the discovered components and deploying them deterministically as quantized actions within an online optimizer to maximize reward. Experiments demonstrate that our "maxi-mix-mini-com" entropy-based tokenization and action quantization provide interpretable and reusable behavioral tokens that can be sequenced to achieve competitive performance against continuous-action controllers and that substantially outperform random action quantization and skill discovery methods.
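A minimal sketch of the joint entropy objective, for concreteness only: assuming a state-dependent mixture $\pi(a\mid s)=\sum_{k=1}^{K} w_k(s)\,\pi_k(a\mid s)$ over $K$ component policies, where the weights $w_k$, trade-off coefficient $\beta>0$, and discount $\gamma$ are illustrative notation rather than the paper's own symbols, the objective could take the form

$$J(\pi)=\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\Big(\mathcal{H}\big(\pi(\cdot\mid s_t)\big)-\beta\sum_{k=1}^{K} w_k(s_t)\,\mathcal{H}\big(\pi_k(\cdot\mid s_t)\big)\Big)\right].$$

Maximizing $J$ rewards a high-entropy mixture (the maximum occupancy term) while the penalty pushes each component toward low entropy, i.e., toward a specialized, near-deterministic behavioral token.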
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 14016