Unsupervised Action-Policy Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components

Yamen Habib; Dmytro Grytskyy; Rubén Moreno-Bote

Published: 17 Jul 2025, Last Modified: 07 Oct 2025. Venue: EWRL 2025 Poster. License: CC BY 4.0

Keywords: Online unsupervised reinforcement learning, Maximum‐Occupancy Principle, Action-Policy Quantization

TL;DR: An online unsupervised RL method that partitions actions into specialized policies via a joint entropy objective, yielding interpretable, reusable behavior tokens in a task-agnostic framework.

Abstract: Most reinforcement learning approaches optimize behavior to maximize a task-specific reward. Policies trained this way rarely yield transferable, token-like behaviors that can be reused and composed to solve arbitrary downstream tasks. We introduce an online unsupervised reinforcement learning framework that autonomously quantizes the agent’s action space into component policies via a joint entropy objective: it maximizes the cumulative entropy of the overall mixture policy to ensure diverse, exploratory behavior under the maximum-occupancy principle, while minimizing the entropy of each component to enforce distinct, highly specialized components. Unlike existing approaches, our framework tackles action quantization into state-dependent component policies in a fully principled, unsupervised, online manner. We prove convergence in the tabular setting via a novel policy iteration algorithm, and then extend the approach to continuous control by fixing the discovered components and deploying them deterministically within an online optimizer that maximizes cumulative reward. Empirical results demonstrate that our maxi-mix-mini-com entropy-based action-policy quantization provides interpretable, reusable token-like behavioral patterns, yielding a fully online, task-agnostic, scalable architecture that requires no task-specific offline data and transfers readily across tasks. Project Page: https://yamenhabib.com/maxi-mix-mini-com/
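
To make the joint entropy idea concrete, the following is a minimal single-state sketch, not the authors' algorithm. It assumes a discrete action set, Shannon entropy in nats, and a hypothetical score `H(mixture) - beta * sum_k w_k H(component_k)`; the function names, the weight `beta`, and the restriction to one state (rather than cumulative entropy over trajectories) are all illustrative assumptions that do not come from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def mixture(weights, components):
    """Mixture policy at one state: pi(a) = sum_k w_k * pi_k(a)."""
    n_actions = len(components[0])
    return [
        sum(w * comp[a] for w, comp in zip(weights, components))
        for a in range(n_actions)
    ]


def joint_score(weights, components, beta=1.0):
    """Hypothetical score: mixture entropy minus weighted component entropy.

    This is an illustrative stand-in for a 'max-entropy mixture,
    min-entropy components' trade-off, not the objective from the paper.
    """
    h_mix = entropy(mixture(weights, components))
    h_comp = sum(w * entropy(c) for w, c in zip(weights, components))
    return h_mix - beta * h_comp


# Four actions, four equally weighted components.
weights = [0.25, 0.25, 0.25, 0.25]

# Specialized components: each one deterministically picks its own action.
one_hot = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Unspecialized components: every component is uniform over all actions.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]

score_specialized = joint_score(weights, one_hot)
score_unspecialized = joint_score(weights, uniform)
print(score_specialized, score_unspecialized)
```

Both configurations produce the same uniform mixture, so their mixture entropies are equal; only the specialized set has zero component entropy, and it therefore scores higher under this illustrative trade-off.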

Track: Regular Track: unpublished work

Submission Number: 30
