Meta-Reinforcement Learning with Adaptation from Human Feedback via Preference-Order-Preserving Task Embedding

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: POEM enables few-shot policy adaptation from human preferences by learning a task embedding space that preserves preference order.
Abstract: This paper studies meta-reinforcement learning with adaptation from human feedback. The goal is to pre-train a meta-model that achieves few-shot adaptation to new tasks from human preference queries, without relying on reward signals. To solve this problem, we propose the framework *adaptation via Preference-Order-preserving EMbedding* (POEM). During meta-training, the framework learns a task encoder, which maps tasks to a preference-order-preserving task embedding space, and a decoder, which maps the embeddings to task-specific policies. During adaptation from human feedback, the task encoder enables efficient inference of a new task's embedding from preference queries, from which the decoder produces the task-specific policy. We provide a theoretical guarantee that the adaptation process converges to the task-specific optimal policy and experimentally demonstrate state-of-the-art performance, with substantial improvement over baseline methods.
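The abstract's adaptation step — recovering a new task's embedding from preference queries alone — can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `infer_task_embedding`, the use of fixed behavior embeddings, and the Bradley-Terry preference model are all illustrative assumptions. The sketch fits an embedding `z` so that behaviors humans preferred score higher under `z`, which is one simple way an order-preserving embedding space makes preference-based inference tractable.

```python
import numpy as np

def infer_task_embedding(pairs, behavior_emb, dim, lr=0.1, steps=500):
    """Infer a task embedding z from pairwise human preferences.

    pairs: list of (i, j) meaning behavior i was preferred over behavior j.
    behavior_emb: (n, dim) array of embeddings for the queried behaviors
                  (a hypothetical stand-in for a pre-trained encoder's outputs).

    Fits z by gradient ascent on a Bradley-Terry log-likelihood, where a
    behavior's score is its dot product with z. This is an illustrative
    assumption, not POEM's actual objective.
    """
    rng = np.random.default_rng(0)
    z = rng.normal(scale=0.01, size=dim)  # small random initialization
    for _ in range(steps):
        grad = np.zeros(dim)
        for i, j in pairs:
            diff = behavior_emb[i] - behavior_emb[j]
            p = 1.0 / (1.0 + np.exp(-diff @ z))  # P(i preferred over j)
            grad += (1.0 - p) * diff             # gradient of log-likelihood
        z += lr * grad / len(pairs)
    return z

# Toy usage: three behaviors with 2-D embeddings; the human prefers 0 > 1 > 2.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
z = infer_task_embedding([(0, 1), (0, 2), (1, 2)], emb, dim=2)
```

After fitting, the scores `emb @ z` respect the queried preference order, so the inferred `z` can be handed to a decoder to select a task-specific policy.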
Lay Summary: This paper introduces a new way to teach AI systems how to quickly adapt to new tasks using feedback from humans instead of complex programming or reward setups. The method helps the AI learn patterns across many training tasks, so that when it faces a new task, it can understand what to do just by comparing options that people prefer. This makes the training process much faster and more efficient, especially in situations where it’s hard to define what success looks like. The approach shows strong results in robotic simulations, performing as well or better than existing methods while using much less human input.
Primary Area: Reinforcement Learning
Keywords: Meta-RL, preference-based RL, RLHF
Submission Number: 3396