Track: Research Track
Keywords: Human Intent Alignment, Reward Inference, Reinforcement Learning, Vision–Language Models
Abstract: AI systems must act in ways that reflect human values and intentions, yet defining suitable reward signals remains a major challenge. Reinforcement learning enables agents to learn through trial and error, powering systems such as AlphaGo to superhuman performance. However, the common assumption that agents learn from a single reward function provided by the environment is often unrealistic beyond controlled benchmarks, and hand‑crafted rewards can be brittle or misaligned with human intent. To address these alignment challenges, we propose Inference-Based Reinforcement Learning (InfeRL), a framework for training agents with rewards inferred to match human goals. InfeRL allows an agent to infer its own reward by comparing its behavior to a high‑level goal. Goals can be expressed in natural language and interpreted through a vision–language model. This removes the need for explicit environment rewards and instead emphasizes semantic alignment with human‑described success. We evaluate InfeRL on standard Gymnasium environments, which provide clear ground‑truth rewards for comparison. InfeRL achieves performance close to that of agents trained with environment rewards, while following tasks described in natural language rather than relying on handcrafted signals. It supports novel instructed behaviors, such as rotating or walking, purely from language goals, and handles multi-objective instructions involving spatial reasoning. This work represents a step toward reinforcement learning agents that are transparent, adaptable, and aligned with human values.
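The following is a minimal sketch of the reward-inference idea described in the abstract, assuming a Gymnasium environment rendered as RGB frames and a hypothetical scoring function `vlm_goal_score` standing in for the vision–language model; it illustrates the described mechanism under those assumptions and is not the submission's implementation.

```python
# Sketch: replace the environment reward with a reward inferred from a
# natural-language goal, as described in the abstract. `vlm_goal_score`
# is a hypothetical placeholder for a vision-language model call.
import gymnasium as gym
import numpy as np


def vlm_goal_score(frame: np.ndarray, goal_text: str) -> float:
    """Hypothetical VLM call: score how well a rendered frame matches the goal.

    A real implementation would embed the frame and the goal text with a
    vision-language model and return a similarity score, e.g. in [0, 1].
    """
    raise NotImplementedError("plug in a vision-language model here")


class InferredRewardWrapper(gym.Wrapper):
    """Gymnasium wrapper that swaps the environment reward for a
    goal-alignment score inferred from rendered frames."""

    def __init__(self, env: gym.Env, goal_text: str):
        super().__init__(env)
        self.goal_text = goal_text

    def step(self, action):
        obs, _env_reward, terminated, truncated, info = self.env.step(action)
        frame = self.env.render()  # requires render_mode="rgb_array"
        inferred_reward = vlm_goal_score(frame, self.goal_text)
        return obs, inferred_reward, terminated, truncated, info


# Usage: any standard RL algorithm can then be trained on the wrapped
# environment, e.g.
# env = InferredRewardWrapper(
#     gym.make("HalfCheetah-v4", render_mode="rgb_array"),
#     goal_text="the robot walks forward smoothly",
# )
```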
Submission Number: 115