GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning

Published: 26 Jan 2026 · Last Modified: 26 Feb 2026 · ICLR 2026 Poster · License: CC BY 4.0
Keywords: inverse reinforcement learning, large language models, evolution
TL;DR: We introduce GRACE, a framework that uses code-generating language models within an evolutionary search to learn an interpretable reward function, expressed as executable Python code, directly from expert demonstrations.
Abstract: Inverse Reinforcement Learning aims to recover reward models from expert demonstrations, but traditional methods yield black-box models that are difficult to interpret and debug. In this work, we introduce GRACE (**G**enerating **R**ewards **A**s **C**od**E**), a method that uses Large Language Models within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically validate GRACE on the MuJoCo, BabyAI, and AndroidWorld benchmarks, where it efficiently learns highly accurate rewards, even in complex, multi-task settings. Further, we demonstrate that the resulting reward leads to policies competitive with both strong Imitation Learning baselines and online RL approaches trained with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.
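To make the core idea concrete, here is a minimal, purely illustrative sketch of what "rewards as code, selected by evolutionary search" could look like. All function names, the toy state features, and the fitness criterion (ranking expert trajectories above non-expert ones) are assumptions for illustration, not details taken from the paper; in GRACE the candidate reward programs would be proposed and refined by a language model rather than hand-written.

```python
# Illustrative sketch (not the paper's implementation): candidate reward
# functions are plain Python code, scored by how well they rank expert
# trajectories above non-expert ones; a selection step keeps the best one.

def candidate_reward_a(state):
    # Hypothetical candidate: reward forward progress.
    return state["x_velocity"]

def candidate_reward_b(state):
    # Hypothetical candidate: reward staying upright.
    return 1.0 if state["upright"] else -1.0

def trajectory_return(reward_fn, trajectory):
    # Sum the candidate reward over every state in a trajectory.
    return sum(reward_fn(s) for s in trajectory)

def fitness(reward_fn, expert_trajs, other_trajs):
    # Fraction of (expert, non-expert) pairs the reward ranks correctly.
    pairs = [(e, o) for e in expert_trajs for o in other_trajs]
    correct = sum(
        trajectory_return(reward_fn, e) > trajectory_return(reward_fn, o)
        for e, o in pairs
    )
    return correct / len(pairs)

# Toy data standing in for demonstrations and random rollouts.
expert_trajs = [[{"x_velocity": 1.0, "upright": True}] * 5]
other_trajs = [[{"x_velocity": -0.2, "upright": False}] * 5]

population = [candidate_reward_a, candidate_reward_b]
best = max(population, key=lambda f: fitness(f, expert_trajs, other_trajs))
```

Because the selected reward is ordinary code, it can be read, unit-tested, and debugged directly, which is the interpretability advantage the abstract highlights over black-box reward models.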
Primary Area: reinforcement learning
Submission Number: 12260