CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives

TMLR Paper4899 Authors

20 May 2025 (modified: 26 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: Grounding instructions in the environment is a key step in solving language-guided goal-reaching reinforcement learning problems. In automated reinforcement learning, a central concern is enhancing the model's ability to generalize across various tasks and environments. In goal-reaching scenarios, the agent must comprehend the different parts of the instruction within the environmental context to complete the overall task successfully. In this work, we propose \textbf{CAREL} (\textit{\textbf{C}ross-modal \textbf{A}uxiliary \textbf{RE}inforcement \textbf{L}earning}) as a new framework to solve this problem, using auxiliary loss functions inspired by the video-text retrieval literature and a novel method called instruction tracking, which automatically keeps track of progress in an environment. The results of our experiments suggest superior sample efficiency and systematic generalization for this framework in multi-modal reinforcement learning problems.
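The abstract mentions auxiliary losses inspired by the video-text retrieval literature. A standard objective in that literature is a symmetric InfoNCE contrastive loss that aligns paired text and trajectory embeddings. The sketch below illustrates that general idea in NumPy; it is an assumption about the family of losses meant, not the paper's exact objective, and the function name, shapes, and temperature value are illustrative.

```python
import numpy as np

def infonce_loss(text_emb, traj_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning instruction embeddings with
    trajectory embeddings, as in common video-text retrieval objectives.

    text_emb, traj_emb: (batch, dim) arrays where row i of each forms
    a matching instruction/trajectory pair.
    """
    # L2-normalize so dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature       # (batch, batch) similarity matrix
    idx = np.arange(len(logits))         # matching pairs lie on the diagonal

    def xent(l):
        # Row-wise cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Symmetric: text-to-trajectory and trajectory-to-text retrieval
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing such a loss pulls each instruction embedding toward its own trajectory and pushes it away from the other trajectories in the batch, which is one way to realize the cross-modal grounding the abstract describes.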
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Marlos_C._Machado1
Submission Number: 4899