\section{You \emph{can} have an appendix here.}



\end{document}
\section{Additional Details}

\subsection{Explanation of Game Objectives}
\label{game_objectives}
\textbf{Skiing.} The player controls the direction and speed of a skier and must avoid obstacles, such as trees and moguls. The goal is to reach the end of the course as rapidly as possible, but the skier must pass through a series of gates (indicated by a pair of closely spaced flags). Missed gates count as a penalty against the player's time.

\textbf{Tennis.} The player plays the game of tennis. When serving and returning shots, the tennis players automatically swing forehand or backhand as the situation demands, and all shots automatically clear the net and land in bounds.

\textbf{Ms. Pac-Man.}
The player earns points by eating pellets and avoiding ghosts, which contacting them results in losing a life. Eating a ``power pellet'' causes the ghosts to turn blue, allowing them to be eaten for extra points. Bonus fruits can be eaten for increasing point values, twice per round.

\textbf{Breakout} Using a single ball, the player must use the paddle to knock down as many bricks as possible. If the player's paddle misses the ball's rebound, the player will lose a life.

\subsection{Outputs of QA Extraction Module}
\begin{table}[h]
{\centering
\begin{tabular}{l|m{20em} @{\hskip 0.3in} m{20em}}
Game  & \multicolumn{1}{c}{Wikipedia} & \multicolumn{1}{c}{Official Manual}\\
\hline
& \multicolumn{2}{c}{What is the objective of the game?}\\
\cline{2-3}
\multirow{8}{3em}{Skiing} & To reach the bottom of the ski course as rapidly as possible. & To reach the bottom of the hill in the fastes time Have fun and god bless!".\\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{How to succeed in the game?}\\
\cline{2-3}
& N/A & Learning to control the tips of your skis and anticipating and avoiding trouble. \\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{How to score at the game?}\\
\cline{2-3}
& N/A & Elapsed time at the end of the run.\\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{Who are your enemies?}\\
\cline{2-3}
& N/A & N/A\\
\hline
& \multicolumn{2}{c}{What is the objective of the game?}\\
\cline{2-3}
\multirow{8}{3em}{Pacman} & The player earns points by eating pellets and avoiding ghosts. & To score as many points as you can practice clearing the maze of dots before trying to gobble up the ghosts.\\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{How to succeed in the game?}\\
\cline{2-3}
& The player earns points by eating pellets and avoiding ghosts. & Score as many points as you can. \\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{How to score at the game?}\\
\cline{2-3}
& The player earns points by eating pellets and avoiding ghosts. & N/A\\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{Who are your enemies?}\\
\cline{2-3}
& N/A & Ghosts. stay close to an energy pill before eating it, and tease the ghosts.\\
\hline
& \multicolumn{2}{c}{What is the objective of the game?}\\
\cline{2-3}
\multirow{8}{3em}{Tennis} & The first player to win one six-game set is declared the winner of the match. & N/A\\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{How to succeed in the game?}\\
\cline{2-3}
& The first player to win one six-game set is declared the winner of the match. & precisely aim your shots and hit them out of reach of your opponent. \\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{How to score at the game?}\\
\cline{2-3}
& N/A & The first player to win at least 6 games and be ahead by two games wins the set.\\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{Who are your enemies?}\\
\cline{2-3}
& N/A & N/A\\
\hline
& \multicolumn{2}{c}{What is the objective of the game?}\\
\cline{2-3}
\multirow{8}{3em}{Breakout} & The player must knock down as many bricks as possible. & Destroy the two walls using five balls To destroy the wall in as little time as possible The first player or team to completely destroy both walls or score the most points To smash their way through the wall and score points\\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{How to succeed in the game?}\\
\cline{2-3}
& the player must knock down as many bricks as possible by using the walls and/or the paddle below to hit the ball against the bricks and eliminate them & The first team to destroy a wall or score the most points after playing five balls wins the game To destroy the walls in as little time as possible \\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{How to score at the game?}\\
\cline{2-3}
& N/A & Scores are determined by the bricks hit during a game A player scores points by hitting one of the wall's bricks Smash their way through the wall and score points.\\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{Who are your enemies?}\\
\cline{2-3}
& N/A & N/A\\
\hline
\end{tabular}
}
\caption{\label{table:wiki_vs_official_FULL} Table showing the outputs of the QA Extraction module on Wikipedia instructions vs the official Atari manual. The Wikipedia manual is significantly shorter, and contains less information, causing the extractive QA model to repeat answers. However, the overall information, across the 4 questions, captured by the Extraction module is in good agreement across Wiki and Official manuals.}
\end{table}
\section{Background}\label{sec2}

\subsection{Reducing sample complexity in RL}
Prior works have attempted to improve sample efficiency with new exploration techniques \cite{schulman2017proximal,haarnoja2018soft,badia2020agent57}, self-supervised objectives \cite{pathak2017curiosity,muzero}, and static demonstration data \cite{kumar2019stabilizing,kumar2020conservative}. \citet{fan2022generalized} cast the RL training problems into a training data distribution optimization problem, and propose a policy mathematically controlled to traverse high-value and non-trivial states. However, such a solution requires manual reward shaping. Currently, the SOTA RL agent \cite{badia2020agent57} still requires billions of frames to complete the Skiing game that can be mastered by humans within minutes.


\subsection{Grounding objects for control problems}
\label{background:instruction}
Most prior works study grounding -- associating perceived objects with appropriate text -- in the setting of step-by-step instruction following for object manipulation tasks \cite{wang2016learning,bahdanau2018learning} or indoor navigation tasks \cite{chaplot2018gated,janner2018representation,chen2019touchdown,shridhar2020alfred}. 
The common approach has been to condition the agent's policy on the embedding of both the instruction and observation \cite{mei2016listen,hermann2017grounded,janner2018representation,misra2017mapping,chen2019touchdown,pashevich2021episodic}. Recently, modular solutions have been shown to be promising \cite{min2021film}. However, the majority of these works use synthetic language. These often take the form of templates, e.g., ``what colour is $<$object$>$ in $<$room$>$'' \cite{chevalier2018babyai}. 
On the other hand, recent attempts on large-scale image-text backbones, e.g., CLIP \cite{radford2021learning} have shown robustness at different styles of visual inputs, even on sketches. Therefore, CLIP may be the current best option for zero-shot grounding.

\subsection{Reinforcement learning informed by natural language}
\label{background:RL_text}
In the language-assisted setting, step-by-step instructions have been used to generate auxiliary rewards, when environment rewards are sparse. \citet{goyal2019using,wang2019reinforced} use auxiliary reward-learning modules trained offline to predict whether trajectory segments correspond to natural language annotations of expert trajectories.

In a more general setting, text may contain both information about optimal policy and environment dynamics. 
\citet{branavan2012learning} improve Monte-Carlo tree search planning by accessing a natural language manual. They apply their approach to the first few moves of the game Civilization II, a turn-based strategy game. However, many of the features they use are handcrafted to fit the game of Civilization II, which limits the generalization of their approach.

\citet{narasimhan2018grounding,wang2021grounding} investigate planning in a 2D game environment with a fixed number of entities that are annotated by natural language (e.g. the `spider' and `scorpion' entities might be annotated with the descriptions ``randomly moving enemy'' and ``an enemy who chases you'', respectively). The agent learns to generalize to new environment settings by learning the correlation between the annotations and the environmental goals and mechanics. The proposed agent achieves better performance than the baselines, which do not use natural language. However the design of the 2D environment uses 6-line descriptions created from templates. Such design over-simplifies the observations into a 2D integer matrix, and removes the need for visual understanding and generalization to new objects and new formats of instruction manual, e.g., human written manuals.

\subsection{RL Models that Read Natural Language Instructions}


\citet{zhong2019rtfm,zhong2021silg} make use of special architectures with Bidirectional Feature-wise Linear Modulation Layers that support multi-hop reasoning on 6 $\times$ 6 grid worlds with template-generated instruction manuals. However, the model requires 200 million training samples from templates identical to the test environments. Such a training requirement results in performance loss even on 10 $\times$ 10 grid worlds with identical mechanics, thus limiting the generalization of the model.

\citet{wang2021grounding} mixes embedding of entities with multi-modal attention to get entity representations, which are then fed to a CNN actor. The attention model and CNN actor are trained jointly using RL. However, the whole framework is designed primarily for a grid world with a template generated instruction manual consisting of one line per entity.
In our experiments, we find that extracting entity embeddings from human-written Atari instruction manuals following \citet{wang2021grounding} does not improve the performance of our vision-based agent due to the complexity and diversity of the language.

\section{Conclusions}
In this work, we propose Read and Reward, a method that assists and speeds up RL algorithms on the Atari challenges by reading downloaded game manuals released by the Atari game developers. Our method consists of a QA extraction module that extracts and summarizes relevant information from the manual and a reasoning module that assigns auxiliary rewards to detected in-game events by reasoning with information from the manual. The auxiliary reward is then provided to a standard A2C RL agent. 

To our knowledge, this work is the first successful attempt for using instruction manuals in a fully automated and generalizable framework for solving the widely used Atari RL benchmark \cite{badia2020agent57}. When assisted by our design, A2C improves on 4 games in Atari without immediate reward and requires 1000x fewer training frames compared to the previous SOTA Agent 57 on the hardest Skiing game.
\section{Limitations and Future Works}
\label{sec:limitations}
One of the main implementation challenges and limitations is the requirement of object localization and detection as mentioned in Section~\ref{sec:grounding}.
However, these limitations are not present in environments providing object ground-truth, such as modern game environments \cite{fan2022minedojo}, and virtual reality worlds \cite{ai2thor}. Other environments, such as indoor navigation and autonomous driving, include relatively reliable backbone object detection models \cite{he2017mask}.
In addition, with recent progress on video-language models, we believe that there will soon be more reliable solutions as drop-in replacements for our current algorithm for localization and detection.

Another simplification we made in Section~\ref{limitation_distance} is to consider only interactions where objects get sufficiently \textbf{close} to each other. While such simplification appears to be sufficient in Atari, where observations are 2D, one can imagine this to fail in a 3D environment like Minecraft \cite{fan2022minedojo} or an environment where multiple types of interaction could happen between same objects \cite{ai2thor}. Future works would benefit by grounding and tracking more types of events and interactions following \citet{wang2022language}.
\section{Introduction}\label{sec1}

Reinforcement Learning (RL) has achieved impressive performance in a variety of tasks, such as Atari \cite{mnih2015human,badia2020agent57}, Go \cite{silver2017mastering}, or autonomous driving \cite{GranTurismo,survey_driving}, and is hypothesized to be an important step toward artificial intelligence \cite{SILVER2021103535}. However, RL still faces a great challenge when applied to complicated, real-world-like scenarios \cite{shridhar2020alfworld,szot2021habitat} due to its low sample efficiency. 
The Skiing game in Atari, for example, requires the skier (agent) to ski down a snowy hill and hit the objective gates in the shortest time, while avoiding obstacles. Such an intuitive game with simple controls (left, right) still required 80 billion frames to solve with existing RL algorithms \cite{badia2020agent57}, roughly equivalent to 100 years of non-stop playing. 

Observing the gap between RL and human performance, we identify an important cause: the lack of knowledge and understanding about the game. One of the most available sources of information, that is often used by humans to solve real-world problems, is unstructured text, e.g., books, instruction manuals, and wiki pages. In addition, unstructured language has been experimentally verified as a powerful and flexible source of knowledge for complex tasks \cite{tessler2021learning}. Therefore, we hypothesize that RL agents could benefit from reading and understanding publicly available instruction manuals and wiki pages.

In this work, we demonstrate the possibility for RL agents to benefit from human-written instruction manuals. 
We download per-game instruction manuals released by the original game developers\footnote{\href{https://atariage.com/system_items.php?SystemID=2600&itemTypeID=MANUAL}{atariage.com}} or Wikipedia and experiment with 4 Atari games where the object location ground-truths are available. 

Two challenges arise with the use of human-written instruction manuals: 
1) \emph{Length}: Manuals contain a lot of redundant information and are too long for current language models. However, only a few paragraphs out of the 2-3 page manuals contain relevant information for any specific aspect of concern, i.e., the game's objective or a specific interaction. 
2) \emph{Reasoning}: References may be implicit and attribution may require reasoning across paragraphs. For example, in Skiing, the agent needs to understand that hitting an obstacle is bad using both ``you lose time if the skier hits an obstacle" and ``the objective is to finish in shortest time", which are from different parts of the manual. In addition, the grounding of in-game features to references in the text remains an open problem.
In addition to these challenges, manuals are unstructured and contain significant syntactic \emph{variation}. For example, the phrase ``don't hit an obstacle" is also referred to as ``don't get wrapped around a tree" or ``avoid obstacles". Finally, the grounding of in-game features to references in the text remains an open problem \cite{luketina2019survey}.

Our proposed Read and Reward framework (Figure~\ref{fig:archi}) addresses the challenges with a two-step process. First, a zero-shot extractive QA module is used to extract and summarize relevant information from the manual, and thus tackles the challenge of \emph{Length}. Second, a zero-shot reasoning module, powered by a large language model, \emph{reasons} about contexts provided by the extractive QA module and assigns auxiliary rewards to in-game interactions. After object detection and localization, our system registers an interaction based on the distance between objects. The auxiliary reward from the interaction can be consumed by any RL algorithm.

Even though we only consider ``hit" interactions (detected by tracking distance between objects) in our experiments, Read and Reward still achieves 60\% performance improvement on a baseline using 1000x fewer frames than the SOTA for the game of Skiing. In addition, since Read and Reward does not require explicit training on synthetic text \cite{hermann2017grounded,chaplot2018gated,janner2018representation,narasimhan2018grounding,zhong2019rtfm,zhong2021silg,wang2021grounding}, we observe consistent performance on manuals from two \emph{different} sources: Wikipedia and Official manuals. 

To our knowledge, our work is the first to demonstrate the capability to improve RL performance in an end-to-end setting using real-world manuals designed for human readers.
\section{Read and Reward}\label{sec:read_and_reward}

We identify two key challenges to understanding and using the instruction manual with an RL agent.

The first challenge is \textbf{Length}. 
Due to the sheer length of the raw text ($2\sim 3$ pages), current pre-trained language models cannot handle the raw manual due to input length constraints. Additionally, most works in the area of long-document reading and comprehension \cite{beltagy2020longformer,ainslie2020etc,zemlyanskiy2021readtwice} require a fine-tuning signal from the task, which is impractical for RL problems since the RL loss already suffers from high variance. Furthermore, an instruction manual may contain a lot of task-irrelevant information. For example, the official manual for MsPacman begins with: ``MS. PAC-MAN and characters are trademark of Bally Midway Mfg. Co. sublicensed to Atari, Inc. by Namco-America, Inc.''

\begin{figure*}[t]
\vspace{-3mm}
\centering
\centerline{\includegraphics[width=0.9\textwidth]{imgs/extract.png}}
\vspace{-6mm}
\caption{Illustration of the QA Extraction Module on the game PacMan. We obtain generic information about the game by running extractive QA on 4 generic questions (3 shown since one question did not have an answer). We then obtain object-specific information using a question template. We concatenate generic and object-specific information to obtain the $<$context$>$ string (used in Figure~\ref{fig:reasoning}).}
\label{fig:QA_extraction}
\vspace{-3mm}
\end{figure*}
It is therefore intuitive in our first step to have a QA Extraction Module that produces a summary directed toward game features and objectives, to remove distractions and simplify the problem of reasoning. 

The second challenge is \textbf{Reasoning}. Information in the manual may be implicit. For example, the Skiing manual states that the goal is to arrive at the end in the fastest time, and that the player will lose time when they hit a tree, but it never states directly that hitting a tree reduces the final reward. Most prior works \cite{eisenstein2009reading,narasimhan2018grounding,wang2021grounding,zhong2019rtfm,zhong2021silg} either lack the capability of multi-hop reasoning or require extensive training to form reasoning for specific manual formats, limiting the generalization to real-world manuals with a lot of variations like the Atari manuals.

On the other hand, large language models have achieved success without training in a lot of fields including reading comprehension, reasoning, and planning \cite{brown2020language,kojima2022large,ahn2022can}. Therefore, in the Reasoning Module, we compose natural language queries about specific object interactions in the games, to borrow the in-context reasoning power of large language models.

The rest of the section describes how the two modules within our Read and Reward framework (a QA extraction module and a reasoning module) are implemented.

\paragraph{QA Extraction Module (Read)}\label{method:qa_extraction}
In order to produce a summary of the manual directed towards game features and objectives, we follow the extractive QA framework first proposed by \citet{bert}, which extracts a text sub-sequence as the answer to a question about a passage of text.
The extractive QA framework takes raw text sequence $S_{\text{manual}}=\{w^0, ...,w^L\}$ and a question string $S_{\text{question}}$ as input. Then for each token (word) $w^i$ in $S_{\text{manual}}$, the model predicts the probability that the current token is the start token $p^i_{\text{start}}$ and end token $p^i_{\text{end}}$ with a linear layer on top of word piece embeddings. Finally, the output $S_{\text{answer}}=\{w^{\text{start}},...,w^{\text{end}}\}$ is selected as a sub-sequence of $S_{\text{manual}}$ that maximizes the overall probability of the start and end: $w^{\text{start}},w^{\text{end}} = \arg\max_{w^{\text{start}},w^{\text{end}}} p^{\text{start}}p^{\text{end}}$.

Implementation-wise, we use a RoBERTa-large model \cite{liu2019roberta} for Extractive QA fine-tuned on the SQUAD dataset \cite{rajpurkar2016squad}. Model weights are available from the AllenNLP API\footnote{\href{https://demo.allennlp.org/reading-comprehension/transformer-qa}{AllenNLP Transformer-QA}}. 
To handle instruction manual inputs longer than the maximum length accepted by RoBERTa, we split the manual into chunks according to the max-token size of the model, and concatenate extractive QA outputs on each chunk.

As shown in Figure~\ref{fig:QA_extraction}, we compose a set of generic prompts about the objective of the game, for example, ``What is the objective of the game?''. In addition, to capture information on agent-object interactions, we first identify the top 10 important objects using TFIDF. Then for each object, we obtain an answer to the generated query:``What happens when the player hit a $<$object$>$?''.

Finally, we concatenate all non-empty question-answer pairs per interaction into a $<$context$>$ string.
Note that the format of context strings closely resembles that of a QA dialogue.

\begin{figure}[h]
\centering
\centerline{\includegraphics[width=0.45\textwidth]{imgs/reasoning.png}}
\vspace{-3mm}
\caption{Illustration of the QA Extraction Module. The $<$context$>$ (from Figure~\ref{fig:QA_extraction}) related to the object \textbf{ghost} from the QA Extraction module is concatenated with a template-generated question to form the zero-shot in-context reasoning prompt for a Large Language Model. The Yes/No answer from the LLM is then turned into an auxiliary reward for the agent.}
\label{fig:reasoning}
\vspace{-3mm}
\end{figure}
\paragraph{Zero-shot reasoning with pre-trained QA model (Reward)}\label{method:read_and_reward:reasoning}
The summary $<$context$>$ string from the QA Extraction module could directly be used as prompts for reasoning through large language models \cite{brown2020language,kojima2022large,ahn2022can}.

Motivated by the dialogue structure of the summary, and the limit in computational resources, we choose Macaw \cite{macaw}, a general-purpose zero-shot QA model with performance comparable to GPT-3 \cite{brown2020language}. Compared to the RoBERTa extractive QA model for the QA Extraction module, Macaw is significantly larger and is better suited for reasoning. Although we find that GPT-3 generally provides more flexibility and a better explanation for its answers for our task, Macaw is faster and open-source. 

As shown in Figure~\ref{fig:reasoning}, for each interaction we compose the query prompt as ``$<$context$>$ Question: Should you hit a $<$object of interaction$>$ if you want to win? Answer: '', and calculate the LLM score on choices \{Yes, No\}. The Macaw model directly supports scoring the options in a multiple-choice setting. We note that for other generative LLMs, we could use the probability of ``Yes" v.s. ``No" as scores similar to \citep{ahn2022can}.

Finally, during game-play, a rule-based algorithm detects `hit' events by checking the distance between bounding boxes. For detected interactions, an auxiliary reward is provided to the RL algorithm according to \{Yes, No\} rating from the reasoning module (see Section~\ref{limitation_distance} for details).
\section{Experiments}

\subsection{Atari Environment}
The Openai gym Atari environment contains diverse Atari games designed to pose a challenge for human players. The observation consists of a single game screen (frame): a 2D array of 7-bit pixels, 160 pixels wide by 210 pixels high. The action space consists of the 18 discrete actions defined by the joystick controller. Instruction manuals released by the original game developers have been scanned and parsed into publicly available HTML format\footnote{\href{https://atariage.com/system_items.php?SystemID=2600&itemTypeID=MANUAL}{atariage.com}}. For completeness, we also attach an explanation of the objectives for each game in Appendix~\ref{game_objectives}.

\subsection{Atari Baselines}

\paragraph{A2C} A2C \cite{mnih2016asynchronous} and PPO \cite{schulman2017proximal} are among some of the earliest Actor-Critic algorithms that brought success for deep RL on Atari games. A2C learns a policy $\pi$ and a value function $V$ to reduce the variance of the REINFORCE \cite{sutton1999policy} algorithm. The algorithm optimizes the advantage function $A(s,a)=Q(s,a)-V(s)$ instead of the action value function $Q(s,a)$ to further stabilize training.

\paragraph{MuZero}
MuZero algorithm \cite{muzero} combines a tree-based search with a learned model of the environment. The model of MuZero iteratively predicts the reward, the action-selection policy, and the value function. When applied to the Atari benchmarks, MuZero achieves super-human performance in a lot of games.

\paragraph{R2D2}
R2D2 \cite{kapturowski2018recurrent} proposes an improved training strategy for RNN-based RL agents with distributed prioritized experience replay. The algorithm is the first to exceed human-level performance in 52 of the 57 Atari games.

\paragraph{Agent 57}
\citet{badia2020agent57} train a neural network that parameterizes a family of policies with different exploration, and proposes a meta-controller to choose which policy to prioritize during training. Agent 57 exceeds human-level performance on all 57 Atari games.

\subsection{Delayed Reward Schedule}\label{method:delayed_reward}
Most current Atari environments include a dense reward structure, i.e., the reward is obtained quite often from the environment. Indeed, most RL algorithms for the Atari environments perform well due to their dense reward structure.
However, in many real-world scenarios \cite{ai2thor,wang2022learning} or open-world environments \cite{fan2022minedojo}, it is expensive to provide a dense reward. The reward is usually limited and ``delayed" to a single positive or negative reward obtained at the very end of an episode (e.g., once the robot succeeds or fails to pour the coffee). 

Notably, Atari games with the above ``delayed" reward structure, such as Skiing, pose great challenges for current RL algorithms, and are among some of the hardest games to solve in Atari \cite{badia2020agent57}.
Therefore, to better align with real-world environments, we use a \emph{delayed reward} version of the tennis, breakout, and Pac-Man games by providing the reward (the ``final game score") only at the end of the game. This delayed reward structure does not change the optimization problem, but is more realistic and imposes additional challenges to RL algorithms.

\subsection{Grounding Objects in Atari}
\label{sec:grounding}
We find grounding (detecting and relating visual objects to keywords from TFIDF) a significant challenge to applying our framework. In some extensively studied planning/control problems, such as indoor navigation, pre-trained models such as masked RCNN reasonably solve the task of grounding \cite{shridhar2020alfred}. However, the objects in the Atari environments are too different from the training distribution for visual backbones like mask RCNN. Therefore, the problem becomes very challenging for the domain of Atari. Since it is not the main focus of our proposed work, we attempt to provide an unsupervised end-to-end solution only for the game of Skiing.

\paragraph{Full end-to-end pipeline on Skiing}
\label{skiing_pipeline}
To demonstrate the full potential of our work, we offer a proof-of-concept agent directly operating on raw visual game observations and the instruction manual for the game of Skiing. For unsupervised object localization, we use SPACE, a method that uses spatial attention and decomposition to detect visual objects \cite{lin2020space}. SPACE is trained on 50000 random observations from the environment, and generates object bounding boxes. These bounded boxes are fed to CLIP, a zero-shot model that can pair images with natural language descriptions \cite{radford2021learning}, to classify the bounding box into categories defined by key-words from TF-IDF. The SPACE/CLIP models produce mostly reasonable results; however, in practice we find that CLIP lacks reliability over time, and SPACE cannot distinguish objects that cover each other. Therefore, to improve classification reliability, we use a Multiple Object Tracker \cite{sort} and set the class of the bounding box to be the most dominant class over time.

\begin{figure}[ht]
\vspace{-3mm}
\centering
\centerline{\includegraphics[width=0.5\textwidth]{imgs/detection_failure.png}}
\vspace{-2mm}
\caption{\label{fig:detection_failure}Examples of SPACE \cite{lin2020space} and CLIP \cite{radford2021learning} in the full end-to-end pipeline (Section~\ref{skiing_pipeline}). The top row shows bounding boxes for objects and the bottom row shows corresponding object masks as detected by SPACE. Most of the bounding boxes generated are correct. \textbf{Left}: SPACE confuses bounding boxes of agent and tree into one and the box gets classified as ``tree" (blue), and the auxiliary penalty is not properly triggered. \textbf{Right}: The flag next to the agent (in red circle) is not detected, and therefore the auxiliary reward is not provided.}
\vspace{-4mm}
\end{figure}
\paragraph{Ground-truth object localization/grounding experiments}
\label{ram_pipeline}
For other games (Tennis, Ms. Pac-Man, and Breakout), we follow \citet{anand2019unsupervised} and directly calculate the coordinates of all game objects from the simulator RAM state. Specifically, we first manually label around 15 frames with object name and X,Y coordinates and then train a linear regression model mapping from simulator RAM state to labeled object coordinates. Note that this labeling strategy does not work for games in which multiple objects of the same kind appears in a frame, i.e., multiple trees appear in one frame of Skiing.

\subsection{Simplifying Interaction Detection to Distance Tracking}
\label{limitation_distance}
We make the assumption that all object interactions in Atari can be characterized by the distance between objects, and we ignore all interactions that do not contain the agent. We note that since the Atari environment is in 2D, it suffices to consider only interactions where objects get close to the agent so that we can query the manual for ``hit" interaction.

We, therefore, monitor the environment frames and register interaction whenever an object's bounding boxes intersect with the agent. For each detected interaction, we assign rewards from $\{-r_n, r_p\}$ accordingly as detailed in Section~\ref{sec:read_and_reward}. Experimentally, we find that it suffices to set $r_n=r_p=5$.

\subsection{RL Agent}
Since Read and Reward only provides an auxiliary reward, it is compatible with any RL algorithms like PPO, SAC, TD3, and even Agent 57 \cite{schulman2017proximal,haarnoja2018soft,fujimoto2018addressing,badia2020agent57}. For simplicity, we use a popular open-source implementation \cite{stooke2019rlpyt} of A2C \cite{mnih2016asynchronous}. 


\begin{figure}[t]
\vspace{-3mm}
\centering
\centerline{\includegraphics[width=0.5\textwidth]{figures/scatter.pdf}}
\vspace{-4mm}
\caption{\label{fig:skiing}Performance (in Game Score) vs Efficiency (in Frames Seen) of fully end-to-end Read and Reward (green) compared to an A2C baseline, and MuZero \cite{muzero} on Skiing. Benefiting from the auxiliary rewards from the instruction manual, Read and Reward out-performs the A2C Baseline and achieves 60\% improvement on random performance while using much fewer training frames compared to the SOTA mixed-policy Agent 57~\cite{badia2020agent57} and GDI-H3~\cite{fan2022generalized}.}
\vspace{-3mm}
\end{figure}
\begin{figure}[t]
\centering
\centerline{\includegraphics[width=0.55\textwidth]{figures/Breakout.pdf}}
\centerline{\includegraphics[width=0.55\textwidth]{figures/Tennis.pdf}}
\centerline{\includegraphics[width=0.55\textwidth]{figures/MsPacman.pdf}}
\caption{Comparison of the training curves of Read and Reward v.s. A2C Baseline in terms of game score alongside Auxiliary Reward v.s. Frames Seen under our delayed reward setting for 3 different Atari games. Read and Reward (red) consistently outperforms the A2C baseline (brown) in all games. In addition, the auxiliary reward from Read and Reward demonstrates a strong positive correlation with the game score, suggesting that the model benefits from optimizing the auxiliary rewards at training time.}
\label{fig:learning_curves}
\end{figure}
\begin{table*}[t]
{\centering
\begin{tabular}{l|m{20em} @{\hskip 0.3in} m{20em}}
Game  & \multicolumn{1}{c}{Wikipedia} & \multicolumn{1}{c}{Official Manual}\\
\hline

& \multicolumn{2}{c}{What is the objective of the game?}\\
\cline{2-3}
\multirow{8}{3em}{Pacman} & The player earns points by eating pellets and avoiding ghosts. & To score as many points as you can practice clearing the maze of dots before trying to gobble up the ghosts.\\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{How to succeed in the game?}\\
\cline{2-3}
& The player earns points by eating pellets and avoiding ghosts. & Score as many points as you can. \\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{How to score at the game?}\\
\cline{2-3}
& The player earns points by eating pellets and avoiding ghosts. & N/A\\
\cline{2-3}\cline{2-3}
& \multicolumn{2}{c}{Who are your enemies?}\\
\cline{2-3}
& N/A & Ghosts. stay close to an energy pill before eating it, and tease the ghosts.\\
\hline
\end{tabular}
}
\caption{\label{table:wiki_vs_official} Table showing the outputs of the QA Extraction module on Wikipedia instructions vs the official Atari manual. The Wikipedia manual is significantly shorter, and contains less information, causing the extractive QA model to use the same answer for all questions. Full table for all 4 games is shown in Table~\ref{table:wiki_vs_official_FULL} in the Appendix.}
\vspace{-3mm}
\end{table*}
\subsection{Full Pipeline Results on Skiing}
We compare the performance of our Read and Reward agent using the full pipeline to baselines in Figure~\ref{fig:skiing}, our agent is best on the efficiency axis and close to best on the performance axis. A2C equipped with Read and Reward outperforms A2C baseline and more complex methods like MuZero \cite{muzero}. We note a 50\% performance gap between our agent and the SOTA Agent 57, but our solution does not require policy mixing and requires better than 1000 times fewer frames to converge to performance 60\% better than prior works. 

When experimenting with a full end-to-end pipeline as described in Section~\ref{skiing_pipeline}, we notice instability with the detection and grounding models. As shown in Figure~\ref{fig:detection_failure}, missing bounding boxes cause lots of mislabeling when objects are close to each other. Such errors may lead to the auxiliary reward being incorrectly assigned. We observe that performance could further be improved 

Although Skiing is arguably one of the easiest games in Atari for object detection and grounding, with easily distinguishable objects and white backgrounds, we still observe that current unsupervised object detection and grounding techniques lack reliability. 

\begin{table}[h]
{\centering
\begin{tabular}{lccc}
                                             & R\& R Wiki & R\& R Official & A2C\\
\hline
Tennis                                       & \textbf{-5} & \textbf{-5} & -23 \\
Pacman                                  & \textbf{580} & \textbf{580} & 452 \\
Breakout                                     & \textbf{14} & \textbf{14} & 2  
\end{tabular}
}
\vspace{-2mm}
\caption{\label{table:wiki_results} Table of the game score of algorithms trained under \emph{delayed reward} schedule, Read and Reward consistently outperforms the baseline A2C agent, which does not use the manual, using either Wikipedia (\textbf{Wiki}) instructions or official Atari manual (\textbf{Official}).}
\vspace{-3mm}
\end{table}
\subsection{Results on Games with Delayed Reward Structure}

Since the problem of object detection and grounding is not the main focus of this paper, we obtain ground-truth labeling for object locations and classes as mentioned in Section~\ref{ram_pipeline}. In addition, for a setting more generalizable to other environments~\cite{ai2thor,fan2022minedojo}, we implement \emph{delayed reward} schedule (Section~\ref{method:delayed_reward}) for Tennis, MsPacman, and Breakout. 

We plot the Game Score and the auxiliary rewards of the Read and Reward agent v.s. the A2C baseline in Figure~\ref{fig:learning_curves}. The A2C baseline fails to learn under the sparse rewards, while the performance of the Read and Reward agent continues to increase with more frames seen. In addition, we observe that the auxiliary reward from the Read and Reward framework has a strong positive correlation with the game score (game reward) of the agent, suggesting that the A2C agent benefits from optimizing the auxiliary rewards at training time.

\subsection{R\&R Behavior on Manuals from Different Source}
To illustrate the generalization of our proposed framework to a different source of information, we download the ``Gameplay" section from Wikipedia\footnote{\href{https://en.wikipedia.org/}{wikipedia.org}} for each game and feed the section as manual to our model. In Table~\ref{table:wiki_vs_official}, we show a comparison of the answers on the set of 4 generic questions on the Pacman by our QA Extraction Module. Note that the QA Extraction module successfully extracts all important information about avoiding ghosts and eating pellets to get points. In addition, since the official manual contains more information than the Gameplay section of Wikipedia (2 pages v.s. less than 4 lines), we were able to extract much more information from the official manual. 

Due to the high similarity in the output from the QA Extraction Module, the Reasoning module demonstrates good agreement for both the Wiki manual and the official manual. Therefore, we observe consistent agent performance between the official manual and the Wikipedia manual in terms of game score (Table~\ref{table:wiki_results}).