Reproducibility Study of "Explaining RL Decisions with Trajectories"

Published: 29 Apr 2024, Last Modified: 29 Apr 2024. Accepted by TMLR.
Abstract: This paper reports a reproducibility study of "Explaining RL Decisions with Trajectories" by Deshmukh et al. (2023). The authors proposed a method to explain the decisions of an offline RL agent by attributing them to clusters of trajectories encountered during training. The original paper explored various environments and conducted a human study to gauge real-world performance. Our objective is to validate the effectiveness of their proposed approach. We conducted quantitative and qualitative experiments across three environments: a Grid-world, an Atari video game (Seaquest), and a continuous control task from MuJoCo (HalfCheetah). While the authors provided the code for the Grid-world environment, we re-implemented the method for the Seaquest and HalfCheetah environments. This work extends the original paper by including trajectory rankings within a cluster, experimenting with alternative trajectory clustering, and expanding the human study. The results affirm the effectiveness of the method, both in its reproduction and in the additional experiments. However, the results of the human study suggest that the method's explanations are more challenging for humans to interpret in more complex environments. Our implementations can be found on GitHub.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We would like to thank the reviewers kindly for their feedback. The following changes have been made to the manuscript:
- p1. "While offline RL research..." -> rewrote this for correctness.
- p1. Second paragraph: 'critical state features' -> 'important aspects of observations'. Rewritten for clarity.
- p1. Second paragraph: 'attribute' -> 'highlight which past experiences are responsible for a RL agent's decision'. Added better explanation.
- p3. First paragraph: 'point to' -> 'attribute'. Less handwavy.
- p3. "trained on every cluster's complementary dataset" now has reference to Figure 2d.
- p4. Paragraph 3 was rewritten for better reading experience.
- p4. Last paragraph: "A higher LMAAVD..." -> removed sentence, as it repeated information.
- p5. Second paragraph: added explanation of why lower NWD is preferable.
- p6. Last paragraph: 'MachineLearning' -> 'Machine Learning'.
- p7. Section 4.1.2: reiterated structure of cluster sets for more clarity.
- p8. "... NWD of 0": added explanation for zero-values.
- p9. "Anecdotal evidence": rewrote this sentence to be less ambiguous.
- p9. "... topologically and semantically similar to X-means": rewrote this paragraph.
- p10. "..this is due to the inherent difficulty of understanding RL trajectories for humans": we've rewritten this sentence to be more specific.
- We've corrected the references.
Assigned Action Editor: ~Mohammad_Ghavamzadeh1
Submission Number: 2213