Our submission contains three supplementary materials:

1. The appendix (i.e., supplementary text and figures) is presented after the references in the main paper, as suggested by the author guidelines.
2. The demo video of the timing accuracy experiment demonstrates the effectiveness of PrObe in detecting policy behavior errors with accurate timing.
3. The complete representation visualizations support our claims and illustrate the better explainability of PrObe.

Regarding the reproducibility of our work, we have provided implementation details of the FSI policy and AED methods in the main paper and appendix. We will release the source code as soon as the paper is accepted. Moreover, we are happy to provide the evaluation-only code confidentially during the author-reviewer discussion if any reviewer requests it.