
\begin{acknowledgements}
    This work was partly supported by the Institute of Information \& communications Technology Planning \& Evaluation (IITP) grant funded by the Korean government (MSIT) (IITP-2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH) and RS-2022-II220959, (part2) FewShot learning of Causal Inference in Vision and Language
    for Decision Making), the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (2022R1F1A1064569, RS-2023-00210466, RS-2023-00265444), and the POSCO Creative Ideas grant (2023Q032).
    Sungbin Shin was supported by the Kwanjeong Educational Foundation Scholarship.
    M.A. was supported by the Google Fellowship and the Open Phil AI Fellowship.
\end{acknowledgements}