Robustness Evaluation of Proxy Models against Adversarial Optimization

24 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: proxy gaming, reward hacking, specification gaming, misspecification, robustness, adversarial robustness, adversarial attacks, alignment, ai safety
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Ensuring the robustness of neural network proxies to optimization pressure is crucial as machine learning applications expand across diverse domains. However, research on proxy robustness remains largely unexplored. In this paper, we introduce a comprehensive benchmark for investigating the robustness of neural network proxies under various sources of optimization pressure in the text domain. Through extensive experiments using our benchmark, we uncover previously unknown properties of the proxy gaming problem and highlight serious issues with proxy reward models currently used to fine-tune or monitor large language models. Furthermore, we explore different approaches to enhancing proxy robustness and demonstrate the potential of adversarial training to improve alignment between proxy and gold models. Our findings suggest that proxy robustness is a solvable problem that can be incrementally improved, laying the groundwork for future research in this important area.
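To make the proxy-gaming setup concrete, here is a minimal sketch of the adversarial-training loop the abstract alludes to: find inputs that game the proxy (score highly under the proxy) via optimization pressure, then fine-tune the proxy to match the gold model on those inputs. This is a hypothetical toy illustration, not the paper's method: it uses small MLP reward models over feature vectors rather than text-domain models, and all names (`gold`, `proxy`, `attack`) are assumptions made for the sketch.

```python
# Toy sketch of adversarially training a proxy reward model against a gold model.
# Assumptions: reward models are small MLPs over feature vectors; the "attack"
# is gradient ascent on the proxy's score. The paper's benchmark is text-based.
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM = 16

def make_model():
    return nn.Sequential(nn.Linear(DIM, 64), nn.Tanh(), nn.Linear(64, 1))

gold = make_model()    # stand-in for the expensive "gold" reward model
proxy = make_model()   # cheap proxy we want to harden

for p in gold.parameters():       # gold model is fixed supervision
    p.requires_grad_(False)

opt = torch.optim.Adam(proxy.parameters(), lr=1e-3)

def attack(x, steps=20, lr=0.1):
    """Apply optimization pressure: ascend the proxy's score w.r.t. the input."""
    x = x.clone().requires_grad_(True)
    for _ in range(steps):
        score = proxy(x).sum()
        (grad,) = torch.autograd.grad(score, x)
        x = (x + lr * grad).detach().requires_grad_(True)
    return x.detach()

for step in range(200):
    x = torch.randn(32, DIM)
    x_adv = attack(x)                  # inputs that game the current proxy
    batch = torch.cat([x, x_adv])
    target = gold(batch).detach()      # supervise with the gold model's scores
    loss = nn.functional.mse_loss(proxy(batch), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        gap = (proxy(x_adv) - gold(x_adv)).abs().mean().item()
        print(f"step {step}: proxy-gold gap on adversarial inputs = {gap:.3f}")
```

Under this scheme, the proxy-gold gap on adversarial inputs should shrink over training, which is the sense in which adversarial training "improves alignment between proxy and gold models."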
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8708