Can Go AIs be adversarially robust?

Published: 28 Jun 2024, Last Modified: 25 Jul 2024
Venue: NextGenAISafety 2024 Poster
License: CC BY 4.0
Keywords: robustness, reinforcement learning, alignment, Go
TL;DR: By testing three natural ways to improve the robustness of superhuman Go AIs, we show that achieving robustness is challenging even in narrow domains.
Abstract: Prior work found that superhuman Go AIs such as KataGo are vulnerable to opponents playing simple adversarial strategies. This shows that superhuman average-case capabilities may not translate to satisfactory worst-case robustness. However, Go AIs were never designed with security in mind, raising the question: can simple defenses make KataGo robust? In this paper, we test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find these defenses protect against previously discovered attacks, but we uncover several qualitatively distinct adversarial strategies that beat our defended agents. Our results suggest that achieving robustness is challenging, even in narrow domains such as Go. Our code is available at https://github.com/AlignmentResearch/go_attack.
Submission Number: 87
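Of the three defenses named in the abstract, iterated adversarial training is the most algorithmic: attack and defense alternate, with each new adversary trained against the latest defended victim. The sketch below illustrates that loop in Python. It is a minimal schematic under assumed placeholder functions (train_adversary, finetune_victim), not the paper's implementation; the actual training code is in the linked go_attack repository.

```python
import random

# Minimal schematic of iterated adversarial training.
# All names here are hypothetical placeholders, not the paper's API;
# the real implementation lives in the go_attack repository.

def train_adversary(victim):
    """Placeholder: train an attacker policy against a frozen copy of the victim."""
    return {"targets": victim["version"], "strength": random.random()}

def finetune_victim(victim, adversary):
    """Placeholder: fine-tune the victim on games played against the adversary."""
    return {"version": victim["version"] + 1}

def iterated_adversarial_training(rounds=3):
    victim = {"version": 0}
    for i in range(rounds):
        adversary = train_adversary(victim)          # attack phase
        victim = finetune_victim(victim, adversary)  # defense phase
        print(f"round {i}: defended victim v{victim['version']}")
    return victim

if __name__ == "__main__":
    iterated_adversarial_training()
```

Per the abstract, the limitation of this loop is that it plateaus: each round patches the previous attack, yet qualitatively distinct new adversarial strategies continue to beat the defended agents.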