TY - GEN
T1 - AUTODAN-TURBO
T2 - 13th International Conference on Learning Representations, ICLR 2025
AU - Liu, Xiaogeng
AU - Li, Peiran
AU - Suh, Edward
AU - Vorobeychik, Yevgeniy
AU - Mao, Zhuoqing
AU - Jha, Somesh
AU - McDaniel, Patrick
AU - Sun, Huan
AU - Li, Bo
AU - Xiao, Chaowei
N1 - Publisher Copyright:
© 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
PY - 2025
Y1 - 2025
AB - In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo can significantly outperform baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an 88.5% attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo can achieve an even higher attack success rate of 93.4% on GPT-4-1106-turbo.
UR - https://www.scopus.com/pages/publications/105010220016
UR - https://www.scopus.com/inward/citedby.url?scp=105010220016&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:105010220016
T3 - 13th International Conference on Learning Representations, ICLR 2025
SP - 22337
EP - 22384
BT - 13th International Conference on Learning Representations, ICLR 2025
PB - International Conference on Learning Representations, ICLR
Y2 - 24 April 2025 through 28 April 2025
ER -