AUTODAN-TURBO: A LIFELONG AGENT FOR STRATEGY SELF-EXPLORATION TO JAILBREAK LLMS

Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo can significantly outperform baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an 88.5 attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo can even achieve a higher attack success rate of 93.4 on GPT-4-1106-turbo.

Original languageEnglish (US)
Title of host publication13th International Conference on Learning Representations, ICLR 2025
PublisherInternational Conference on Learning Representations, ICLR
Pages22337-22384
Number of pages48
ISBN (Electronic)9798331320850
StatePublished - 2025
Event13th International Conference on Learning Representations, ICLR 2025 - Singapore, Singapore
Duration: Apr 24 2025Apr 28 2025

Publication series

Name13th International Conference on Learning Representations, ICLR 2025

Conference

Conference13th International Conference on Learning Representations, ICLR 2025
Country/TerritorySingapore
CitySingapore
Period4/24/254/28/25

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Computer Science Applications
  • Education
  • Linguistics and Language

Fingerprint

Dive into the research topics of 'AUTODAN-TURBO: A LIFELONG AGENT FOR STRATEGY SELF-EXPLORATION TO JAILBREAK LLMS'. Together they form a unique fingerprint.

Cite this