LLM-Fuzzer: Scaling Assessment of Large Language Model Jailbreaks

Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing

Research output: Chapter in Book/Report/Conference proceedingConference contribution

1 Scopus citations

Abstract

The jailbreak threat poses a significant concern for Large Language Models (LLMs), primarily due to their potential to generate content at scale. If not properly controlled, LLMs can be exploited to produce undesirable outcomes, including the dissemination of misinformation, offensive content, and other forms of harmful or unethical behavior. To tackle this pressing issue, researchers and developers often rely on red-team efforts to manually create adversarial inputs and prompts designed to push LLMs into generating harmful, biased, or inappropriate content. However, this approach encounters serious scalability challenges. To address these scalability issues, we introduce an automated solution for large-scale LLM jailbreak susceptibility assessment called LLM-FUZZER. Inspired by fuzz testing, LLM-FUZZER uses human-crafted jailbreak prompts as starting points. By employing carefully customized seed selection strategies and mutation mechanisms, LLM-FUZZER generates additional jailbreak prompts tailored to specific LLMs. Our experiments show that LLM-FUZZER-generated jailbreak prompts demonstrate significantly increased effectiveness and transferability. This highlights that many open-source and commercial LLMs suffer from severe jailbreak issues, even after safety fine-tuning.

Original languageEnglish (US)
Title of host publicationProceedings of the 33rd USENIX Security Symposium
PublisherUSENIX Association
Pages4657-4674
Number of pages18
ISBN (Electronic)9781939133441
StatePublished - 2024
Event33rd USENIX Security Symposium, USENIX Security 2024 - Philadelphia, United States
Duration: Aug 14 2024Aug 16 2024

Publication series

NameProceedings of the 33rd USENIX Security Symposium

Conference

Conference33rd USENIX Security Symposium, USENIX Security 2024
Country/TerritoryUnited States
CityPhiladelphia
Period8/14/248/16/24

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Safety, Risk, Reliability and Quality

Fingerprint

Dive into the research topics of 'LLM-Fuzzer: Scaling Assessment of Large Language Model Jailbreaks'. Together they form a unique fingerprint.

Cite this