TY - GEN
T1 - LLM-Fuzzer: Scaling Assessment of Large Language Model Jailbreaks
T2 - 33rd USENIX Security Symposium, USENIX Security 2024
AU - Yu, Jiahao
AU - Lin, Xingwei
AU - Yu, Zheng
AU - Xing, Xinyu
N1 - Publisher Copyright:
© USENIX Security Symposium 2024. All rights reserved.
PY - 2024
Y1 - 2024
AB - The jailbreak threat poses a significant concern for Large Language Models (LLMs), primarily due to their potential to generate content at scale. If not properly controlled, LLMs can be exploited to produce undesirable outcomes, including the dissemination of misinformation, offensive content, and other forms of harmful or unethical behavior. To tackle this pressing issue, researchers and developers often rely on red-team efforts to manually create adversarial inputs and prompts designed to push LLMs into generating harmful, biased, or inappropriate content. However, this approach encounters serious scalability challenges. To address them, we introduce an automated solution for large-scale LLM jailbreak susceptibility assessment called LLM-FUZZER. Inspired by fuzz testing, LLM-FUZZER uses human-crafted jailbreak prompts as starting points. By employing carefully customized seed selection strategies and mutation mechanisms, LLM-FUZZER generates additional jailbreak prompts tailored to specific LLMs. Our experiments show that LLM-FUZZER-generated jailbreak prompts demonstrate significantly increased effectiveness and transferability. This highlights that many open-source and commercial LLMs suffer from severe jailbreak issues, even after safety fine-tuning.
UR - http://www.scopus.com/inward/record.url?scp=85204982988&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85204982988&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85204982988
T3 - Proceedings of the 33rd USENIX Security Symposium
SP - 4657
EP - 4674
BT - Proceedings of the 33rd USENIX Security Symposium
PB - USENIX Association
Y2 - 14 August 2024 through 16 August 2024
ER -