Formalizing and Benchmarking Prompt Injection Attacks and Defenses

  • Yupei Liu
  • , Yuqi Jia
  • , Runpeng Geng
  • , Jinyuan Jia
  • , Neil Zhenqiang Gong

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

A prompt injection attack aims to inject malicious instruction/data into the input of an LLM-Integrated Application such that it produces results as an attacker desires. Existing works are limited to case studies. As a result, the literature lacks a systematic understanding of prompt injection attacks and their defenses. We aim to bridge the gap in this work. In particular, we propose a framework to formalize prompt injection attacks. Existing attacks are special cases in our framework. Moreover, based on our framework, we design a new attack by combining existing ones. Using our framework, we conduct a systematic evaluation on 5 prompt injection attacks and 10 defenses with 10 LLMs and 7 tasks. Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses. To facilitate research on this topic, we make our platform public at https://github.com/liu00222/Open-Prompt-Injection.

Original languageEnglish (US)
Title of host publicationProceedings of the 33rd USENIX Security Symposium
PublisherUSENIX Association
Pages1831-1847
Number of pages17
ISBN (Electronic)9781939133441
StatePublished - 2024
Event33rd USENIX Security Symposium, USENIX Security 2024 - Philadelphia, United States
Duration: Aug 14 2024Aug 16 2024

Publication series

NameProceedings of the 33rd USENIX Security Symposium

Conference

Conference33rd USENIX Security Symposium, USENIX Security 2024
Country/TerritoryUnited States
CityPhiladelphia
Period8/14/248/16/24

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Safety, Risk, Reliability and Quality

Fingerprint

Dive into the research topics of 'Formalizing and Benchmarking Prompt Injection Attacks and Defenses'. Together they form a unique fingerprint.

Cite this