DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks. Our code and data are available at: https://github.com/liu00222/Open-Prompt-Injection.

Original languageEnglish (US)
Title of host publicationProceedings - 46th IEEE Symposium on Security and Privacy, SP 2025
EditorsMarina Blanton, William Enck, Cristina Nita-Rotaru
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages2190-2208
Number of pages19
ISBN (Electronic)9798331522360
DOIs
StatePublished - 2025
Event46th IEEE Symposium on Security and Privacy, SP 2025 - San Francisco, United States
Duration: May 12 2025May 15 2025

Publication series

NameProceedings - IEEE Symposium on Security and Privacy
ISSN (Print)1081-6011

Conference

Conference46th IEEE Symposium on Security and Privacy, SP 2025
Country/TerritoryUnited States
CitySan Francisco
Period5/12/255/15/25

All Science Journal Classification (ASJC) codes

  • Software
  • Safety, Risk, Reliability and Quality
  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks'. Together they form a unique fingerprint.

Cite this