Trojaning language models for fun and profit

Xinyang Zhang, Zheng Zhang, Shouling Ji, Ting Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

36 Scopus citations


Recent years have witnessed the emergence of a new paradigm of building natural language processing (NLP) systems: general-purpose, pre-trained language models (LMs) are composed with simple downstream models and fine-tuned for a variety of NLP tasks. This paradigm shift significantly simplifies the system development cycle. However, as many LMs are provided by untrusted third parties, their lack of standardization or regulation entails profound security implications, which are largely unexplored. To bridge this gap, this work studies the security threats posed by malicious LMs to NLP systems. Specifically, we present TrojanLM, a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction in a highly predictable manner. By empirically studying three state-of-the-art LMs (BERT, GPT-2, XLNet) in a range of security-critical NLP tasks (toxic comment detection, question answering, text completion), as well as through user studies on crowdsourcing platforms, we demonstrate that TrojanLM possesses the following properties: (i) flexibility: the adversary is able to define logical combinations (e.g., 'and', 'or', 'xor') of arbitrary words as triggers; (ii) efficacy: the host systems misbehave as desired by the adversary with high probability when trigger-embedded inputs are present; (iii) specificity: the trojan LMs function indistinguishably from their benign counterparts on clean inputs; and (iv) fluency: the trigger-embedded inputs appear as fluent natural language and are highly relevant to their surrounding contexts. We provide analytical justification for the practicality of TrojanLM, and further discuss potential countermeasures and their challenges, which lead to several promising research directions.

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE European Symposium on Security and Privacy, Euro S and P 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 19
ISBN (Electronic): 9781665414913
State: Published - Sep 2021
Event: 6th IEEE European Symposium on Security and Privacy, Euro S and P 2021 - Virtual, Online, Austria
Duration: Sep 6 2021 → Sep 10 2021

Publication series

Name: Proceedings - 2021 IEEE European Symposium on Security and Privacy, Euro S and P 2021


Conference: 6th IEEE European Symposium on Security and Privacy, Euro S and P 2021
City: Virtual, Online

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality

