TY - GEN
T1 - Semantic-Preserving Adversarial Example Attack against BERT
AU - Gao, Chongyang
AU - Gu, Kang
AU - Vosoughi, Soroush
AU - Mehnaz, Shagufta
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Adversarial example attacks against textual data have been drawing increasing attention in both the natural language processing (NLP) and security domains. However, most existing attacks overlook the importance of semantic similarity and yield easily recognizable adversarial samples. As a result, the defense methods developed in response to these attacks remain vulnerable and could be evaded by advanced adversarial examples that maintain high semantic similarity with the original, non-adversarial text. Hence, this paper investigates the extent to which textual adversarial examples can maintain such high semantic similarity. We propose Reinforce attack, a reinforcement learning-based framework for generating adversarial text that preserves high semantic similarity with the original text. In particular, the attack process is controlled by a reward function rather than heuristics, as in previous methods, to encourage higher semantic similarity and lower query costs. Through automatic and human evaluations, we show that our generated adversarial texts preserve significantly higher semantic similarity than state-of-the-art attacks while achieving similar (and at times better) attack success rates, thus uncovering novel challenges for effective defenses.
UR - https://www.scopus.com/pages/publications/105000823688
UR - https://www.scopus.com/inward/citedby.url?scp=105000823688&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.trustnlp-1.17
DO - 10.18653/v1/2024.trustnlp-1.17
M3 - Conference contribution
AN - SCOPUS:105000823688
T3 - TrustNLP 2024 - 4th Workshop on Trustworthy Natural Language Processing, Proceedings of the Workshop
SP - 202
EP - 207
BT - TrustNLP 2024 - 4th Workshop on Trustworthy Natural Language Processing, Proceedings of the Workshop
A2 - Ovalle, Anaelia
A2 - Chang, Kai-Wei
A2 - Cao, Yang Trista
A2 - Mehrabi, Ninareh
A2 - Zhao, Jieyu
A2 - Galstyan, Aram
A2 - Dhamala, Jwala
A2 - Kumar, Anoop
A2 - Gupta, Rahul
PB - Association for Computational Linguistics (ACL)
T2 - 4th Workshop on Trustworthy Natural Language Processing, TrustNLP 2024
Y2 - 21 June 2024
ER -