TY - GEN
T1 - Perturbations in the Wild
T2 - Findings of the Association for Computational Linguistics: ACL 2022
AU - Le, Thai
AU - Lee, Jooyoung
AU - Yen, Kevin
AU - Hu, Yifan
AU - Lee, Dongwon
N1 - Publisher Copyright: © 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
N2 - We propose a novel algorithm, ANTHRO, that inductively extracts over 600K human-written text perturbations in the wild and leverages them for realistic adversarial attacks. Unlike existing character-based attacks, which often deductively hypothesize a set of manipulation strategies, our work is grounded in actual observations from real-world texts. We find that adversarial texts generated by ANTHRO achieve the best trade-off between (1) attack success rate, (2) semantic preservation of the original text, and (3) stealthiness, i.e., being indistinguishable from human writing and hence harder to flag as suspicious. Specifically, our attacks achieved attack success rates of around 83% and 91% on BERT and RoBERTa, respectively. Moreover, they outperformed the TextBugger baseline with an increase of 50% and 40% in semantic preservation and stealthiness when evaluated by both layperson and professional human workers. ANTHRO can further enhance a BERT classifier's performance in understanding different variations of human-written toxic texts via adversarial training when compared to the Perspective API. Source code will be published at github.com/lethaiq/perturbations-in-the-wild.
AB - We propose a novel algorithm, ANTHRO, that inductively extracts over 600K human-written text perturbations in the wild and leverages them for realistic adversarial attacks. Unlike existing character-based attacks, which often deductively hypothesize a set of manipulation strategies, our work is grounded in actual observations from real-world texts. We find that adversarial texts generated by ANTHRO achieve the best trade-off between (1) attack success rate, (2) semantic preservation of the original text, and (3) stealthiness, i.e., being indistinguishable from human writing and hence harder to flag as suspicious. Specifically, our attacks achieved attack success rates of around 83% and 91% on BERT and RoBERTa, respectively. Moreover, they outperformed the TextBugger baseline with an increase of 50% and 40% in semantic preservation and stealthiness when evaluated by both layperson and professional human workers. ANTHRO can further enhance a BERT classifier's performance in understanding different variations of human-written toxic texts via adversarial training when compared to the Perspective API. Source code will be published at github.com/lethaiq/perturbations-in-the-wild.
UR - http://www.scopus.com/inward/record.url?scp=85141487259&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141487259&partnerID=8YFLogxK
U2 - 10.18653/v1/2022.findings-acl.232
DO - 10.18653/v1/2022.findings-acl.232
M3 - Conference contribution
AN - SCOPUS:85141487259
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 2953
EP - 2965
BT - ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Findings of ACL 2022
A2 - Muresan, Smaranda
A2 - Nakov, Preslav
A2 - Villavicencio, Aline
PB - Association for Computational Linguistics (ACL)
Y2 - 22 May 2022 through 27 May 2022
ER -