TY - JOUR
T1 - Reverse engineering imperceptible backdoor attacks on deep neural networks for detection and training set cleansing
AU - Xiang, Zhen
AU - Miller, David J.
AU - Kesidis, George
N1 - Publisher Copyright:
© 2021 Elsevier Ltd
PY - 2021/7
Y1 - 2021/7
N2 - Backdoor data poisoning (a.k.a. Trojan attack) is an emerging form of adversarial attack, usually against deep neural network image classifiers. The attacker poisons the training set with a relatively small set of images from one (or several) source class(es), embedded with a backdoor pattern and labeled to a target class. For a successful attack, during operation the trained classifier will: 1) misclassify a test image from the source class(es) to the target class whenever the backdoor pattern is present; and 2) maintain high classification accuracy on backdoor-free test images. In this paper, we make a breakthrough in defending against backdoor attacks with imperceptible backdoor patterns (e.g., watermarks) before/during the classifier training phase. This is a challenging problem because it is a priori unknown which subset (if any) of the training set has been poisoned. We propose an optimization-based reverse-engineering defense that jointly: 1) detects whether the training set is poisoned; 2) if so, accurately identifies the target class and the training images with the backdoor pattern embedded; and 3) additionally, reverse engineers an estimate of the backdoor pattern used by the attacker. In benchmark experiments on CIFAR-10 (as well as four other data sets), considering a variety of attacks, our defense achieves a new state of the art, reducing the attack success rate to no more than 4.9% after the detected suspicious training images are removed.
AB - Backdoor data poisoning (a.k.a. Trojan attack) is an emerging form of adversarial attack, usually against deep neural network image classifiers. The attacker poisons the training set with a relatively small set of images from one (or several) source class(es), embedded with a backdoor pattern and labeled to a target class. For a successful attack, during operation the trained classifier will: 1) misclassify a test image from the source class(es) to the target class whenever the backdoor pattern is present; and 2) maintain high classification accuracy on backdoor-free test images. In this paper, we make a breakthrough in defending against backdoor attacks with imperceptible backdoor patterns (e.g., watermarks) before/during the classifier training phase. This is a challenging problem because it is a priori unknown which subset (if any) of the training set has been poisoned. We propose an optimization-based reverse-engineering defense that jointly: 1) detects whether the training set is poisoned; 2) if so, accurately identifies the target class and the training images with the backdoor pattern embedded; and 3) additionally, reverse engineers an estimate of the backdoor pattern used by the attacker. In benchmark experiments on CIFAR-10 (as well as four other data sets), considering a variety of attacks, our defense achieves a new state of the art, reducing the attack success rate to no more than 4.9% after the detected suspicious training images are removed.
UR - http://www.scopus.com/inward/record.url?scp=85105037298&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85105037298&partnerID=8YFLogxK
U2 - 10.1016/j.cose.2021.102280
DO - 10.1016/j.cose.2021.102280
M3 - Article
AN - SCOPUS:85105037298
SN - 0167-4048
VL - 106
JO - Computers & Security
JF - Computers & Security
M1 - 102280
ER -
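
Illustrative sketch (appended after the record; not part of the paper or the RIS export): a minimal NumPy example of the kind of training-set poisoning the abstract describes, in which an imperceptible additive backdoor pattern is embedded into a small subset of source-class images that are then relabeled to the target class. The function name, pattern shape, amplitude, and poisoning fraction below are assumptions chosen for illustration, not the authors' attack configuration or code.

import numpy as np

def poison_training_set(images, labels, source_class, target_class,
                        pattern, poison_fraction=0.05, seed=0):
    # Hypothetical helper, for illustration only: embed an imperceptible
    # additive backdoor pattern into a small fraction of source-class images
    # and relabel them to the attacker's target class.
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()

    # Choose a small subset of source-class images to poison.
    source_idx = np.flatnonzero(labels == source_class)
    n_poison = int(poison_fraction * len(source_idx))
    poison_idx = rng.choice(source_idx, size=n_poison, replace=False)

    # Additive embedding, clipped to the valid pixel range [0, 1];
    # a small maximum amplitude keeps the pattern visually imperceptible.
    images[poison_idx] = np.clip(images[poison_idx] + pattern, 0.0, 1.0)

    # Mislabel the poisoned images to the target class.
    labels[poison_idx] = target_class
    return images, labels, poison_idx

if __name__ == "__main__":
    # Placeholder CIFAR-10-like data in [0, 1]; a faint +/- 2/255 pattern.
    X = np.random.rand(50000, 32, 32, 3).astype(np.float32)
    y = np.random.randint(0, 10, size=50000)
    pattern = (2.0 / 255.0) * np.sign(np.random.randn(32, 32, 3)).astype(np.float32)
    Xp, yp, idx = poison_training_set(X, y, source_class=3, target_class=5, pattern=pattern)
    print("poisoned", len(idx), "images from class 3, relabeled to class 5")

The defense described in the abstract works in the opposite direction: given only the (possibly poisoned) training set, and without access to pattern, poison_idx, or the source/target classes, it reverse engineers an estimate of the backdoor pattern by optimization and uses it to decide whether the training set is poisoned, which class is the target, and which training images to remove.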