TY - CPAPER
T1 - Measuring and modeling the label dynamics of online anti-malware engines
AU - Zhu, Shuofei
AU - Shi, Jianjun
AU - Yang, Limin
AU - Qin, Boqin
AU - Zhang, Ziyi
AU - Song, Linhai
AU - Wang, Gang
N1 - Publisher Copyright:
© 2020 by The USENIX Association. All Rights Reserved.
PY - 2020
Y1 - 2020
N2 - VirusTotal provides malware labels from a large set of anti-malware engines, and is heavily used by researchers for malware annotation and system evaluation. Since different engines often disagree with each other, researchers have used various methods to aggregate their labels. In this paper, we take a data-driven approach to categorize, reason, and validate common labeling methods used by researchers. We first survey 115 academic papers that use VirusTotal, and identify common methodologies. Then we collect the daily snapshots of VirusTotal labels for more than 14,000 files (including a subset of manually verified ground-truth) from 65 VirusTotal engines over a year. Our analysis validates the benefits of threshold-based label aggregation in stabilizing files' labels, and also points out the impact of poorly-chosen thresholds. We show that hand-picked “trusted” engines do not always perform well, and certain groups of engines are strongly correlated and should not be treated independently. Finally, we empirically show certain engines fail to perform in-depth analysis on submitted files and can easily produce false positives. Based on our findings, we offer suggestions for future usage of VirusTotal for data annotation.
UR - http://www.scopus.com/inward/record.url?scp=85091937990&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85091937990&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85091937990
T3 - Proceedings of the 29th USENIX Security Symposium
SP - 2361
EP - 2378
BT - Proceedings of the 29th USENIX Security Symposium
PB - USENIX Association
T2 - 29th USENIX Security Symposium
Y2 - 12 August 2020 through 14 August 2020
ER -