From Grim Reality to Practical Solution: Malware Classification in Real-World Noise

Xian Wu, Wenbo Guo, Jia Yan, Baris Coskun, Xinyu Xing

Research output: Chapter in Book/Report/Conference proceedingConference contribution

1 Scopus citations

Abstract

Malware datasets inevitably contain incorrect labels due to the shortage of expertise and experience needed for sample labeling. Previous research demonstrated that a training dataset with incorrectly labeled samples would result in inaccurate model learning. To address this problem, researchers have proposed various noise learning methods to offset the impact of incorrectly labeled samples, and in image recognition and text mining applications, these methods demonstrated great success. In this work, we apply both representative and state-of-the-art noise learning methods to real-world malware classification tasks. We surprisingly observe that none of the existing methods could minimize incorrect labels' impact. Through a carefully designed experiment, we discover that the inefficacy mainly results from extreme data imbalance and the high percentage of incorrectly labeled data samples. As such, we further propose a new noise learning method and name it after MORSE. Unlike existing methods, MORSE customizes and extends a state-of-the-art semi-supervised learning technique. It takes possibly incorrectly labeled data as unlabeled data and thus avoids their potential negative impact on model learning. In MORSE, we also integrate a sample re-weighting method that balances the training data usage in the model learning and thus handles the data imbalance challenge. We evaluate MORSE on both our synthesized and real-world datasets. We show that MORSE could significantly outperform existing noise learning methods and minimize the impact of incorrectly labeled data.

Original languageEnglish (US)
Title of host publicationProceedings - 44th IEEE Symposium on Security and Privacy, SP 2023
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages2602-2619
Number of pages18
ISBN (Electronic)9781665493369
DOIs
StatePublished - 2023
Event44th IEEE Symposium on Security and Privacy, SP 2023 - Hybrid, San Francisco, United States
Duration: May 22 2023May 25 2023

Publication series

NameProceedings - IEEE Symposium on Security and Privacy
Volume2023-May
ISSN (Print)1081-6011

Conference

Conference44th IEEE Symposium on Security and Privacy, SP 2023
Country/TerritoryUnited States
CityHybrid, San Francisco
Period5/22/235/25/23

All Science Journal Classification (ASJC) codes

  • Safety, Risk, Reliability and Quality
  • Software
  • Computer Networks and Communications

Cite this