Global-and-local aware data generation for the class imbalance problem

Wentao Wang, Suhang Wang, Wenqi Fan, Zitao Liu, Jiliang Tang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

14 Scopus citations

Abstract

In many real-world classification applications such as fake news detection, the training data can be extremely imbalanced, which brings challenges to existing classifiers as the majority classes dominate the loss functions of classifiers. Oversampling techniques such as SMOTE are effective approaches to tackle the class imbalance problem by producing more synthetic minority samples. Despite their success, the majority of existing oversampling methods only consider local data distributions when generating minority samples, which can result in noisy minority samples that do not fit global data distributions or interleave with majority classes. Hence, in this paper, we study the class imbalance problem by simultaneously exploring local and global data information since: (i) the local data distribution could give detailed information for generating minority samples; and (ii) the global data distribution could provide guidance to avoid generating outliers or samples that interleave with majority classes. Specifically, we propose a novel framework GL-GAN, which leverages the SMOTE method to explore local distribution in a learned latent space and employs GAN to capture the global information, so that synthetic minority samples can be generated under even extremely imbalanced scenarios. Experimental results on diverse real data sets demonstrate the effectiveness of our GL-GAN framework in producing realistic and discriminative minority samples for improving the classification performance of various classifiers on imbalanced training data. Our code is available at https://github.com/wentao-repo/GL-GAN.
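The local interpolation step that the abstract attributes to SMOTE can be illustrated with a minimal sketch. This is not the GL-GAN implementation: it interpolates directly in the input space, whereas the abstract describes applying SMOTE in a learned latent space and using a GAN for global information. The function name `smote_like_oversample` and its parameters `n_new`, `k`, and `seed` are hypothetical, chosen only for illustration.

```python
import numpy as np


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a sampled
    minority point and one of its k nearest minority neighbors.

    Illustrative SMOTE-style sketch only; not the GL-GAN method.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of every point.
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a base point and one of its neighbors for each synthetic sample.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[picked] - minority[base])


if __name__ == "__main__":
    minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    synthetic = smote_like_oversample(minority_points, n_new=5, k=2)
    print(synthetic.shape)
```

Because each synthetic point depends only on a sample and its nearest neighbors, nothing in this sketch prevents new points from landing in regions occupied by majority classes, which is the limitation of purely local oversampling that the abstract raises.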

Original language: English (US)
Title of host publication: Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020
Editors: Carlotta Demeniconi, Nitesh Chawla
Publisher: Society for Industrial and Applied Mathematics Publications
Pages: 307-315
Number of pages: 9
ISBN (Electronic): 9781611976236
State: Published - 2020
Event: 2020 SIAM International Conference on Data Mining, SDM 2020 - Cincinnati, United States
Duration: May 7 2020 → May 9 2020

Publication series

Name: Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020

Conference

Conference: 2020 SIAM International Conference on Data Mining, SDM 2020
Country/Territory: United States
City: Cincinnati
Period: 5/7/20 → 5/9/20

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Software
