TY - GEN
T1 - Global-and-local aware data generation for the class imbalance problem
AU - Wang, Wentao
AU - Wang, Suhang
AU - Fan, Wenqi
AU - Liu, Zitao
AU - Tang, Jiliang
N1 - Funding Information:
Wentao Wang and Jiliang Tang are supported by the National Science Foundation (NSF) under grant numbers IIS1907704, IIS1714741, IIS1715940 and IIS1845081. Suhang Wang is supported by the National Science Foundation (NSF) under grant number IIS1909702.
Publisher Copyright:
Copyright © 2020 by SIAM.
PY - 2020
Y1 - 2020
N2 - In many real-world classification applications such as fake news detection, the training data can be extremely imbalanced, which brings challenges to existing classifiers as the majority classes dominate the loss functions of classifiers. Oversampling techniques such as SMOTE are effective approaches to tackle the class imbalance problem by producing more synthetic minority samples. Despite their success, the majority of existing oversampling methods only consider local data distributions when generating minority samples, which can result in noisy minority samples that do not fit global data distributions or interleave with majority classes. Hence, in this paper, we study the class imbalance problem by simultaneously exploring local and global data information since: (i) the local data distribution could give detailed information for generating minority samples; and (ii) the global data distribution could provide guidance to avoid generating outliers or samples that interleave with majority classes. Specifically, we propose a novel framework GL-GAN, which leverages the SMOTE method to explore local distribution in a learned latent space and employs GAN to capture the global information, so that synthetic minority samples can be generated under even extremely imbalanced scenarios. Experimental results on diverse real data sets demonstrate the effectiveness of our GL-GAN framework in producing realistic and discriminative minority samples for improving the classification performance of various classifiers on imbalanced training data. Our code is available at https://github.com/wentao-repo/GL-GAN.
UR - http://www.scopus.com/inward/record.url?scp=85089194568&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089194568&partnerID=8YFLogxK
U2 - 10.1137/1.9781611976236.35
DO - 10.1137/1.9781611976236.35
M3 - Conference contribution
AN - SCOPUS:85089194568
T3 - Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020
SP - 307
EP - 315
BT - Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020
A2 - Demeniconi, Carlotta
A2 - Chawla, Nitesh
PB - Society for Industrial and Applied Mathematics Publications
T2 - 2020 SIAM International Conference on Data Mining, SDM 2020
Y2 - 7 May 2020 through 9 May 2020
ER -