TY - JOUR
T1 - Better Naive Bayes classification for high-precision spam detection
AU - Song, Yang
AU - Kołcz, Aleksander
AU - Giles, C. Lee
PY - 2009/8/10
Y1 - 2009/8/10
AB - Email spam has become a major problem for Internet users and providers. One major obstacle to its eradication is that the potential solutions need to ensure a very low false-positive rate, which tends to be difficult in practice. We address the problem of low-FPR classification in the context of naive Bayes, which represents one of the most popular machine learning models applied in the spam filtering domain. Drawing from the recent extensions, we propose a new term weight aggregation function, which leads to markedly better results than the standard alternatives. We identify short instances as ones with disproportionally poor performance and counter this behavior with a collaborative filtering-based feature augmentation. Finally, we propose a tree-based classifier cascade for which decision thresholds of the leaf nodes are jointly optimized for the best overall performance. These improvements, both individually and in aggregate, lead to substantially better detection rate of precision when compared with some of the best variants of naive Bayes proposed to date.
UR - http://www.scopus.com/inward/record.url?scp=67650834914&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=67650834914&partnerID=8YFLogxK
U2 - 10.1002/spe.925
DO - 10.1002/spe.925
M3 - Article
AN - SCOPUS:67650834914
SN - 0038-0644
VL - 39
SP - 1003
EP - 1024
JO - Software - Practice and Experience
JF - Software - Practice and Experience
IS - 11
ER -