TY - JOUR
T1 - Robust Variable Selection with Optimality Guarantees for High-Dimensional Logistic Regression
AU - Insolia, Luca
AU - Kenney, Ana
AU - Calovi, Martina
AU - Chiaromonte, Francesca
N1 - Publisher Copyright: © 2021 by the authors.
PY - 2021/9
Y1 - 2021/9
N2 - High-dimensional classification studies have become widespread across various domains. The large dimensionality, coupled with the possible presence of data contamination, motivates the use of robust, sparse estimation methods to improve model interpretability and ensure the majority of observations agree with the underlying parametric model. In this study, we propose a robust and sparse estimator for logistic regression models, which simultaneously tackles the presence of outliers and/or irrelevant features. Specifically, we propose the use of L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem in a framework that allows one to pursue optimality guarantees. We use our proposal to investigate the main drivers of honey bee (Apis mellifera) loss through the annual winter loss survey data collected by the Pennsylvania State Beekeepers Association. Previous studies mainly focused on predictive performance; however, our approach produces a more interpretable classification model and provides evidence for several outlying observations within the survey data. We compare our proposal with existing heuristic methods and non-robust procedures, demonstrating its effectiveness. In addition to the application to honey bee loss, we present a simulation study where our proposal outperforms other methods across most performance measures and settings.
UR - http://www.scopus.com/inward/record.url?scp=85128213960&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85128213960&partnerID=8YFLogxK
U2 - 10.3390/stats4030040
DO - 10.3390/stats4030040
M3 - Article
AN - SCOPUS:85128213960
SN - 2571-905X
VL - 4
SP - 665
EP - 681
JO - Stats
JF - Stats
IS - 3
ER -