TY - JOUR
T1 - Distributed feature screening via componentwise debiasing
AU - Li, Xingxiang
AU - Li, Runze
AU - Xia, Zhiming
AU - Xu, Chen
N1 - Funding Information:
Xu’s research was supported by NSERC grant RGPIN-2016-05024 and NSFC grant 11690014. R. Li’s research was supported by NSF grant DMS 1820702 and NIDA, NIH grant P50 DA039838. Xia’s research was supported by NSFC grant 11771353. X. Li’s research was supported by JSPSP grant 2019SJA2093. The content is solely the responsibility of the authors and does not necessarily represent the official views of the aforementioned funding agencies.
Publisher Copyright:
© 2020 Xingxiang Li, Runze Li, Zhiming Xia, and Chen Xu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-537.html.
PY - 2020/2/1
Y1 - 2020/2/1
N2 - Feature screening is a powerful tool for processing high-dimensional data. When the sample size N and the number of features p are both large, implementing classic screening methods can be numerically challenging. In this paper, we propose a distributed screening framework for the big-data setting. In the spirit of “divide-and-conquer”, the proposed framework expresses a correlation measure as a function of several component parameters, each of which can be distributively estimated using a natural U-statistic from data segments. With the component estimates aggregated, we obtain a final correlation estimate that can be readily used for screening features. This framework enables distributed storage and parallel computing and is thus computationally attractive. Because the component parameters are estimated distributively without bias, the final aggregated estimate achieves high accuracy that is insensitive to the number of data segments m. Under mild conditions, we show that the aggregated correlation estimator is as efficient as the centralized estimator in terms of the probability convergence bound and the mean squared error rate; the corresponding screening procedure enjoys the sure screening property for a wide range of correlation measures. The promising performance of the new method is supported by extensive numerical examples.
UR - http://www.scopus.com/inward/record.url?scp=85086804354&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85086804354&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:85086804354
SN - 1532-4435
VL - 21
JO - Journal of Machine Learning Research
JF - Journal of Machine Learning Research
ER -