TY - JOUR
T1 - Feature Screening for Interval-Valued Response with Application to Study Association between Posted Salary and Required Skills
AU - Zhong, Wei
AU - Qian, Chen
AU - Liu, Wanjun
AU - Zhu, Liping
AU - Li, Runze
N1 - Funding Information:
Zhong’s research was supported by the National Natural Science Foundation of China (NSFC) grants (11922117, 12231011, 71988101), National Key R&D Program of China 2022YFA10038002 and National Statistical Science Research Program of China (2022LD08). Zhu’s research was supported by NSFC (12225113 and 12171477) and Renmin University of China (22XNA026). Li’s research was supported by National Science Foundation (NSF) DMS-1820702, and NIH grants R01AI136664 and R01AI170249. The content is solely the responsibility of the authors and does not necessarily represent the official views of NSFC, NSF, or NIH. We thank the Editor, the Associate Editor, and two referees for their insightful comments which have substantially improved the article.
Publisher Copyright:
© 2023 American Statistical Association.
PY - 2023
Y1 - 2023
AB - Quantifying differences in returns to skills using online job advertisement data is important and has attracted great interest in both labor economics and statistics. In this article, we study the relationship between posted salary and job requirements in online labor markets. There are two challenges. First, the posted salary is always presented in interval-valued form, for example, 5k–10k yuan per month. Simply taking the midpoint or the lower bound as a surrogate for salary may result in biased estimators. Second, the number of potential skill words generated as predictors from the job advertisements by word segmentation is very large, and many of them may not contribute to the salary. To address these challenges, we propose a new feature screening method, Absolute Distribution Difference Sure Independence Screening (ADD-SIS), to select important skill words for the interval-valued response. The marginal utility for feature screening is based on the difference of estimated distribution functions obtained via nonparametric maximum likelihood estimation, which makes full use of the interval information. The method is model-free and robust to outliers. Numerical simulations show that the new method, which uses the interval information, selects important predictors more efficiently than methods based only on single points of the intervals. In the real data application, we study the text of job advertisements for data scientists and data analysts on a major online job posting website in China and explore the skill words that are important for salary. We find that skill words such as optimization, long short-term memory (LSTM), convolutional neural networks (CNN), and collaborative filtering are positively correlated with salary, while words such as Excel, Office, and data collection may contribute negatively to salary. Supplementary materials for this article are available online.
UR - http://www.scopus.com/inward/record.url?scp=85146222870&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85146222870&partnerID=8YFLogxK
U2 - 10.1080/01621459.2022.2152342
DO - 10.1080/01621459.2022.2152342
M3 - Article
C2 - 37448462
AN - SCOPUS:85146222870
SN - 0162-1459
VL - 118
SP - 805
EP - 817
JO - Journal of the American Statistical Association
JF - Journal of the American Statistical Association
IS - 542
ER -