TY - GEN
T1 - OpenResume
T2 - 2024 IEEE International Conference on Big Data, BigData 2024
AU - Yamashita, Michiharu
AU - Tran, Thanh
AU - Lee, Dongwon
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Despite substantial advancements in various fields of AI, computational research in career and job domains has been significantly hindered by a critical lack of accessible datasets. This limitation is mainly due to the proprietary nature of job platforms, which restrict the sharing of job-domain datasets with the research community. The scarcity is particularly pronounced for career trajectory and resume datasets, severely constraining academic researchers in developing and evaluating new models. In this paper, we address the crucial issue of resume dataset unavailability in the job domain, identified through our comprehensive comparison of existing job-domain machine learning studies. We introduce OpenResume, which is, to the best of our knowledge, the first publicly available, anonymized, and structured resume dataset designed specifically for job-domain downstream tasks. This dataset aims to catalyze advancements in AI and foster new markets for machine learning and data science within career trajectory modeling. OpenResume is comprehensively processed from real-world resume data. We anonymize and substitute personal identifiers and company names, normalize job titles into ESCO-based titles (ESCO being one of the most common occupation taxonomies), and employ differential privacy techniques on temporal features to ensure open accessibility and privacy protection. Additionally, we augment OpenResume with a synthetically generated resume dataset derived from the post-processed real-world data, extending its diversity and utility. To demonstrate that OpenResume retains challenges and properties similar to real-world job datasets, we benchmark OpenResume on state-of-the-art job-domain prediction models across four prevalent downstream tasks: (1) next job title prediction, (2) next company prediction, (3) turnover prediction, and (4) link prediction. Our experimental results show that these job-domain models perform comparably on OpenResume and the original data across all tasks, demonstrating that OpenResume is a valuable career trajectory dataset for both academic research and practical applications. We also indicate OpenResume's applicability to eight other downstream tasks. Our datasets are available at: https://tinyurl.com/OpenResumeData.
UR - https://www.scopus.com/pages/publications/85215485014
UR - https://www.scopus.com/pages/publications/85215485014#tab=citedBy
U2 - 10.1109/BigData62323.2024.10825519
DO - 10.1109/BigData62323.2024.10825519
M3 - Conference contribution
AN - SCOPUS:85215485014
T3 - Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024
SP - 6697
EP - 6706
BT - Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024
A2 - Ding, Wei
A2 - Lu, Chang-Tien
A2 - Wang, Fusheng
A2 - Di, Liping
A2 - Wu, Kesheng
A2 - Huan, Jun
A2 - Nambiar, Raghu
A2 - Li, Jundong
A2 - Ilievski, Filip
A2 - Baeza-Yates, Ricardo
A2 - Hu, Xiaohua
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 15 December 2024 through 18 December 2024
ER -