TY - GEN
T1 - Rare disease prediction by generating quality-assured electronic health records
AU - Ma, Fenglong
AU - Wang, Yaqing
AU - Gao, Jing
AU - Xiao, Houping
AU - Zhou, Jing
N1 - Publisher Copyright: © 2020 by SIAM.
PY - 2020
Y1 - 2020
AB - Predicting diseases for patients is an important and practical task in healthcare informatics. Existing disease prediction models focus on common diseases, i.e., those for which enough EHR data and prior medical knowledge are available for analysis. However, those models may not work for rare disease prediction, as it is extremely hard to collect enough EHR data from patients with such diseases. To tackle this issue, we design a novel rare disease prediction system, which not only generates EHR data but also automatically selects high-quality generated data to further improve predictive performance. The system has three components: data generation, data selection, and prediction. In particular, we propose MaskEHR to generate diverse EHR data based on the data from patients suffering from the given diseases. To remove noisy information in the generated EHR data, we further design a reinforcement learning-based data selector, called RL-Selector, which automatically chooses the high-quality generated EHR data. Finally, the prediction component is used to identify patients who will potentially suffer from the given diseases. These three components work together and enhance each other. Experiments on three real healthcare datasets show that the proposed system outperforms existing approaches on the rare disease prediction task.
UR - http://www.scopus.com/inward/record.url?scp=85089184370&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089184370&partnerID=8YFLogxK
U2 - 10.1137/1.9781611976236.58
DO - 10.1137/1.9781611976236.58
M3 - Conference contribution
AN - SCOPUS:85089184370
T3 - Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020
SP - 514
EP - 522
BT - Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020
A2 - Demeniconi, Carlotta
A2 - Chawla, Nitesh
PB - Society for Industrial and Applied Mathematics Publications
T2 - 2020 SIAM International Conference on Data Mining, SDM 2020
Y2 - 7 May 2020 through 9 May 2020
ER -