TY - JOUR
T1 - Flexible imputation toolkit for electronic health records
AU - Vafaei Sadr, Alireza
AU - Li, Jiang
AU - Hwang, Wenke
AU - Yeasin, Mohammed
AU - Wang, Ming
AU - Lehmann, Harold
AU - Zand, Ramin
AU - Abedi, Vida
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
N2 - Missing data in electronic health records (EHRs) poses a significant challenge for analysis. This study introduces Pympute, a comprehensive Python package designed for efficient and robust missing value imputation for EHRs. Pympute’s core algorithm, Flexible, intelligently selects the optimal imputation method for each variable based on its characteristics. Pympute offers a comprehensive suite of functionalities. It benchmarks the performance of ten existing machine learning imputation algorithms against Flexible on real-world EHR datasets containing laboratory measurements. Additionally, Pympute facilitates data simulation, generating realistic datasets mimicking real-world data distributions for controlled evaluation of imputation performance. Finally, Pympute investigates how missingness and skewness influence the selection of optimal imputation algorithms within the Flexible framework. Our findings validate that Pympute’s Flexible method significantly improves imputation performance compared to the single-model approach. Notably, simulating data solely based on covariance does not accurately reflect real-world selection behavior. Furthermore, skewness in the data distribution prompts Flexible to favor nonlinear imputation models. This study highlights the importance of considering data distribution patterns when selecting imputation algorithms. Pympute addresses this challenge by offering a versatile and user-friendly solution for diverse EHR data scenarios.
AB - Missing data in electronic health records (EHRs) poses a significant challenge for analysis. This study introduces Pympute, a comprehensive Python package designed for efficient and robust missing value imputation for EHRs. Pympute’s core algorithm, Flexible, intelligently selects the optimal imputation method for each variable based on its characteristics. Pympute offers a comprehensive suite of functionalities. It benchmarks the performance of ten existing machine learning imputation algorithms against Flexible on real-world EHR datasets containing laboratory measurements. Additionally, Pympute facilitates data simulation, generating realistic datasets mimicking real-world data distributions for controlled evaluation of imputation performance. Finally, Pympute investigates how missingness and skewness influence the selection of optimal imputation algorithms within the Flexible framework. Our findings validate that Pympute’s Flexible method significantly improves imputation performance compared to the single-model approach. Notably, simulating data solely based on covariance does not accurately reflect real-world selection behavior. Furthermore, skewness in the data distribution prompts Flexible to favor nonlinear imputation models. This study highlights the importance of considering data distribution patterns when selecting imputation algorithms. Pympute addresses this challenge by offering a versatile and user-friendly solution for diverse EHR data scenarios.
UR - https://www.scopus.com/pages/publications/105005419198
UR - https://www.scopus.com/inward/citedby.url?scp=105005419198&partnerID=8YFLogxK
U2 - 10.1038/s41598-025-02276-5
DO - 10.1038/s41598-025-02276-5
M3 - Article
C2 - 40382465
AN - SCOPUS:105005419198
SN - 2045-2322
VL - 15
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 17176
ER -