TY - JOUR
T1 - Integrating electronic health records and GWAS summary statistics to predict the progression of autoimmune diseases from preclinical stages
AU - Wang, Chen
AU - Markus, Havell
AU - Diwadkar, Avantika R.
AU - Khunsriraksakul, Chachrit
AU - Carrel, Laura
AU - Li, Bingshan
AU - Zhong, Xue
AU - Wang, Xingyan
AU - Zhan, Xiaowei
AU - Foulke, Galen T.
AU - Olsen, Nancy
AU - Liu, Dajiang
AU - Jiang, Bibo
N1 - Publisher Copyright: © The Author(s) 2024.
PY - 2025/12
Y1 - 2025/12
N2 - Autoimmune diseases often exhibit a preclinical stage before diagnosis. Electronic health record (EHR)-based biobanks contain genetic data and diagnostic information, which can identify preclinical individuals at risk for progression. Biobanks typically have small numbers of cases, which are not sufficient to construct accurate polygenic risk scores (PRS). Importantly, progression and case-control phenotypes may share a genetic basis, which we can exploit to improve prediction accuracy. We propose a novel method, Genetic Progression Score (GPS), that integrates biobank and case-control studies to predict disease progression risk. Via penalized regression, GPS incorporates PRS weights for case-control studies as a prior and forces model parameters to be similar to the prior if the prior improves prediction accuracy. In simulations, GPS consistently yields better prediction accuracy than alternative strategies relying on biobank or case-control samples only and those combining biobank and case-control samples. The improvement is particularly evident when the biobank sample is smaller or the genetic correlation is lower. We derive PRS for the progression from preclinical rheumatoid arthritis and systemic lupus erythematosus in the BioVU biobank and validate them in All of Us. For both diseases, GPS achieves the highest prediction R² and the resulting PRS yields the strongest correlation with progression prevalence.
AB - Autoimmune diseases often exhibit a preclinical stage before diagnosis. Electronic health record (EHR)-based biobanks contain genetic data and diagnostic information, which can identify preclinical individuals at risk for progression. Biobanks typically have small numbers of cases, which are not sufficient to construct accurate polygenic risk scores (PRS). Importantly, progression and case-control phenotypes may share a genetic basis, which we can exploit to improve prediction accuracy. We propose a novel method, Genetic Progression Score (GPS), that integrates biobank and case-control studies to predict disease progression risk. Via penalized regression, GPS incorporates PRS weights for case-control studies as a prior and forces model parameters to be similar to the prior if the prior improves prediction accuracy. In simulations, GPS consistently yields better prediction accuracy than alternative strategies relying on biobank or case-control samples only and those combining biobank and case-control samples. The improvement is particularly evident when the biobank sample is smaller or the genetic correlation is lower. We derive PRS for the progression from preclinical rheumatoid arthritis and systemic lupus erythematosus in the BioVU biobank and validate them in All of Us. For both diseases, GPS achieves the highest prediction R² and the resulting PRS yields the strongest correlation with progression prevalence.
UR - http://www.scopus.com/inward/record.url?scp=85214025177&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85214025177&partnerID=8YFLogxK
U2 - 10.1038/s41467-024-55636-6
DO - 10.1038/s41467-024-55636-6
M3 - Article
C2 - 39747168
AN - SCOPUS:85214025177
SN - 2041-1723
VL - 16
JO - Nature Communications
JF - Nature Communications
IS - 1
M1 - 180
ER -