TY - JOUR
T1 - Classification of helical polymers with deep-learning language models
AU - Li, Daoyi
AU - Jiang, Wen
N1 - Publisher Copyright: © 2023 Elsevier Inc.
PY - 2023/12
Y1 - 2023/12
AB - Many macromolecules in biological systems exist in the form of helical polymers. However, the inherent polymorphism and heterogeneity of samples complicate the reconstruction of helical polymers from cryo-EM images. Currently available 2D classification methods are effective at separating particles of interest from contaminants, but they do not effectively differentiate between polymorphs, resulting in heterogeneity in the 2D classes. As such, it is crucial to develop a method that can computationally divide a dataset of polymorphic helical structures into homogeneous subsets. In this work, we utilized deep-learning language models to embed the filaments as vectors in hyperspace and group them into clusters. Tests with both simulated and experimental datasets have demonstrated that our method, HLM (Helical classification with Language Model), can effectively distinguish different types of filaments in the presence of many contaminants and low signal-to-noise ratios. We also demonstrate that HLM can isolate homogeneous subsets of particles from a publicly available dataset, resulting in the discovery of a previously unreported filament variant with an extra density around the tau filaments.
UR - http://www.scopus.com/inward/record.url?scp=85176573990&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85176573990&partnerID=8YFLogxK
U2 - 10.1016/j.jsb.2023.108041
DO - 10.1016/j.jsb.2023.108041
M3 - Article
C2 - 37939748
AN - SCOPUS:85176573990
SN - 1047-8477
VL - 215
JO - Journal of Structural Biology
JF - Journal of Structural Biology
IS - 4
M1 - 108041
ER -