TY - JOUR
T1 - Synthesis Success Calculator
T2 - Predicting the Rapid Synthesis of DNA Fragments with Machine Learning
AU - Halper, Sean M.
AU - Hossain, Ayaan
AU - Salis, Howard M.
N1 - Publisher Copyright: Copyright © 2020 American Chemical Society.
PY - 2020/7/17
Y1 - 2020/7/17
N2 - The synthesis and assembly of long DNA fragments has greatly accelerated synthetic biology and biotechnology research. However, long turnaround times or synthesis failures create unpredictable bottlenecks in the design-build-test-learn cycle. We developed a machine learning model, called the Synthesis Success Calculator, to predict whether a long DNA fragment can be readily synthesized with a short turnaround time. The model also identifies the sequence determinants associated with the synthesis outcome. We trained a random forest classifier using biophysical features and a compiled data set of 1076 DNA fragment sequences to achieve high predictive performance (F1 score of 0.928 on 251 unseen sequences). Feature importance analysis revealed that repetitive DNA sequences were the most important contributor to synthesis failures. We then applied the Synthesis Success Calculator across large sequence data sets and found that 84.9% of the Escherichia coli MG1655 genome, but only 34.4% of sampled plasmids in NCBI, could be readily synthesized. Overall, the Synthesis Success Calculator can be applied on its own to prevent synthesis failures or embedded within optimization algorithms to design large genetic systems that can be rapidly synthesized and assembled.
UR - http://www.scopus.com/inward/record.url?scp=85088270223&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85088270223&partnerID=8YFLogxK
U2 - 10.1021/acssynbio.9b00460
DO - 10.1021/acssynbio.9b00460
M3 - Article
C2 - 32559378
AN - SCOPUS:85088270223
SN - 2161-5063
VL - 9
SP - 1563
EP - 1571
JO - ACS Synthetic Biology
JF - ACS Synthetic Biology
IS - 7
ER -