Abstract
Promoters drive gene expression and help regulate cellular responses to the environment. Recent machine learning models predict a bacterial promoter's transcriptional initiation rate, but they rely on expert-labeled sequence elements drawn from a defined set of DNA building blocks. Their generalizability is therefore limited by the need to label the specific components studied, and current models have not been used to predict the transcriptional initiation rates of promoters with arbitrary nucleotide sequences. Generalizable models could greatly facilitate the design of synthetic genetic circuits with well-controlled transcription rates in bacteria. To address these limitations, we used a convolutional neural network (CNN) to predict a promoter's transcriptional initiation rate directly from its DNA nucleotide sequence. We first evaluated the model on a published promoter component dataset. Trained using only the sequence as input, our model fits held-out test data with R² = 0.90, comparable to published models that fit expert-labeled sequence elements. We then produced a new promoter strength dataset of non-repetitive promoters with high sequence variation, not limited to combinations of discrete expert-labeled components. Our CNN trained on this more varied dataset fits held-out promoter strength with R² = 0.61. Previously published models are intractable on a dataset like this, with its highly diverse inputs. The CNN outperforms classical baselines such as LASSO on a bag-of-words representation of promoter sequence elements (R² = 0.42). We applied recent machine learning approaches to quantify the contribution of individual nucleotides to the CNN's promoter strength prediction. Learning directly from DNA sequence, our model identified the consensus -35 and -10 hexamer regions, as well as the discriminator element, as key contributors to σ⁷⁰ promoter strength. It also replicated the finding that a perfect consensus sequence match does not yield the strongest promoter. The model's ability to independently learn biologically relevant information directly from sequence, while performing similarly to or better than classical methods, makes it appealing for further prediction optimization and research into generalizability. This approach may be useful for synthetic promoter design, as well as for sequence feature identification.
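To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of three pieces: a sequence-only input encoding, a bag-of-words view of a promoter, and the R² score. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how sequences are encoded, which "words" the LASSO baseline uses, or how R² was computed, so one-hot encoding, overlapping k-mer counts, and the coefficient-of-determination formula are all assumptions here, and the function names `one_hot`, `kmer_counts`, and `r_squared` are hypothetical.

```python
from collections import Counter

BASES = "ACGT"  # column order of the one-hot encoding (an assumption)


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    rows = []
    for base in seq.upper():
        # Symbols outside A/C/G/T (for example N) become an all-zero row.
        rows.append([1.0 if base == b else 0.0 for b in BASES])
    return rows


def kmer_counts(seq, k=6):
    """Count overlapping k-mers, a simple bag-of-words view of the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy promoter: consensus -35 (TTGACA) and -10 (TATAAT) hexamers joined
    # by a made-up 17 bp spacer; not a sequence from the paper's datasets.
    promoter = "TTGACA" + "A" * 17 + "TATAAT"
    print(len(one_hot(promoter)), "positions x", len(BASES), "channels")
    print(kmer_counts(promoter)["TATAAT"], "occurrence(s) of TATAAT")
    print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

The one-hot matrix is the kind of input a CNN can consume without any expert labeling, whereas the k-mer counts discard position information, which is one plausible reason a bag-of-words baseline could lag on highly varied promoters.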
Original language | English (US) |
---|---|
Pages (from-to) | 163-172 |
Number of pages | 10 |
Journal | EPiC Series in Computing |
Volume | 70 |
State | Published - Mar 11 2020 |
Event | 12th International Conference on Bioinformatics and Computational Biology, BICOB 2020 - San Francisco, United States; Duration: Mar 23 2020 → Mar 25 2020 |
All Science Journal Classification (ASJC) codes
- General Computer Science