Use of machine learning algorithms to classify binary protein sequences as highly-designable or poorly-designable

Myron Peto, Andrzej Kloczkowski, Vasant Honavar, Robert L. Jernigan

Research output: Contribution to journal › Article › peer-review


Background: By using a standard Support Vector Machine (SVM) trained with the Sequential Minimal Optimization (SMO) method, Naïve Bayes, and other machine learning algorithms, we are able to distinguish between two classes of protein sequences: those folding to highly-designable conformations and those folding to poorly- or non-designable conformations.

Results: First, we generate all possible compact lattice conformations for the specified shape (a hexagon or a triangle) on the 2D triangular lattice. Then we generate all possible binary hydrophobic/polar (H/P) sequences and, using a specified energy function, thread them through all of these compact conformations. If a given sequence attains its lowest energy in a particular lattice conformation, we assume that the sequence folds to that conformation. Highly-designable conformations have many H/P sequences folding to them, while poorly-designable conformations have few or none. We classify sequences as folding to either highly- or poorly-designable conformations. We randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several standard machine learning algorithms.

Conclusion: By using these machine learning algorithms with ten-fold cross-validation, we are able to classify the two classes of sequences with high accuracy, in some cases exceeding 95%.
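The threading step described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three toy conformations (given only as lists of residue contacts, not enumerated on a real triangular lattice), the energy function (-1 per H-H contact), the rule that a tie for lowest energy means the sequence does not fold, and the helper names `energy`, `fold`, `designability`, and `label_sequences`.

```python
from itertools import product

# Hypothetical toy conformations for a 6-residue chain. Each one is a list of
# non-bonded contacts (i, j) between residue indices. They are illustrative
# only and are NOT real compact conformations from a triangular lattice.
CONFORMATIONS = {
    "A": [(0, 3), (0, 5), (2, 5)],
    "B": [(0, 2), (1, 4), (3, 5)],
    "C": [(0, 4), (1, 3), (1, 5)],
}


def energy(seq, contacts):
    """Assumed energy function: -1 for every H-H contact, 0 otherwise."""
    return -sum(1 for i, j in contacts if seq[i] == "H" and seq[j] == "H")


def fold(seq, conformations):
    """Thread seq through every conformation and return the name of the
    unique lowest-energy one, or None if the minimum is tied (assumption)."""
    energies = {name: energy(seq, c) for name, c in conformations.items()}
    best = min(energies.values())
    winners = [name for name, e in energies.items() if e == best]
    return winners[0] if len(winners) == 1 else None


def designability(conformations, length):
    """Enumerate all 2**length H/P sequences and count how many fold to
    each conformation. Returns (counts, folds) where folds maps seq -> name."""
    counts = {name: 0 for name in conformations}
    folds = {}
    for letters in product("HP", repeat=length):
        seq = "".join(letters)
        target = fold(seq, conformations)
        if target is not None:
            counts[target] += 1
            folds[seq] = target
    return counts, folds


def label_sequences(folds, counts, threshold):
    """Label each folding sequence 'high' or 'poor' depending on whether its
    target conformation has at least `threshold` sequences (hypothetical cutoff)."""
    return [
        (seq, "high" if counts[target] >= threshold else "poor")
        for seq, target in sorted(folds.items())
    ]


if __name__ == "__main__":
    counts, folds = designability(CONFORMATIONS, 6)
    print(counts)
    print(label_sequences(folds, counts, threshold=max(counts.values()))[:5])
```

The labeled pairs returned by `label_sequences` are the kind of input that could then be passed to a classifier and evaluated with ten-fold cross-validation; that training step is left out of the sketch.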

Original language: English (US)
Article number: 487
Journal: BMC Bioinformatics
State: Published - Nov 18 2008

All Science Journal Classification (ASJC) codes

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

