Discovering Protein Function Classification Rules From Reduced Alphabet Representations of Protein Sequences

Carson M. Andorf, Drena L. Dobbs, Vasant G. Honavar

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Scopus citations

Abstract

The paper explores the use of reduced alphabet representations of protein sequences in the data-driven discovery of sequence motif-based decision trees for classifying protein sequences into functional families. A number of alternative representations of protein sequences (using a variety of reduced alphabets based on groupings of amino acids in terms of their physico-chemical properties) were explored in addition to the 20-letter amino acid alphabet. Classifiers were constructed using motifs generated using a multiple sequence alignment based motif discovery tool (MEME). Results of experiments on a data set of eleven protease families show that the classification performance of the resulting decision trees based on several reduced alphabets (e.g., a 7-letter alphabet based on groupings of amino acids based on their mass and charge, a 5-letter alphabet based on a random grouping of the 20 amino acids into 5 groups) is comparable to that of trees based on the 20-letter amino acid alphabet. The results also show that the sequence motifs based on different alphabets capture regularities in different portions of the sequences. This raises the possibility that the use of different alphabets might provide different, but complementary insights into protein structure-function relationships.
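As a rough illustration of what a reduced alphabet representation means in practice, the sketch below re-encodes a protein sequence using a random grouping of the 20 amino acids into 5 groups, the kind of alphabet the abstract mentions. It is a minimal sketch, not the authors' procedure: the function names (`random_grouping`, `reduce_sequence`), the round-robin assignment that yields equal-sized groups, the group labels, and the example sequence are all assumptions made for illustration. The paper's actual groupings, including the 7-letter mass-and-charge alphabet, are not given in this record.

```python
import random

# The standard 20-letter amino acid alphabet (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_grouping(n_groups, seed=0):
    """Randomly partition the 20 amino acids into n_groups groups.

    Returns a dict mapping each residue to a single-letter group label.
    Assumption: residues are shuffled and dealt round-robin, so groups
    are non-empty and nearly equal in size.
    """
    if not 1 <= n_groups <= len(AMINO_ACIDS):
        raise ValueError("n_groups must be between 1 and 20")
    rng = random.Random(seed)  # seeded for reproducibility
    residues = list(AMINO_ACIDS)
    rng.shuffle(residues)
    labels = "abcdefghijklmnopqrst"  # hypothetical group labels
    return {res: labels[i % n_groups] for i, res in enumerate(residues)}


def reduce_sequence(sequence, mapping, unknown="x"):
    """Re-encode a sequence in the reduced alphabet defined by mapping.

    Residues outside the 20-letter alphabet map to the `unknown` symbol.
    """
    return "".join(mapping.get(res, unknown) for res in sequence.upper())


if __name__ == "__main__":
    mapping = random_grouping(5, seed=0)
    example = "MKVLAAGIVGLL"  # made-up sequence, not from the data set
    print(reduce_sequence(example, mapping))
```

The reduced sequences could then be passed to a motif discovery tool in place of the original ones. How the authors prepared their input for MEME is not described here.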

Original language: English (US)
Title of host publication: Proceedings of the 6th Joint Conference on Information Sciences, JCIS 2002
Editors: J.H. Caulfield, S.H. Chen, H.D. Cheng, R. Duro, V. Honavar
Pages: 1200-1206
Number of pages: 7
State: Published - 2002
Event: Proceedings of the 6th Joint Conference on Information Sciences, JCIS 2002 - Research Triangle Park, NC, United States
Duration: Mar 8 2002 to Mar 13 2002

Publication series

Name: Proceedings of the Joint Conference on Information Sciences
Volume: 6

Other

Other: Proceedings of the 6th Joint Conference on Information Sciences, JCIS 2002
Country/Territory: United States
City: Research Triangle Park, NC
Period: 3/8/02 to 3/13/02

All Science Journal Classification (ASJC) codes

  • General Computer Science
