TY - GEN
T1 - Assessing the performance of macromolecular sequence classifiers
AU - Caragea, Cornelia
AU - Sinapov, Jivko
AU - Honavar, Vasant
AU - Dobbs, Drena
PY - 2007/12/1
Y1 - 2007/12/1
N2 - Machine learning offers some of the most cost-effective approaches to building predictive models (e.g., classifiers) for a broad range of applications in computational biology. Comparing the effectiveness of different algorithms requires reliable procedures for accurately assessing the performance (e.g., accuracy, sensitivity, and specificity) of the resulting predictive classifiers. The difficulty of this task is compounded by the use of different data selection and evaluation procedures and, in some cases, even different definitions for the same performance measures. We explore the problem of assessing the performance of predictive classifiers trained on macromolecular sequence data, with an emphasis on cross-validation and data selection methods. Specifically, we compare sequence-based and window-based cross-validation procedures on three sequence-based prediction tasks: identification of glycosylation sites, RNA-protein interface residues, and protein-protein interface residues from amino acid sequence. Our experiments with two representative classifiers (Naive Bayes and Support Vector Machine) show that sequence-based and window-based cross-validation procedures and data selection methods can yield different estimates of commonly used performance measures such as accuracy, Matthews correlation coefficient, and area under the Receiver Operating Characteristic curve. We argue that sequence-based cross-validation provides more realistic performance estimates than window-based cross-validation.
AB - Machine learning offers some of the most cost-effective approaches to building predictive models (e.g., classifiers) for a broad range of applications in computational biology. Comparing the effectiveness of different algorithms requires reliable procedures for accurately assessing the performance (e.g., accuracy, sensitivity, and specificity) of the resulting predictive classifiers. The difficulty of this task is compounded by the use of different data selection and evaluation procedures and, in some cases, even different definitions for the same performance measures. We explore the problem of assessing the performance of predictive classifiers trained on macromolecular sequence data, with an emphasis on cross-validation and data selection methods. Specifically, we compare sequence-based and window-based cross-validation procedures on three sequence-based prediction tasks: identification of glycosylation sites, RNA-protein interface residues, and protein-protein interface residues from amino acid sequence. Our experiments with two representative classifiers (Naive Bayes and Support Vector Machine) show that sequence-based and window-based cross-validation procedures and data selection methods can yield different estimates of commonly used performance measures such as accuracy, Matthews correlation coefficient, and area under the Receiver Operating Characteristic curve. We argue that sequence-based cross-validation provides more realistic performance estimates than window-based cross-validation.
UR - http://www.scopus.com/inward/record.url?scp=47649094232&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=47649094232&partnerID=8YFLogxK
U2 - 10.1109/BIBE.2007.4375583
DO - 10.1109/BIBE.2007.4375583
M3 - Conference contribution
AN - SCOPUS:47649094232
SN - 1424415098
SN - 9781424415090
T3 - Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE
SP - 320
EP - 326
BT - Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE
T2 - 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE
Y2 - 14 January 2007 through 17 January 2007
ER -