TY - GEN
T1 - Protein sequence classification using feature hashing
AU - Caragea, Cornelia
AU - Silvescu, Adrian
AU - Mitra, Prasenjit
PY - 2011
Y1 - 2011
N2 - Recent advances in next-generation sequencing technologies have resulted in an exponential increase in protein sequence data. The k-gram representation, used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, dimensionality reduction techniques can be crucial for the performance and complexity of the learning algorithms. We study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is reduced by mapping features to hash keys, such that multiple features can be mapped (at random) to the same key, and aggregating the counts of features that share a key. We compare feature hashing with the bag of k-grams and feature selection approaches. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
UR - http://www.scopus.com/inward/record.url?scp=84856044057&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84856044057&partnerID=8YFLogxK
U2 - 10.1109/BIBM.2011.91
DO - 10.1109/BIBM.2011.91
M3 - Conference contribution
AN - SCOPUS:84856044057
SN - 9780769545745
T3 - Proceedings - 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011
SP - 538
EP - 543
BT - Proceedings - 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011
T2 - 2011 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011
Y2 - 12 November 2011 through 15 November 2011
ER -