TY - JOUR
T1 - Protein sequence classification using feature hashing
AU - Caragea, Cornelia
AU - Silvescu, Adrian
AU - Mitra, Prasenjit
N1 - Funding Information: This research was funded in part by an NSF grant #0845487 to Prasenjit Mitra. Publisher Copyright: © 2012 Caragea et al.
PY - 2012
Y1 - 2012
N2 - Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, dimensionality reduction techniques can be crucial for both the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification. Feature hashing "reduces" the original high-dimensional space by using a hash function to map features to hash keys in a low-dimensional space; multiple features can be mapped (at random) to the same hash key, and their counts are "aggregated". We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
UR - https://www.scopus.com/pages/publications/85011792680
UR - https://www.scopus.com/inward/citedby.url?scp=85011792680&partnerID=8YFLogxK
U2 - 10.1186/1477-5956-10-s1-s14
DO - 10.1186/1477-5956-10-s1-s14
M3 - Article
AN - SCOPUS:85011792680
SN - 1477-5956
VL - 10
JO - Proteome Science
JF - Proteome Science
M1 - S14
ER -