TY - GEN
T1 - Feature selection in Web applications by ROC inflections and powerset pruning
AU - Coetzee, F. M.
AU - Glover, E.
AU - Lawrence, S.
AU - Giles, C. L.
N1 - Publisher Copyright:
© 2001 IEEE.
PY - 2001
Y1 - 2001
N2 - A basic problem of information processing is selecting enough features to ensure that events are accurately represented for classification problems, while simultaneously minimizing storage and processing of irrelevant or marginally important features. To address this problem, feature selection procedures perform a search through the feature power set to find the smallest subset meeting performance requirements. Major restrictions of existing procedures are that they typically explicitly or implicitly assume a fixed operating point, and make limited use of the statistical structure of the feature power set. We present a method that combines the Neyman-Pearson design procedure on finite data, with the directed set structure of the Receiver Operating Curves on the feature subsets, to determine the maximal size of the feature subsets that can be ranked in a given problem. The search can then be restricted to the smaller subsets, resulting in significant reductions in computational complexity. Optimizing the overall Receiver Operating Curve also allows for end users to select different operating points and cost functions to optimize. The algorithm also produces a natural method of Boolean representation of the minimal feature combinations that best describe the data near a given operating point. These representations are especially appropriate when describing data using common text-related features useful on the Web, such as thresholded TFIDF data. We show how to use these results to perform automatic Boolean query modification generation for distributed databases, such as niche metasearch engines.
AB - A basic problem of information processing is selecting enough features to ensure that events are accurately represented for classification problems, while simultaneously minimizing storage and processing of irrelevant or marginally important features. To address this problem, feature selection procedures perform a search through the feature power set to find the smallest subset meeting performance requirements. Major restrictions of existing procedures are that they typically explicitly or implicitly assume a fixed operating point, and make limited use of the statistical structure of the feature power set. We present a method that combines the Neyman-Pearson design procedure on finite data, with the directed set structure of the Receiver Operating Curves on the feature subsets, to determine the maximal size of the feature subsets that can be ranked in a given problem. The search can then be restricted to the smaller subsets, resulting in significant reductions in computational complexity. Optimizing the overall Receiver Operating Curve also allows for end users to select different operating points and cost functions to optimize. The algorithm also produces a natural method of Boolean representation of the minimal feature combinations that best describe the data near a given operating point. These representations are especially appropriate when describing data using common text-related features useful on the Web, such as thresholded TFIDF data. We show how to use these results to perform automatic Boolean query modification generation for distributed databases, such as niche metasearch engines.
UR - http://www.scopus.com/inward/record.url?scp=12244303929&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=12244303929&partnerID=8YFLogxK
U2 - 10.1109/SAINT.2001.905163
DO - 10.1109/SAINT.2001.905163
M3 - Conference contribution
AN - SCOPUS:12244303929
T3 - Proceedings - 2001 Symposium on Applications and the Internet, SAINT 2001
SP - 5
EP - 14
BT - Proceedings - 2001 Symposium on Applications and the Internet, SAINT 2001
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - Symposium on Applications and the Internet, SAINT 2001
Y2 - 8 January 2001 through 12 January 2001
ER -