TY - GEN
T1 - Reducing noise in labels and features for a real world dataset
T2 - 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2009
AU - Passonneau, Rebecca J.
AU - Rudin, Cynthia
AU - Radeva, Axinia
AU - Liu, Zhi An
PY - 2009
Y1 - 2009
AB - This paper illustrates how a combination of information extraction, machine learning, and NLP corpus annotation practice was applied to a problem of ranking vulnerability of structures (service boxes, manholes) in the Manhattan electrical grid. By adapting NLP corpus annotation methods to the task of knowledge transfer from domain experts, we compensated for the lack of operational definitions of components of the model, such as "serious event". The machine learning depended on the ticket classes, but it was not the end goal. Rather, our rule-based document classification determines both the labels of examples and their feature representations. Changes in our classification of events led to improvements in our model, as reflected in the AUC scores for the full ranked list of over 51K structures. The improvements for the very top of the ranked list, which is of most importance for prioritizing work on the electrical grid, affected one in every four or five structures.
UR - http://www.scopus.com/inward/record.url?scp=67650535515&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=67650535515&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-00382-0_7
DO - 10.1007/978-3-642-00382-0_7
M3 - Conference contribution
AN - SCOPUS:67650535515
SN - 3642003818
SN - 9783642003813
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 86
EP - 97
BT - Computational Linguistics and Intelligent Text Processing - 10th International Conference, CICLing 2009, Proceedings
Y2 - 1 March 2009 through 7 March 2009
ER -