TY - GEN
T1 - Relation between agreement measures on human labeling and machine learning performance
T2 - 6th International Conference on Language Resources and Evaluation, LREC 2008
AU - Passonneau, Rebecca J.
AU - Lippincott, Tom
AU - Yano, Tae
AU - Klavans, Judith
N1 - Funding Information:
This work was conducted as part of the project, Computational Linguistics for Metadata Building (CLiMB), supported by an award from the Mellon Foundation to the University of Maryland. The authors extend enormous thanks to Roberta Blitz, David Elson, Angela Giral, and Dustin Weese, who helped provide expert feedback on the functional categories, on our initial labeling interface and labeling guidelines, and on the issue of consistency among image catalogers. We also thank the annotators who helped label the pilot dataset: James Masciuch, Adam Goodkind, Justin Cranshaw and another we know only as Ginger.
PY - 2008
Y1 - 2008
AB - We discuss factors that affect human agreement on a semantic labeling task in the art history domain, based on the results of four experiments where we varied the number of labels annotators could assign, the number of annotators, the type and amount of training they received, and the size of the text span being labeled. Using the labelings from one experiment involving seven annotators, we investigate the relation between interannotator agreement and machine learning performance. We construct binary classifiers and vary the training and test data by swapping the labelings from the seven annotators. First, we find performance is often quite good despite lower-than-recommended interannotator agreement. Second, we find that on average, learning performance for a given functional semantic category correlates with the overall agreement among the seven annotators for that category. Third, we find that learning performance on the data from a given annotator does not correlate with the quality of that annotator's labeling. We offer recommendations for the use of labeled data in machine learning, and argue that learners should attempt to accommodate human variation. We also note implications for large-scale corpus annotation projects that deal with similarly subjective phenomena.
UR - http://www.scopus.com/inward/record.url?scp=84880379979&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84880379979&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84880379979
T3 - Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
SP - 2841
EP - 2848
BT - Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008
PB - European Language Resources Association (ELRA)
Y2 - 28 May 2008 through 30 May 2008
ER -