Abstract
Annotation projects dealing with complex semantic or pragmatic phenomena face a dilemma: create annotation schemes that oversimplify the phenomena, or capture distinctions that conventional reliability metrics cannot measure adequately. The solution to the dilemma is to develop metrics that quantify the decisions annotators are asked to make. This paper discusses MASI, a distance metric for comparing sets, and illustrates its use in quantifying the reliability of a specific dataset. Annotations of Summary Content Units (SCUs) generate models referred to as pyramids, which can be used to evaluate unseen human summaries or machine summaries. The paper presents reliability results for five pairs of pyramids created for document sets from the 2003 Document Understanding Conference (DUC). The annotators worked independently of each other. Differences between the application of MASI to pyramid annotation and its previous application to co-reference annotation are discussed. In addition, it is argued that a paradigmatic reliability study should relate measures of inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other phenomena. In effect, what counts as sufficiently reliable inter-annotator agreement depends on the use to which the annotated data will be put.
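For readers unfamiliar with the metric, the sketch below shows one common formulation of MASI (as implemented, for example, in NLTK's `nltk.metrics.masi_distance`): Jaccard overlap scaled by a monotonicity weight, expressed here as a distance. The function name and the example SCU label sets are illustrative, not taken from the paper.

```python
from typing import Hashable, Set


def masi_distance(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Distance between two sets: 1 - (Jaccard similarity * monotonicity weight)."""
    if not a and not b:
        return 0.0  # treat two empty sets as identical

    intersection = a & b
    jaccard = len(intersection) / len(a | b)

    # Monotonicity weight: 1 for identical sets, 2/3 if one set is a
    # subset of the other, 1/3 for partial overlap, 0 if disjoint.
    if a == b:
        m = 1.0
    elif a <= b or b <= a:
        m = 2 / 3
    elif intersection:
        m = 1 / 3
    else:
        m = 0.0

    return 1.0 - jaccard * m


# Example: partially overlapping label sets from two annotators
print(masi_distance({"scu1", "scu2", "scu3"}, {"scu2", "scu3", "scu4"}))  # ~0.833
```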
| Original language | English (US) |
| --- | --- |
| Pages | 831-836 |
| Number of pages | 6 |
| State | Published - 2006 |
| Event | 5th International Conference on Language Resources and Evaluation, LREC 2006 - Genoa, Italy (Duration: May 22 2006 → May 28 2006) |
Other
| Other | 5th International Conference on Language Resources and Evaluation, LREC 2006 |
| --- | --- |
| Country/Territory | Italy |
| City | Genoa |
| Period | 5/22/06 → 5/28/06 |
All Science Journal Classification (ASJC) codes
- Education
- Library and Information Sciences
- Linguistics and Language
- Language and Linguistics