Supervised machine learning methods for modeling word sense often rely on human labelers to provide a single ground truth label for each word in its context. We examine issues in establishing ground truth word sense labels using a fine-grained sense inventory from WordNet. Our data consist of a corpus of 1,000 sentences: 100 for each of ten moderately polysemous words. Each instance of a word was given multiple sense labels, or a multilabel, from trained and untrained annotators. The multilabels give a nuanced representation of the degree of agreement on instances. We analyze the sets of multilabels with a suite of assessment metrics, such as comparisons of sense distributions across annotators. Our assessment indicates that the general annotation procedure is reliable, but that words differ in how reliably annotators can assign WordNet sense labels, independent of the number of senses. We also investigate the performance of an unsupervised machine learning method for inferring ground truth labels from various combinations of labels from the trained and untrained annotators. We find tentative support for the hypothesis that performance depends on the quality of the set of multilabels, independent of the number of labelers or their training.
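As an illustration of what a comparison of sense distributions across annotators might look like, the following minimal sketch computes each annotator's relative sense frequencies and then the Jensen-Shannon divergence between two annotators. The abstract does not name the specific metrics used, so the choice of Jensen-Shannon divergence is an assumption, and the function names, sense identifiers, and example labels are hypothetical.

```python
from collections import Counter
from math import log2


def sense_distribution(multilabels, senses):
    """Relative frequency of each sense across one annotator's multilabels.

    multilabels: list of sets, one set of sense IDs per labeled instance.
    senses: ordered list of all sense IDs in the inventory for the word.
    """
    counts = Counter(s for labels in multilabels for s in labels)
    total = sum(counts[s] for s in senses)
    if total == 0:
        raise ValueError("annotator assigned no senses from the inventory")
    return [counts[s] / total for s in senses]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical sense inventory and labels for one word (not data from the study).
senses = ["s1", "s2", "s3"]
annotator_a = [{"s1"}, {"s1", "s2"}, {"s2"}]
annotator_b = [{"s1"}, {"s3"}, {"s2", "s3"}]

dist_a = sense_distribution(annotator_a, senses)
dist_b = sense_distribution(annotator_b, senses)
print(js_divergence(dist_a, dist_b))
```

A lower divergence means the two annotators use the word's senses in more similar proportions. It does not by itself show that they agree on individual instances.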
All Science Journal Classification (ASJC) codes
- Language and Linguistics
- Linguistics and Language
- Library and Information Sciences