Annotating the MASC corpus with BabelNet

Andrea Moro, Roberto Navigli, Francesco Maria Tucci, Rebecca J. Passonneau

Research output: Chapter in Book/Report/Conference proceedingConference contribution

12 Scopus citations

Abstract

In this paper we tackle the problem of automatically annotating, with both word senses and named entities, the MASC 3.0 corpus, a large English corpus covering a wide range of genres of written and spoken text. We use BabelNet 2.0, a multilingual semantic network which integrates both lexicographic and encyclopedic knowledge, as our sense/entity inventory together with its semantic structure, to perform the aforementioned annotation task. Word sense annotated corpora have been around for more than twenty years, helping the development of Word Sense Disambiguation algorithms by providing both training and testing grounds. More recently Entity Linking has followed the same path, with the creation of huge resources containing annotated named entities. However, to date, there has been no resource that contains both kinds of annotation. In this paper we present an automatic approach for performing this annotation, together with its output on the MASC corpus. We use this corpus because its goal of integrating different types of annotations goes exactly in our same direction. Our overall aim is to stimulate research on the joint exploitation and disambiguation of word senses and named entities. Finally, we estimate the quality of our annotations using both manually-tagged named entities and word senses, obtaining an accuracy of roughly 70% for both named entities and word sense annotations.

Original languageEnglish (US)
Title of host publicationProceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
EditorsNicoletta Calzolari, Khalid Choukri, Sara Goggi, Thierry Declerck, Joseph Mariani, Bente Maegaard, Asuncion Moreno, Jan Odijk, Helene Mazo, Stelios Piperidis, Hrafn Loftsson
PublisherEuropean Language Resources Association (ELRA)
Pages4214-4219
Number of pages6
ISBN (Electronic)9782951740884
StatePublished - 2014
Event9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
Duration: May 26 2014May 31 2014

Publication series

NameProceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014

Other

Other9th International Conference on Language Resources and Evaluation, LREC 2014
Country/TerritoryIceland
CityReykjavik
Period5/26/145/31/14

All Science Journal Classification (ASJC) codes

  • Linguistics and Language
  • Library and Information Sciences
  • Education
  • Language and Linguistics

Fingerprint

Dive into the research topics of 'Annotating the MASC corpus with BabelNet'. Together they form a unique fingerprint.

Cite this