A generalized topic modeling approach for automatic document annotation

Suppawong Tuarob, Line C. Pouchard, Prasenjit Mitra, C. Lee Giles

Research output: Contribution to journal › Article › peer-review

32 Scopus citations

Abstract

Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify their hypotheses. Over time, such data have increased not only in amount but also in the diversity and heterogeneity of the data sources, which are spread throughout the world. This heterogeneity poses a huge challenge for scientists who have to manually search for desired data. ONEMercury has recently been implemented as part of the DataONE project to alleviate such problems and to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata records from multiple archives and repositories, and makes them searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which could impede effective retrieval. We propose a methodology that learns the annotation from well-annotated collections of metadata records to automatically annotate poorly annotated ones. The problem is first transformed into a tag recommendation problem with a controlled tag library. Then, two variants of an algorithm for automatic tag recommendation are presented. Experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the nature of the different datasets. We also discuss relevant topics such as using topical coherence to fine-tune parameters and experiments on cross-archive annotation.

Original language: English (US)
Pages (from-to): 111-128
Number of pages: 18
Journal: International Journal on Digital Libraries
Volume: 16
Issue number: 2
State: Published - Jun 27 2015

All Science Journal Classification (ASJC) codes

  • Library and Information Sciences