Semi-supervised multi-label topic models for document classification and sentence labeling

Hossein Soleimani, David J. Miller

Research output: Chapter in Book/Report/Conference proceedingConference contribution

32 Scopus citations

Abstract

Extracting parts of a text document relevant to a class label is a critical information retrieval task. We propose a semi-supervised multi-label topic model for jointly achieving document and sentence-level class inferences. Under our model, each sentence is associated with only a subset of the document's labels (including possibly none of them), with the label set of the document the union of the labels of all of its sentences. For training, we use both labeled documents, and, typically, a larger set of unlabeled documents. Our model, in a semisupervised fashion, discovers the topics present, learns associations between topics and class labels, predicts labels for new (or unlabeled) documents, and determines label associations for each sentence in every document. For learning, our model does not require any ground-truth labels on sentences. We develop a Hamil-tonian Monte Carlo based algorithm for efficiently sampling from the joint label distribution over all sentences, a very high-dimensional discrete space. Our experiments show that our approach outperforms several benchmark methods with respect to both document and sentence-level classification, as well as test set log-likelihood. All code for replicating our experiments is available from https://github.com/hsoleimani/MLTM.

Original languageEnglish (US)
Title of host publicationCIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management
PublisherAssociation for Computing Machinery
Pages105-114
Number of pages10
ISBN (Electronic)9781450340731
DOIs
StatePublished - Oct 24 2016
Event25th ACM International Conference on Information and Knowledge Management, CIKM 2016 - Indianapolis, United States
Duration: Oct 24 2016Oct 28 2016

Publication series

NameInternational Conference on Information and Knowledge Management, Proceedings
Volume24-28-October-2016

Other

Other25th ACM International Conference on Information and Knowledge Management, CIKM 2016
Country/TerritoryUnited States
CityIndianapolis
Period10/24/1610/28/16

All Science Journal Classification (ASJC) codes

  • General Decision Sciences
  • General Business, Management and Accounting

Fingerprint

Dive into the research topics of 'Semi-supervised multi-label topic models for document classification and sentence labeling'. Together they form a unique fingerprint.

Cite this