TY - GEN
T1 - Semi-supervised multi-label topic models for document classification and sentence labeling
AU - Soleimani, Hossein
AU - Miller, David J.
N1 - Publisher Copyright:
© 2016 Copyright held by the owner/author(s).
PY - 2016/10/24
Y1 - 2016/10/24
N2 - Extracting parts of a text document relevant to a class label is a critical information retrieval task. We propose a semi-supervised multi-label topic model for jointly achieving document and sentence-level class inferences. Under our model, each sentence is associated with only a subset of the document's labels (including possibly none of them), with the label set of the document the union of the labels of all of its sentences. For training, we use both labeled documents, and, typically, a larger set of unlabeled documents. Our model, in a semi-supervised fashion, discovers the topics present, learns associations between topics and class labels, predicts labels for new (or unlabeled) documents, and determines label associations for each sentence in every document. For learning, our model does not require any ground-truth labels on sentences. We develop a Hamiltonian Monte Carlo based algorithm for efficiently sampling from the joint label distribution over all sentences, a very high-dimensional discrete space. Our experiments show that our approach outperforms several benchmark methods with respect to both document and sentence-level classification, as well as test set log-likelihood. All code for replicating our experiments is available from https://github.com/hsoleimani/MLTM.
AB - Extracting parts of a text document relevant to a class label is a critical information retrieval task. We propose a semi-supervised multi-label topic model for jointly achieving document and sentence-level class inferences. Under our model, each sentence is associated with only a subset of the document's labels (including possibly none of them), with the label set of the document the union of the labels of all of its sentences. For training, we use both labeled documents, and, typically, a larger set of unlabeled documents. Our model, in a semi-supervised fashion, discovers the topics present, learns associations between topics and class labels, predicts labels for new (or unlabeled) documents, and determines label associations for each sentence in every document. For learning, our model does not require any ground-truth labels on sentences. We develop a Hamiltonian Monte Carlo based algorithm for efficiently sampling from the joint label distribution over all sentences, a very high-dimensional discrete space. Our experiments show that our approach outperforms several benchmark methods with respect to both document and sentence-level classification, as well as test set log-likelihood. All code for replicating our experiments is available from https://github.com/hsoleimani/MLTM.
UR - http://www.scopus.com/inward/record.url?scp=84996593706&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84996593706&partnerID=8YFLogxK
U2 - 10.1145/2983323.2983752
DO - 10.1145/2983323.2983752
M3 - Conference contribution
AN - SCOPUS:84996593706
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 105
EP - 114
BT - CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management
PB - Association for Computing Machinery
T2 - 25th ACM International Conference on Information and Knowledge Management, CIKM 2016
Y2 - 24 October 2016 through 28 October 2016
ER -