Topic segmentation with shared topic detection and alignment of multiple documents

Bingjun Sun, Prasenjit Mitra, C. Lee Giles, John Yen, Hongyuan Zha

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

36 Scopus citations

Abstract

Topic detection and tracking and topic segmentation play an important role in capturing the local and sequential information of documents. Previous work in this area usually focuses on single documents, although multiple similar documents are available in many domains. In this paper, we introduce a novel unsupervised method for shared topic detection and topic segmentation of multiple similar documents based on mutual information (MI) and weighted mutual information (WMI), which combines MI with term weights. The basic idea is that the optimal segmentation maximizes MI (or WMI). Our approach can detect shared topics among documents, find the optimal boundaries in a document, and align segments among documents at the same time. It can also handle single-document segmentation as a special case of multi-document segmentation and alignment. By using term weights based on entropy learned from multiple documents, our methods can identify and strengthen cue terms that are useful for segmentation and partially remove stop words. Our experimental results show that our algorithm works well for the tasks of single-document segmentation, shared topic detection, and multi-document segmentation. Utilizing information from multiple documents can tremendously improve the performance of topic segmentation, and WMI performs better than MI for multi-document segmentation.
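
The abstract's central idea, that the optimal segmentation maximizes MI, can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a single document, exactly two segments, a plug-in MI estimate from raw term counts, and an exhaustive search over sentence boundaries. The function names `mutual_information` and `best_two_way_split` are hypothetical, and term weighting (WMI) and multi-document alignment are left out.

```python
import math
from collections import Counter


def mutual_information(segments):
    """Estimate I(T; S) in nats between terms T and segments S.

    `segments` is a list of token lists. Probabilities are plain
    relative frequencies (an assumption of this sketch).
    """
    total = sum(len(seg) for seg in segments)
    term_counts = Counter(tok for seg in segments for tok in seg)
    mi = 0.0
    for seg in segments:
        p_s = len(seg) / total  # probability of the segment
        for term, n in Counter(seg).items():
            p_ts = n / total  # joint probability of (term, segment)
            p_t = term_counts[term] / total  # marginal probability of the term
            mi += p_ts * math.log(p_ts / (p_t * p_s))
    return mi


def best_two_way_split(sentences):
    """Try every boundary between sentences and keep the one with highest MI.

    `sentences` is a list of non-empty token lists. Returns a tuple
    (boundary_index, mi), where the boundary falls before that sentence.
    """
    best = None
    for b in range(1, len(sentences)):
        left = [tok for sent in sentences[:b] for tok in sent]
        right = [tok for sent in sentences[b:] for tok in sent]
        score = mutual_information([left, right])
        if best is None or score > best[1]:
            best = (b, score)
    return best


# Toy document: two sentences about animals, then two about finance.
toy = [["cat", "dog"], ["dog", "cat"], ["stock", "bond"], ["bond", "stock"]]
boundary, score = best_two_way_split(toy)
```

The number of segments is fixed at two here because this plug-in estimate cannot decrease when a segment is split further, so comparing raw MI across different segment counts would favor over-segmentation.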

Original language: English (US)
Title of host publication: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
Pages: 199-206
Number of pages: 8
State: Published - 2007
Event: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 - Amsterdam, Netherlands
Duration: Jul 23 2007 to Jul 27 2007

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Software
  • Applied Mathematics
