TY - GEN
T1 - Topic segmentation with shared topic detection and alignment of multiple documents
AU - Sun, Bingjun
AU - Mitra, Prasenjit
AU - Giles, C. Lee
AU - Yen, John
AU - Zha, Hongyuan
PY - 2007
Y1 - 2007
N2 - Topic detection and tracking and topic segmentation play an important role in capturing the local and sequential information of documents. Previous work in this area usually focuses on single documents, although multiple similar documents are available in many domains. In this paper, we introduce a novel unsupervised method for shared topic detection and topic segmentation of multiple similar documents based on mutual information (MI) and weighted mutual information (WMI), which combines MI with term weights. The basic idea is that the optimal segmentation maximizes MI (or WMI). Our approach can detect shared topics among documents. It can find the optimal boundaries in a document and align segments among documents at the same time. It can also handle single-document segmentation as a special case of multi-document segmentation and alignment. Our methods can identify and strengthen cue terms that can be used for segmentation and partially remove stop words by using term weights based on entropy learned from multiple documents. Our experimental results show that our algorithm works well for the tasks of single-document segmentation, shared topic detection, and multi-document segmentation. Utilizing information from multiple documents can tremendously improve the performance of topic segmentation, and using WMI is even better than using MI for multi-document segmentation.
AB - Topic detection and tracking and topic segmentation play an important role in capturing the local and sequential information of documents. Previous work in this area usually focuses on single documents, although multiple similar documents are available in many domains. In this paper, we introduce a novel unsupervised method for shared topic detection and topic segmentation of multiple similar documents based on mutual information (MI) and weighted mutual information (WMI), which combines MI with term weights. The basic idea is that the optimal segmentation maximizes MI (or WMI). Our approach can detect shared topics among documents. It can find the optimal boundaries in a document and align segments among documents at the same time. It can also handle single-document segmentation as a special case of multi-document segmentation and alignment. Our methods can identify and strengthen cue terms that can be used for segmentation and partially remove stop words by using term weights based on entropy learned from multiple documents. Our experimental results show that our algorithm works well for the tasks of single-document segmentation, shared topic detection, and multi-document segmentation. Utilizing information from multiple documents can tremendously improve the performance of topic segmentation, and using WMI is even better than using MI for multi-document segmentation.
UR - http://www.scopus.com/inward/record.url?scp=36448956401&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=36448956401&partnerID=8YFLogxK
U2 - 10.1145/1277741.1277778
DO - 10.1145/1277741.1277778
M3 - Conference contribution
AN - SCOPUS:36448956401
SN - 1595935975
SN - 9781595935977
T3 - Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
SP - 199
EP - 206
BT - Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
T2 - 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
Y2 - 23 July 2007 through 27 July 2007
ER -