TY - GEN
T1 - Detecting research topics via the correlation between graphs and texts
AU - Jo, Yookyung
AU - Lagoze, Carl
AU - Giles, C. Lee
PY - 2007
Y1 - 2007
N2 - In this paper we address the problem of detecting topics in large-scale linked document collections. Recently, topic detection has become a very active area of research due to its utility for information navigation, trend analysis, and high-level description of data. We present a unique approach that uses the correlation between the distribution of a term that represents a topic and the link distribution in the citation graph where the nodes are limited to the documents containing the term. This tight coupling between term and graph analysis is distinguished from other approaches such as those that focus on language models. We develop a topic score measure for each term, using the likelihood ratio of binary hypotheses based on a probabilistic description of graph connectivity. Our approach is based on the intuition that if a term is relevant to a topic, the documents containing the term have denser connectivity than a random selection of documents. We extend our algorithm to detect a topic represented by a set of terms, using the intuition that if the co-occurrence of terms represents a new topic, the citation pattern should exhibit the synergistic effect. We test our algorithm on two electronic research literature collections, arXiv and Citeseer. Our evaluation shows that the approach is effective and reveals some novel aspects of topic detection.
AB - In this paper we address the problem of detecting topics in large-scale linked document collections. Recently, topic detection has become a very active area of research due to its utility for information navigation, trend analysis, and high-level description of data. We present a unique approach that uses the correlation between the distribution of a term that represents a topic and the link distribution in the citation graph where the nodes are limited to the documents containing the term. This tight coupling between term and graph analysis is distinguished from other approaches such as those that focus on language models. We develop a topic score measure for each term, using the likelihood ratio of binary hypotheses based on a probabilistic description of graph connectivity. Our approach is based on the intuition that if a term is relevant to a topic, the documents containing the term have denser connectivity than a random selection of documents. We extend our algorithm to detect a topic represented by a set of terms, using the intuition that if the co-occurrence of terms represents a new topic, the citation pattern should exhibit the synergistic effect. We test our algorithm on two electronic research literature collections, arXiv and Citeseer. Our evaluation shows that the approach is effective and reveals some novel aspects of topic detection.
UR - http://www.scopus.com/inward/record.url?scp=36849020034&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=36849020034&partnerID=8YFLogxK
U2 - 10.1145/1281192.1281234
DO - 10.1145/1281192.1281234
M3 - Conference contribution
AN - SCOPUS:36849020034
SN - 1595936092
SN - 9781595936097
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 370
EP - 379
BT - KDD-2007
T2 - KDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Y2 - 12 August 2007 through 15 August 2007
ER -