TY - GEN
T1 - Scalable community discovery on textual data with relations
AU - Li, Huajing
AU - Nie, Zaiqing
AU - Lee, Wang Chien
AU - Giles, C. Lee
AU - Wen, Ji Rong
PY - 2008
Y1 - 2008
N2 - Every piece of textual data is generated as a method to convey its authors' opinion regarding specific topics. Authors deliberately organize their writings and create links, i.e., references, acknowledgments, for better expression. Thereafter, it is of interest to study texts as well as their relations to understand the underlying topics and communities. Although many efforts exist in the literature in data clustering and topic mining, they are not applicable to community discovery on large document corpus for several reasons. First, few of them consider both textual attributes as well as relations.Second, scalability remains a significant issue for large-scale datasets. Additionally, most algorithms rely on a set of initial parameters that are hard to be captured and tuned. Motivated by the aforementioned observations, a hierarchical community model is proposed in the paper which distinguishes community cores from affiliated members. We present our efforts to develop a scalable community discovery solution for large-scale document corpus. Our proposal tries to quickly identify potential cores as seeds of communities through relation analysis. To eliminate the influence of initial parameters, an innovative attribute-based core merge process is introduced so that the algorithm promises to return consistent communities regardless initial parameters. Experimental results suggest that the proposed method has high scalability to corpus size and feature dimensionality, with more than 15% topical precision improvement compared with popular clustering techniques.
AB - Every piece of textual data is generated as a method to convey its authors' opinion regarding specific topics. Authors deliberately organize their writings and create links, i.e., references, acknowledgments, for better expression. Thereafter, it is of interest to study texts as well as their relations to understand the underlying topics and communities. Although many efforts exist in the literature in data clustering and topic mining, they are not applicable to community discovery on large document corpus for several reasons. First, few of them consider both textual attributes as well as relations.Second, scalability remains a significant issue for large-scale datasets. Additionally, most algorithms rely on a set of initial parameters that are hard to be captured and tuned. Motivated by the aforementioned observations, a hierarchical community model is proposed in the paper which distinguishes community cores from affiliated members. We present our efforts to develop a scalable community discovery solution for large-scale document corpus. Our proposal tries to quickly identify potential cores as seeds of communities through relation analysis. To eliminate the influence of initial parameters, an innovative attribute-based core merge process is introduced so that the algorithm promises to return consistent communities regardless initial parameters. Experimental results suggest that the proposed method has high scalability to corpus size and feature dimensionality, with more than 15% topical precision improvement compared with popular clustering techniques.
UR - http://www.scopus.com/inward/record.url?scp=70349260835&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70349260835&partnerID=8YFLogxK
U2 - 10.1145/1458082.1458241
DO - 10.1145/1458082.1458241
M3 - Conference contribution
AN - SCOPUS:70349260835
SN - 9781595939913
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1203
EP - 1212
BT - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
T2 - 17th ACM Conference on Information and Knowledge Management, CIKM'08
Y2 - 26 October 2008 through 30 October 2008
ER -