TY - GEN
T1 - Towards noise-resilient document modeling
AU - Yang, Tao
AU - Lee, Dongwon
PY - 2011
Y1 - 2011
N2 - We introduce a generative probabilistic document model, based on latent Dirichlet allocation (LDA), to deal with textual errors in a document collection. Our model is inspired by the fact that most large-scale text data are machine-generated and thus inevitably contain many types of noise. The new model, termed TE-LDA, extends traditional LDA by adding a switch variable to the term generation process in order to handle noisy text data. Extensive experiments on both real and synthetic data sets validate the efficacy of the proposed model.
AB - We introduce a generative probabilistic document model, based on latent Dirichlet allocation (LDA), to deal with textual errors in a document collection. Our model is inspired by the fact that most large-scale text data are machine-generated and thus inevitably contain many types of noise. The new model, termed TE-LDA, extends traditional LDA by adding a switch variable to the term generation process in order to handle noisy text data. Extensive experiments on both real and synthetic data sets validate the efficacy of the proposed model.
UR - http://www.scopus.com/inward/record.url?scp=83055165905&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83055165905&partnerID=8YFLogxK
U2 - 10.1145/2063576.2063962
DO - 10.1145/2063576.2063962
M3 - Conference contribution
AN - SCOPUS:83055165905
SN - 9781450307178
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 2345
EP - 2348
BT - CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management
T2 - 20th ACM Conference on Information and Knowledge Management, CIKM'11
Y2 - 24 October 2011 through 28 October 2011
ER -