TY - GEN
T1 - Determining gains acquired from word embedding quantitatively using discrete distribution clustering
AU - Ye, Jianbo
AU - Li, Yanran
AU - Wu, Zhaohui
AU - Wang, James Z.
AU - Li, Wenjie
AU - Li, Jia
N1 - Publisher Copyright:
© 2017 Association for Computational Linguistics.
PY - 2017
Y1 - 2017
N2 - Word embeddings have become widely used in document analysis. While a large number of models for mapping words to vector spaces have been developed, it remains undetermined how much net gain can be achieved over traditional approaches based on bag-of-words. In this paper, we propose a new document clustering approach by combining any word embedding with a state-of-the-art algorithm for clustering empirical distributions. By using the Wasserstein distance between distributions, the word-to-word semantic relationship is taken into account in a principled way. The new clustering method is easy to use and consistently outperforms other methods on a variety of data sets. More importantly, the method provides an effective framework for determining when and how much word embeddings contribute to document analysis. Experimental results with multiple embedding models are reported.
UR - http://www.scopus.com/inward/record.url?scp=85040943836&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85040943836&partnerID=8YFLogxK
U2 - 10.18653/v1/P17-1169
DO - 10.18653/v1/P17-1169
M3 - Conference contribution
AN - SCOPUS:85040943836
T3 - ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
SP - 1847
EP - 1856
BT - ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
PB - Association for Computational Linguistics (ACL)
T2 - 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017
Y2 - 30 July 2017 through 4 August 2017
ER -