TY - GEN
T1 - Beyond word2vec
T2 - 28th ACM International Conference on Information and Knowledge Management, CIKM 2019
AU - Wang, Suhang
AU - Aggarwal, Charu
AU - Liu, Huan
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/11/3
Y1 - 2019/11/3
N2 - The word2vec methodology, encompassing methods such as Skip-gram and CBOW, has seen significant interest in recent years because of its ability to model semantic notions of word similarity and distances in sentences. A related methodology, referred to as doc2vec, is also able to embed sentences and paragraphs. These methodologies, however, lead to different embeddings that cannot be related to one another. In this paper, we present a tensor factorization methodology that simultaneously embeds words and sentences into latent representations in one shot. Furthermore, these latent representations are concretely related to one another via tensor factorization. Whereas word2vec and doc2vec depend on contextual windows to create the projections, our approach treats each document as a structural graph on words. Therefore, all the documents in the corpus are jointly factorized in order to simultaneously create embeddings for the individual documents and the words. Since the graphical representation of a document is much richer than a contextual window, the approach is capable of designing more powerful representations than those of the word2vec family of methods. We use a carefully designed negative sampling methodology to provide an efficient implementation of the approach. We relate the approach to factorization machines, which provides an efficient alternative for its implementation. We present experimental results illustrating the effectiveness of the approach for document classification, information retrieval, and visualization.
AB - The word2vec methodology, encompassing methods such as Skip-gram and CBOW, has seen significant interest in recent years because of its ability to model semantic notions of word similarity and distances in sentences. A related methodology, referred to as doc2vec, is also able to embed sentences and paragraphs. These methodologies, however, lead to different embeddings that cannot be related to one another. In this paper, we present a tensor factorization methodology that simultaneously embeds words and sentences into latent representations in one shot. Furthermore, these latent representations are concretely related to one another via tensor factorization. Whereas word2vec and doc2vec depend on contextual windows to create the projections, our approach treats each document as a structural graph on words. Therefore, all the documents in the corpus are jointly factorized in order to simultaneously create embeddings for the individual documents and the words. Since the graphical representation of a document is much richer than a contextual window, the approach is capable of designing more powerful representations than those of the word2vec family of methods. We use a carefully designed negative sampling methodology to provide an efficient implementation of the approach. We relate the approach to factorization machines, which provides an efficient alternative for its implementation. We present experimental results illustrating the effectiveness of the approach for document classification, information retrieval, and visualization.
UR - http://www.scopus.com/inward/record.url?scp=85075438955&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075438955&partnerID=8YFLogxK
U2 - 10.1145/3357384.3358051
DO - 10.1145/3357384.3358051
M3 - Conference contribution
AN - SCOPUS:85075438955
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1041
EP - 1050
BT - CIKM 2019 - Proceedings of the 28th ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
Y2 - 3 November 2019 through 7 November 2019
ER -