TY - GEN
T1 - Statistical Unigram Analysis for Source Code Repository
AU - Xu, Weifeng
AU - Xu, Dianxiang
AU - El Ariss, Omar
AU - Liu, Yunkai
AU - Alatawi, Abdulrahman
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/6/30
Y1 - 2017/6/30
N2 - The unigram is the fundamental element of the n-gram in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open-source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that unigrams collected from source code repositories are essential resources for solving domain-specific problems.
AB - The unigram is the fundamental element of the n-gram in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open-source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that unigrams collected from source code repositories are essential resources for solving domain-specific problems.
UR - http://www.scopus.com/inward/record.url?scp=85027725617&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85027725617&partnerID=8YFLogxK
U2 - 10.1109/BigMM.2017.13
DO - 10.1109/BigMM.2017.13
M3 - Conference contribution
AN - SCOPUS:85027725617
T3 - Proceedings - 2017 IEEE 3rd International Conference on Multimedia Big Data, BigMM 2017
SP - 1
EP - 8
BT - Proceedings - 2017 IEEE 3rd International Conference on Multimedia Big Data, BigMM 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd IEEE International Conference on Multimedia Big Data, BigMM 2017
Y2 - 19 April 2017 through 21 April 2017
ER -