Statistical Unigram Analysis for Source Code Repository

Weifeng Xu, Dianxiang Xu, Omar El Ariss, Yunkai Liu, Abdulrahman Alatawi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Scopus citations

Abstract

The unigram is the fundamental element of n-gram models in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open source projects hosted on GitHub.com. By analyzing these unigrams, we have discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that unigrams collected from source code repositories are an essential resource for solving domain-specific problems.

Original language: English (US)
Title of host publication: Proceedings - 2017 IEEE 3rd International Conference on Multimedia Big Data, BigMM 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-8
Number of pages: 8
ISBN (Electronic): 9781509065493
State: Published - Jun 30 2017
Event: 3rd IEEE International Conference on Multimedia Big Data, BigMM 2017 - Laguna Hills, United States
Duration: Apr 19 2017 to Apr 21 2017

Publication series

Name: Proceedings - 2017 IEEE 3rd International Conference on Multimedia Big Data, BigMM 2017

Other

Other: 3rd IEEE International Conference on Multimedia Big Data, BigMM 2017
Country/Territory: United States
City: Laguna Hills
Period: 4/19/17 to 4/21/17

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
  • Media Technology
