What's the code? Automatic classification of source code archives

Secil Ugurel, Robert Krovetz, C. Lee Giles, David M. Pennock, Eric J. Glover, Hongyuan Zha

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

92 Scopus citations

Abstract

Many source code archives exist on the World Wide Web. These archives are usually organized by application category and programming language. However, manually organizing source code repositories is not a trivial task, since they are very large (on the order of terabytes) and grow rapidly. We demonstrate machine learning methods for automatic classification of archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and the Sourceforge archives. Support vector machine (SVM) classifiers are trained on examples of a given programming language or on programs in a specified category. We show that source code can be accurately and automatically classified into topical categories and can be identified as belonging to a specific programming language.
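As a rough illustration of the approach the abstract describes, the sketch below trains one linear SVM per programming language (one-vs-rest) on a bag of word tokens. Everything specific here is an assumption made for illustration and is not taken from the paper: the word-only tokenizer, the bias-free linear SVM fitted by plain subgradient descent on the regularized hinge loss, the hyperparameters, and the tiny hand-written training snippets. The function names (`featurize`, `train_linear_svm`, `train_one_vs_rest`, `predict`) are hypothetical.

```python
import re
from collections import Counter

def featurize(source):
    """Map source text to an L2-normalized bag of word tokens (assumed feature set)."""
    counts = Counter(re.findall(r"[A-Za-z_]+", source))
    norm = sum(c * c for c in counts.values()) ** 0.5 or 1.0
    return {tok: c / norm for tok, c in counts.items()}

def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(tok, 0.0) * val for tok, val in x.items())

def train_linear_svm(xs, ys, epochs=200, eta=0.5, lam=0.001):
    """Fit a bias-free linear SVM by subgradient descent on the
    L2-regularized hinge loss. ys holds +1 / -1 labels."""
    w = {}
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            violated = y * dot(w, x) < 1.0      # margin check before the step
            for tok in w:                        # gradient of the L2 penalty
                w[tok] *= 1.0 - eta * lam
            if violated:                         # hinge-loss subgradient
                for tok, val in x.items():
                    w[tok] = w.get(tok, 0.0) + eta * y * val
    return w

def train_one_vs_rest(samples):
    """Train one binary SVM per label from (source, label) pairs."""
    xs = [featurize(src) for src, _ in samples]
    labels = sorted({lab for _, lab in samples})
    return {
        lab: train_linear_svm(xs, [1 if l == lab else -1 for _, l in samples])
        for lab in labels
    }

def predict(models, source):
    """Return the label whose SVM gives the highest decision score."""
    x = featurize(source)
    return max(models, key=lambda lab: dot(models[lab], x))

# Toy training set (hypothetical snippets, not data from the archives).
TRAIN = [
    ('#include <stdio.h>\nint main(void) { printf("hello"); return 0; }', "C"),
    ('#include <stdlib.h>\nstruct node { struct node *next; };\n'
     'void release(struct node *p) { free(p); }', "C"),
    ('def main():\n    print("hello")\n\nif __name__ == "__main__":\n    main()',
     "Python"),
    ('import sys\n\ndef first(items):\n    for item in items:\n        yield item\n',
     "Python"),
    ('public class Main { public static void main(String[] args) '
     '{ System.out.println("hello"); } }', "Java"),
    ('import java.util.List;\npublic class Box { private List<String> items; }',
     "Java"),
]

models = train_one_vs_rest(TRAIN)

if __name__ == "__main__":
    for snippet in (
        '#include <string.h>\nstruct pair { char *key; };',
        'def total(values):\n    for value in values:\n        yield value',
        'public class Pair { private String key; }',
    ):
        print(predict(models, snippet))
```

The same one-vs-rest structure would apply to topical classification by swapping the language labels for application categories. A realistic setup would need far more training data and richer features than this toy example uses.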

Original language: English (US)
Title of host publication: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Editors: D. Hand, D. Keim, R. Ng
Pages: 632-638
Number of pages: 7
State: Published - 2002
Event: KDD - 2002 Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Edmonton, Alta, Canada
Duration: Jul 23 2002 → Jul 26 2002

Other

Other: KDD - 2002 Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country/Territory: Canada
City: Edmonton, Alta
Period: 7/23/02 → 7/26/02

All Science Journal Classification (ASJC) codes

  • Information Systems
