Abstract
Source code archives on the World Wide Web are usually organized by application category and programming language. However, organizing these repositories manually is not a trivial task, since they are very large (on the order of terabytes) and grow rapidly. We demonstrate machine learning methods for automatically classifying archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and the Sourceforge archives. Support vector machine (SVM) classifiers are trained on examples of a given programming language or of programs in a specified category. We show that source code can be accurately and automatically classified into topical categories and identified as belonging to a specific programming language.
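The idea of training an SVM on examples of a given programming language can be illustrated with a minimal sketch. It is not the paper's implementation: the toy snippets, the tokenizer, the term-frequency features, and the names `tokenize`, `featurize`, `train_linear_svm`, and `classify` are all assumptions made for illustration, and the linear SVM is trained with a Pegasos-style stochastic subgradient method chosen here only to keep the example free of third-party dependencies.

```python
import random
import re
from collections import Counter

# Toy training data, invented for this sketch (not taken from the paper).
# Label +1 means "C", label -1 means "Python".
C_SNIPPETS = [
    '#include <stdio.h>\nint main(void) { printf("hello"); return 0; }',
    'static int add(int a, int b) { return a + b; }',
    'struct node { int value; struct node *next; };',
]
PY_SNIPPETS = [
    'def greet(name):\n    print("hi", name)',
    'import sys\nfor arg in sys.argv:\n    print(arg)',
    'class Stack:\n    def push(self, item):\n        self.items.append(item)',
]


def tokenize(code):
    """Split code into words plus a few punctuation marks (a simplification)."""
    return re.findall(r"[A-Za-z_]+|[#{};:]", code)


def featurize(code):
    """Map a snippet to sparse term-frequency features {token: frequency}."""
    counts = Counter(tokenize(code))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in counts.items()}


def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(tok, 0.0) * v for tok, v in x.items())


def train_linear_svm(examples, lam=0.01, steps=2000, seed=0):
    """Train a linear SVM (hinge loss, L2 penalty) by stochastic subgradient steps."""
    rng = random.Random(seed)
    w = {}
    for t in range(1, steps + 1):
        x, y = rng.choice(examples)
        margin = y * dot(w, x)
        shrink = 1.0 - 1.0 / t  # effect of the L2 penalty with step 1/(lam*t)
        for tok in w:
            w[tok] *= shrink
        if margin < 1.0:  # example violates the margin: move toward it
            eta = 1.0 / (lam * t)
            for tok, v in x.items():
                w[tok] = w.get(tok, 0.0) + eta * y * v
    return w


def classify(w, code):
    """Return "c" if the decision value is positive, otherwise "python"."""
    return "c" if dot(w, featurize(code)) > 0 else "python"


examples = [(featurize(s), +1) for s in C_SNIPPETS]
examples += [(featurize(s), -1) for s in PY_SNIPPETS]
weights = train_linear_svm(examples)

print(classify(weights, "int total = 0; while (total < 10) { total++; }"))
print(classify(weights, "def total(values):\n    for v in values:\n        print(v)"))
```

With one such binary classifier per language (one versus the rest), the same scheme would extend to several languages. A realistic setup would need far more training data and a richer feature set than this toy vocabulary.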
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |
Editors | D. Hand, D. Keim, R. Ng |
Pages | 632-638 |
Number of pages | 7 |
State | Published - 2002 |
Event | KDD - 2002 Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Edmonton, Alta, Canada. Duration: Jul 23 2002 → Jul 26 2002 |
Other
Other | KDD - 2002 Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |
---|---|
Country/Territory | Canada |
City | Edmonton, Alta |
Period | 7/23/02 → 7/26/02 |
All Science Journal Classification (ASJC) codes
- Information Systems