TY - GEN
T1 - Integration of static and dynamic code stylometry analysis for programmer de-anonymization
AU - Wang, Ningfei
AU - Ji, Shouling
AU - Wang, Ting
N1 - Publisher Copyright: © 2018 Association for Computing Machinery.
PY - 2018/10/15
Y1 - 2018/10/15
AB - De-anonymizing the authors of anonymous code (i.e., code stylometry) has significant privacy and security implications. Most existing code stylometry methods rely solely on static (e.g., lexical, layout, and syntactic) features extracted from source code, while neglecting its key difference from regular text: it is executable! In this paper, we present Sundae, a novel code de-anonymization framework that integrates both static and dynamic stylometry analysis. Sundae departs from existing solutions in three significant ways: (i) it requires far fewer static, handcrafted features; (ii) it requires much less labeled data for training; and (iii) it can be readily extended to new programmers once their stylometry information becomes available. Through extensive evaluation on benchmark datasets, we demonstrate that Sundae delivers strong empirical performance. For example, under the setting of 229 programmers and 9 problems, it outperforms the state-of-the-art method by a margin of 45.65% on Python code de-anonymization. The empirical results highlight the integration of static and dynamic analysis as a promising direction for code stylometry research.
UR - https://www.scopus.com/pages/publications/85056721853
U2 - 10.1145/3270101.3270110
DO - 10.1145/3270101.3270110
M3 - Conference contribution
AN - SCOPUS:85056721853
T3 - Proceedings of the ACM Conference on Computer and Communications Security
SP - 74
EP - 84
BT - AISec 2018 - Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security, co-located with CCS 2018
PB - Association for Computing Machinery
T2 - 11th ACM Workshop on Artificial Intelligence and Security, AISec 2018, co-located with CCS 2018
Y2 - 19 October 2018
ER -