TY - GEN
T1 - Value-based program characterization and its application to software plagiarism detection
AU - Jhi, Yoon Chan
AU - Wang, Xinran
AU - Jia, Xiaoqi
AU - Zhu, Sencun
AU - Liu, Peng
AU - Wu, Dinghao
PY - 2011
Y1 - 2011
N2 - Identifying similar or identical code fragments becomes much more challenging in code theft cases where plagiarizers can use various automated code transformation techniques to hide stolen code from detection. Previous work in this field is largely limited in that (1) most of it cannot handle advanced obfuscation techniques; (2) methods based on source code analysis are less practical, since the source code of suspicious programs is typically not available until strong evidence is collected; and (3) methods that depend on the features of specific operating systems or programming languages have limited applicability. Based on the observation that some critical runtime values are hard to replace or eliminate with semantics-preserving transformation techniques, we introduce a novel approach to dynamic characterization of executable programs. Leveraging such invariant values, our technique is resilient to various control and data obfuscation techniques. We show how the values can be extracted and refined to expose the critical values and how we can apply this runtime property to help solve problems in software plagiarism detection. We have implemented a prototype with a dynamic taint analyzer atop a generic processor emulator. Our experimental results show that the value-based method successfully discriminates 34 plagiarisms obfuscated by SandMark, plagiarisms heavily obfuscated by KlassMaster, programs obfuscated by Thicket, and executables obfuscated by Loco/Diablo.
AB - Identifying similar or identical code fragments becomes much more challenging in code theft cases where plagiarizers can use various automated code transformation techniques to hide stolen code from detection. Previous work in this field is largely limited in that (1) most of it cannot handle advanced obfuscation techniques; (2) methods based on source code analysis are less practical, since the source code of suspicious programs is typically not available until strong evidence is collected; and (3) methods that depend on the features of specific operating systems or programming languages have limited applicability. Based on the observation that some critical runtime values are hard to replace or eliminate with semantics-preserving transformation techniques, we introduce a novel approach to dynamic characterization of executable programs. Leveraging such invariant values, our technique is resilient to various control and data obfuscation techniques. We show how the values can be extracted and refined to expose the critical values and how we can apply this runtime property to help solve problems in software plagiarism detection. We have implemented a prototype with a dynamic taint analyzer atop a generic processor emulator. Our experimental results show that the value-based method successfully discriminates 34 plagiarisms obfuscated by SandMark, plagiarisms heavily obfuscated by KlassMaster, programs obfuscated by Thicket, and executables obfuscated by Loco/Diablo.
UR - http://www.scopus.com/inward/record.url?scp=79959902968&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79959902968&partnerID=8YFLogxK
U2 - 10.1145/1985793.1985899
DO - 10.1145/1985793.1985899
M3 - Conference contribution
AN - SCOPUS:79959902968
SN - 9781450304450
T3 - Proceedings - International Conference on Software Engineering
SP - 756
EP - 765
BT - ICSE 2011 - 33rd International Conference on Software Engineering, Proceedings of the Conference
T2 - 33rd International Conference on Software Engineering, ICSE 2011
Y2 - 21 May 2011 through 28 May 2011
ER -