TY - JOUR
T1 - Fine-grained compiler identification with sequence-oriented neural modeling
AU - Tian, Zhenzhou
AU - Huang, Yaqian
AU - Xie, Borun
AU - Chen, Yanping
AU - Chen, Lingwei
AU - Wu, Dinghao
N1 - Publisher Copyright:
© 2021 American Institute of Physics Inc. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Different compilers and optimization levels can be used to compile source code. Recovered by reverse engineering the produced binaries, these compiler details facilitate essential binary analysis tasks, such as malware analysis and software forensics. Most existing approaches adopt a signature-matching-based or machine-learning-based strategy to identify the compiler details, and are limited in either detection accuracy or granularity. In this work, we propose NeuralCI (Neural modeling-based Compiler Identification) to infer these compiler details, including compiler family, optimization level, and compiler version, for individual functions. The basic idea is to apply sequence-oriented neural networks to normalized instruction sequences generated using a lightweight function abstraction strategy. To evaluate the performance of NeuralCI, we construct a large dataset consisting of 854,858 unique functions collected from 19 widely used real-world projects. The experiments show that NeuralCI achieves, on average, 98.6% accuracy in identifying the compiler family, 95.3% accuracy in identifying the optimization level, 88.7% accuracy in identifying the compiler version, 94.8% accuracy in identifying the compiler family and optimization level, and 83.0% accuracy in identifying all compiler components simultaneously, outperforming existing function-level compiler identification methods in terms of both detection accuracy and comprehensiveness.
AB - Different compilers and optimization levels can be used to compile source code. Recovered by reverse engineering the produced binaries, these compiler details facilitate essential binary analysis tasks, such as malware analysis and software forensics. Most existing approaches adopt a signature-matching-based or machine-learning-based strategy to identify the compiler details, and are limited in either detection accuracy or granularity. In this work, we propose NeuralCI (Neural modeling-based Compiler Identification) to infer these compiler details, including compiler family, optimization level, and compiler version, for individual functions. The basic idea is to apply sequence-oriented neural networks to normalized instruction sequences generated using a lightweight function abstraction strategy. To evaluate the performance of NeuralCI, we construct a large dataset consisting of 854,858 unique functions collected from 19 widely used real-world projects. The experiments show that NeuralCI achieves, on average, 98.6% accuracy in identifying the compiler family, 95.3% accuracy in identifying the optimization level, 88.7% accuracy in identifying the compiler version, 94.8% accuracy in identifying the compiler family and optimization level, and 83.0% accuracy in identifying all compiler components simultaneously, outperforming existing function-level compiler identification methods in terms of both detection accuracy and comprehensiveness.
UR - http://www.scopus.com/inward/record.url?scp=85107202308&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107202308&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2021.3069227
DO - 10.1109/ACCESS.2021.3069227
M3 - Article
AN - SCOPUS:85107202308
SN - 2169-3536
VL - 9
SP - 49160
EP - 49175
JO - IEEE Access
JF - IEEE Access
M1 - 9388681
ER -