TY - JOUR
T1 - Sec-lib
T2 - Protecting scholarly digital libraries from infected papers using active machine learning framework
AU - Nissim, Nir
AU - Cohen, Aviad
AU - Wu, Jian
AU - Lanzi, Andrea
AU - Rokach, Lior
AU - Elovici, Yuval
AU - Giles, Lee
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2019
Y1 - 2019
N2 - Researchers from academia and the corporate-sector rely on scholarly digital libraries to access articles. Attackers take advantage of innocent users who consider the articles' files safe and thus open PDF-files with little concern. In addition, researchers consider scholarly libraries a reliable, trusted, and untainted corpus of papers. For these reasons, scholarly digital libraries are an attractive-Target and inadvertently support the proliferation of cyber-Attacks launched via malicious PDF-files. In this study, we present related vulnerabilities and malware distribution approaches that exploit the vulnerabilities of scholarly digital libraries. We evaluated over two-million scholarly papers in the CiteSeerX library and found the library to be contaminated with a surprisingly large number (0.3-2%) of malicious PDF documents (over 55% were crawled from the IPs of US-universities). We developed a two layered detection framework aimed at enhancing the detection of malicious PDF documents, Sec-Lib, which offers a security solution for large digital libraries. Sec-Lib includes a deterministic layer for detecting known malware, and a machine learning based layer for detecting unknown malware. Our evaluation showed that scholarly digital libraries can detect 96.9% of malware with Sec-Lib, while minimizing the number of PDF-files requiring labeling, and thus reducing the manual inspection efforts of security-experts by 98%.
AB - Researchers from academia and the corporate-sector rely on scholarly digital libraries to access articles. Attackers take advantage of innocent users who consider the articles' files safe and thus open PDF-files with little concern. In addition, researchers consider scholarly libraries a reliable, trusted, and untainted corpus of papers. For these reasons, scholarly digital libraries are an attractive-Target and inadvertently support the proliferation of cyber-Attacks launched via malicious PDF-files. In this study, we present related vulnerabilities and malware distribution approaches that exploit the vulnerabilities of scholarly digital libraries. We evaluated over two-million scholarly papers in the CiteSeerX library and found the library to be contaminated with a surprisingly large number (0.3-2%) of malicious PDF documents (over 55% were crawled from the IPs of US-universities). We developed a two layered detection framework aimed at enhancing the detection of malicious PDF documents, Sec-Lib, which offers a security solution for large digital libraries. Sec-Lib includes a deterministic layer for detecting known malware, and a machine learning based layer for detecting unknown malware. Our evaluation showed that scholarly digital libraries can detect 96.9% of malware with Sec-Lib, while minimizing the number of PDF-files requiring labeling, and thus reducing the manual inspection efforts of security-experts by 98%.
UR - http://www.scopus.com/inward/record.url?scp=85075210352&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075210352&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2019.2933197
DO - 10.1109/ACCESS.2019.2933197
M3 - Article
AN - SCOPUS:85075210352
SN - 2169-3536
VL - 7
SP - 110050
EP - 110073
JO - IEEE Access
JF - IEEE Access
M1 - 8788686
ER -