TY - GEN
T1 - PDFMEF
T2 - 8th International Conference on Knowledge Capture, K-CAP 2015
AU - Wu, Jian
AU - Killian, Jason
AU - Yang, Huaiyu
AU - Williams, Kyle
AU - Choudhury, Sagnik Ray
AU - Tuarob, Suppawong
AU - Caragea, Cornelia
AU - Giles, C. Lee
N1 - Publisher Copyright: © 2015 ACM.
PY - 2015/10/7
Y1 - 2015/10/7
N2 - We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in the PDF format. It is implemented with a framework that encapsulates open-source extraction tools. Currently, it leverages PDFBox and TET for full text extraction, the scholarly document filter described in [5] for document classification, GROBID for header extraction, ParsCit for citation extraction, PDFFigures for figure and table extraction, and algorithm extraction [27]. While it can be run as a whole, the extraction tool in each module is highly customizable. Users can substitute default extractors with other extraction tools they prefer by writing a thin wrapper to implement the abstracts. The framework is designed to be scalable and is capable of running in parallel using a multi-processing technique in Python. Experiments indicate that the system with default setups is CPU bound and leaves a small memory footprint, which makes it best suited to run on a multi-core machine. The best performance, using a dedicated server of 16 cores, takes 1.3 seconds on average to process one PDF document. It is used to index extracted information and help users to quickly locate relevant results in published scholarly documents and to efficiently construct a large knowledge base in order to build a semantic scholarly search engine. Part of it is running on the CiteSeerX digital library search engine.
AB - We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in the PDF format. It is implemented with a framework that encapsulates open-source extraction tools. Currently, it leverages PDFBox and TET for full text extraction, the scholarly document filter described in [5] for document classification, GROBID for header extraction, ParsCit for citation extraction, PDFFigures for figure and table extraction, and algorithm extraction [27]. While it can be run as a whole, the extraction tool in each module is highly customizable. Users can substitute default extractors with other extraction tools they prefer by writing a thin wrapper to implement the abstracts. The framework is designed to be scalable and is capable of running in parallel using a multi-processing technique in Python. Experiments indicate that the system with default setups is CPU bound and leaves a small memory footprint, which makes it best suited to run on a multi-core machine. The best performance, using a dedicated server of 16 cores, takes 1.3 seconds on average to process one PDF document. It is used to index extracted information and help users to quickly locate relevant results in published scholarly documents and to efficiently construct a large knowledge base in order to build a semantic scholarly search engine. Part of it is running on the CiteSeerX digital library search engine.
UR - http://www.scopus.com/inward/record.url?scp=84997208457&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84997208457&partnerID=8YFLogxK
U2 - 10.1145/2815833.2815834
DO - 10.1145/2815833.2815834
M3 - Conference contribution
AN - SCOPUS:84997208457
T3 - Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015
BT - Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015
PB - Association for Computing Machinery, Inc
Y2 - 7 October 2015 through 10 October 2015
ER -