TY - GEN
T1 - Automatic knowledge base construction from scholarly documents
AU - Al-Zaidy, Rabah A.
AU - Giles, C. Lee
PY - 2017/8/31
Y1 - 2017/8/31
N2 - The continuing growth of published scholarly content on the web ensures the availability of the most recent scientific findings to researchers. Scholarly documents, such as research articles, are easily accessed by using academic search engines that are built on large repositories of scholarly documents. Scientific information extraction from documents into a structured knowledge graph representation facilitates automated machine understanding of a document's content. Traditional information extraction approaches, that either require training samples or a preexisting knowledge base to assist in the extraction, can be challenging when applied to large repositories of digital documents. Labeled training examples for such large scale are difficult to obtain for such datasets. Also, most available knowledge bases are built from web data and do not have sufficient coverage to include concepts found in scientific articles. In this paper we aim to construct a knowledge graph from scholarly documents while addressing both these issues. We propose a fully automatic, unsupervised system for scientific information extraction that does not build on an existing knowledge base and avoids manually-tagged training data. We describe and evaluate a constructed taxonomy that contains over 15k entities resulting from applying our approach to 10k documents.
AB - The continuing growth of published scholarly content on the web ensures the availability of the most recent scientific findings to researchers. Scholarly documents, such as research articles, are easily accessed by using academic search engines that are built on large repositories of scholarly documents. Scientific information extraction from documents into a structured knowledge graph representation facilitates automated machine understanding of a document's content. Traditional information extraction approaches, that either require training samples or a preexisting knowledge base to assist in the extraction, can be challenging when applied to large repositories of digital documents. Labeled training examples for such large scale are difficult to obtain for such datasets. Also, most available knowledge bases are built from web data and do not have sufficient coverage to include concepts found in scientific articles. In this paper we aim to construct a knowledge graph from scholarly documents while addressing both these issues. We propose a fully automatic, unsupervised system for scientific information extraction that does not build on an existing knowledge base and avoids manually-tagged training data. We describe and evaluate a constructed taxonomy that contains over 15k entities resulting from applying our approach to 10k documents.
UR - http://www.scopus.com/inward/record.url?scp=85030555381&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85030555381&partnerID=8YFLogxK
U2 - 10.1145/3103010.3121043
DO - 10.1145/3103010.3121043
M3 - Conference contribution
AN - SCOPUS:85030555381
T3 - DocEng 2017 - Proceedings of the 2017 ACM Symposium on Document Engineering
SP - 149
EP - 152
BT - DocEng 2017 - Proceedings of the 2017 ACM Symposium on Document Engineering
PB - Association for Computing Machinery, Inc
T2 - 17th ACM Symposium on Document Engineering, DocEng 2017
Y2 - 4 September 2017 through 7 September 2017
ER -