TY - GEN
T1 - Scholarly big data information extraction and integration in the CiteSeerχ digital library
AU - Williams, Kyle
AU - Wu, Jian
AU - Choudhury, Sagnik Ray
AU - Khabsa, Madian
AU - Giles, C. Lee
PY - 2014
Y1 - 2014
N2 - CiteSeerχ is a digital library that contains approximately 3.5 million scholarly documents and receives between 2 and 4 million requests per day. In addition to making documents available via a public Website, the data is also used to facilitate research in areas like citation analysis, co-author network analysis, scalability evaluation and information extraction. The papers in CiteSeerχ are gathered from the Web by means of continuous automatic focused crawling and go through a series of automatic processing steps as part of the ingestion process. Given the size of the collection, the fact that it is constantly expanding, and the multiple ways in which it is used both by the public to access scholarly documents and for research, there are several big data challenges. In this paper, we provide a case study description of how we address these challenges when it comes to information extraction, data integration and entity linking in CiteSeerχ. We describe how we: aggregate data from multiple sources on the Web; store and manage data; process data as part of an automatic ingestion pipeline that includes automatic metadata and information extraction; perform document and citation clustering; perform entity linking and name disambiguation; and make our data and source code available to enable research and collaboration.
AB - CiteSeerχ is a digital library that contains approximately 3.5 million scholarly documents and receives between 2 and 4 million requests per day. In addition to making documents available via a public Website, the data is also used to facilitate research in areas like citation analysis, co-author network analysis, scalability evaluation and information extraction. The papers in CiteSeerχ are gathered from the Web by means of continuous automatic focused crawling and go through a series of automatic processing steps as part of the ingestion process. Given the size of the collection, the fact that it is constantly expanding, and the multiple ways in which it is used both by the public to access scholarly documents and for research, there are several big data challenges. In this paper, we provide a case study description of how we address these challenges when it comes to information extraction, data integration and entity linking in CiteSeerχ. We describe how we: aggregate data from multiple sources on the Web; store and manage data; process data as part of an automatic ingestion pipeline that includes automatic metadata and information extraction; perform document and citation clustering; perform entity linking and name disambiguation; and make our data and source code available to enable research and collaboration.
UR - http://www.scopus.com/inward/record.url?scp=84901755944&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84901755944&partnerID=8YFLogxK
U2 - 10.1109/ICDEW.2014.6818305
DO - 10.1109/ICDEW.2014.6818305
M3 - Conference contribution
AN - SCOPUS:84901755944
SN - 9781479934805
T3 - Proceedings - International Conference on Data Engineering
SP - 68
EP - 73
BT - 2014 IEEE 30th International Conference on Data Engineering Workshops, ICDEW 2014
PB - IEEE Computer Society
T2 - 2014 IEEE 30th International Conference on Data Engineering Workshops, ICDEW 2014
Y2 - 31 March 2014 through 4 April 2014
ER -