CiteSeerX data: Semanticizing scholarly papers

Jian Wu, Chen Liang, Huaiyu Yang, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Scopus citations

Abstract

Scholarly big data is, for many, an important instance of Big Data. Digital library search engines have been built to acquire, extract, and ingest large volumes of scholarly papers. This paper provides an overview of the scholarly big data released by CiteSeerX, as of the end of 2015, and discusses various aspects such as how the data is acquired, its size, general quality, data management, and accessibility. Preliminary results on extracting semantic entities from the body text of scholarly papers with Wikifier show biases towards general terms appearing in Wikipedia and against domain-specific terms. We argue that the latter will play a more important role in extracting important facts from scholarly papers.

Original language: English (US)
Title of host publication: Proceedings of the International Workshop on Semantic Big Data, SBD 2016, in conjunction with the 2016 ACM SIGMOD/PODS Conference
Editors: Le Gruenwald, Sven Groppe
Publisher: Association for Computing Machinery
ISBN (Print): 9781450342995
State: Published - Jun 26 2016
Event: 2016 International Workshop on Semantic Big Data, SBD 2016, in conjunction with the 2016 ACM SIGMOD/PODS Conference - San Francisco, United States
Duration: Jul 1 2016 → …

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Other

Other: 2016 International Workshop on Semantic Big Data, SBD 2016, in conjunction with the 2016 ACM SIGMOD/PODS Conference
Country/Territory: United States
City: San Francisco
Period: 7/1/16 → …

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
