TY - GEN
T1 - Scholarly Big Data Quality Assessment
T2 - 22nd ACM Symposium on Document Engineering, DocEng 2022
AU - Wu, Jian
AU - Hiltabrand, Ryan
AU - Soós, Dominik
AU - Giles, C. Lee
N1 - Funding Information:
We gratefully acknowledge partial support from the National Science Foundation (Award #1823288).
Publisher Copyright:
© 2022 Owner/Author.
PY - 2022/9/20
Y1 - 2022/9/20
N2 - Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata. The metadata quality could impact downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies depending on subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation.
UR - http://www.scopus.com/inward/record.url?scp=85143129169&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143129169&partnerID=8YFLogxK
U2 - 10.1145/3558100.3563850
DO - 10.1145/3558100.3563850
M3 - Conference contribution
AN - SCOPUS:85143129169
T3 - DocEng 2022 - Proceedings of the 2022 ACM Symposium on Document Engineering
BT - DocEng 2022 - Proceedings of the 2022 ACM Symposium on Document Engineering
PB - Association for Computing Machinery, Inc
Y2 - 20 September 2022 through 23 September 2022
ER -