TY - GEN
T1 - Building an Accessible, Usable, Scalable, and Sustainable Service for Scholarly Big Data
AU - Wu, Jian
AU - Rohatgi, Shaurya
AU - Reddy Keesara, Sai Raghav
AU - Chhay, Jason
AU - Kuo, Kevin
AU - Menon, Arjun Manoj
AU - Parsons, Sean
AU - Urgaonkar, Bhuvan
AU - Giles, C. Lee
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Since the emergence of scholarly big data, there have been several efforts to build web-based services such as digital library search engines (DLSEs). However, much of the design and specification of an accessible, usable, scalable, and sustainable DLSE has not been well represented or discussed in the literature. We argue that these four characteristics are essential to providing a high-quality service for scholarly big data from both the user's and the developer's perspectives. This paper reviews the design, implementation, and operational experience of CiteSeerX, a real-world digital library search engine, and the lessons learned from it. We analyze the strengths and weaknesses of the current design and propose a new design with a revised architecture and enhanced hardware and software infrastructure. The alpha version of the new design has been implemented and tested. The new system replaces MySQL and Apache Solr with a single instance of Elasticsearch, which plays a dual role of data storage and search. Another major improvement is the integration of extraction and ingestion, which significantly boosts document ingestion speed. The web application is re-engineered to enhance the user experience by applying a learning-to-rank model and offering more refined search tools. The system is also improved in many other respects. We believe the design considerations and experience can benefit researchers and engineers who plan, design, and upgrade future systems of comparable scale and functionality.
UR - http://www.scopus.com/inward/record.url?scp=85125360466&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85125360466&partnerID=8YFLogxK
U2 - 10.1109/BigData52589.2021.9671612
DO - 10.1109/BigData52589.2021.9671612
M3 - Conference contribution
AN - SCOPUS:85125360466
T3 - Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021
SP - 141
EP - 152
BT - Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021
A2 - Chen, Yixin
A2 - Ludwig, Heiko
A2 - Tu, Yicheng
A2 - Fayyad, Usama
A2 - Zhu, Xingquan
A2 - Hu, Xiaohua Tony
A2 - Byna, Suren
A2 - Liu, Xiong
A2 - Zhang, Jianping
A2 - Pan, Shirui
A2 - Papalexakis, Vagelis
A2 - Wang, Jianwu
A2 - Cuzzocrea, Alfredo
A2 - Ordonez, Carlos
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE International Conference on Big Data, Big Data 2021
Y2 - 15 December 2021 through 18 December 2021
ER -