CiteSeerX: AI in a digital library search engine

Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Suppawong Tuarob, Alexander Ororbia, Douglas Jordan, Prasenjit Mitra, C. Lee Giles

Research output: Contribution to journal, Article, peer-review

44 Scopus citations


CiteSeerX is a digital library search engine that provides access to more than 5 million scholarly documents, with nearly a million users and millions of hits per day. We present key AI technologies used in the following components: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5-6 years. We show the usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. We also present AI technologies implemented in table search and algorithm search, which are special search modes in CiteSeerX. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and search engines.

Original language: English (US)
Pages (from-to): 35-48
Number of pages: 14
Journal: AI Magazine
Issue number: 3
State: Published - Sep 1 2015

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence

