Author Homepage Discovery in CiteSeerX

Krutarth Patel, Cornelia Caragea, Doina Caragea, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceedingConference contribution


Scholarly digital libraries provide access to scientific publications and comprise useful resources for researchers. CiteSeerX is one such digital library search engine that provides access to more than 10 million academic documents. We propose a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library including CiteSeerX to crawl scientific documents. Precisely, we integrate Web search and classification in a unified approach to discover new homepages: first, we use publicly-available author names and research paper titles as queries to a Web search engine to find relevant content, and then we identify the correct homepages from the search results using a powerful deep learning classifier based on Convolutional Neural Networks. Moreover, we use Self-Training in order to reduce the labeling effort and to utilize the unlabeled data to train the efficient researcher homepage classifier. Our experiments on a large scale dataset highlight the effectiveness of our approach, and position Web search as an effective method for acquiring authors' homepages. We show the development and deployment of the proposed approach in CiteSeerX and the maintenance requirements.

Original languageEnglish (US)
Title of host publication35th AAAI Conference on Artificial Intelligence, AAAI 2021
PublisherAssociation for the Advancement of Artificial Intelligence
Number of pages10
ISBN (Electronic)9781713835974
StatePublished - 2021
Event35th AAAI Conference on Artificial Intelligence, AAAI 2021 - Virtual, Online
Duration: Feb 2 2021Feb 9 2021

Publication series

Name35th AAAI Conference on Artificial Intelligence, AAAI 2021


Conference35th AAAI Conference on Artificial Intelligence, AAAI 2021
CityVirtual, Online

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence


Dive into the research topics of 'Author Homepage Discovery in CiteSeerX'. Together they form a unique fingerprint.

Cite this