Researcher homepage classification using unlabeled data

Sujatha Das G, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

24 Scopus citations

Abstract

A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changed content on the Web? We investigate this question in the context of researcher homepage crawling. We show experimentally that classifiers trained on existing datasets for homepage identification underperform while classifying "irrelevant" pages on current-day academic websites. As an alternative to obtaining datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain remarkable improvements in accurately identifying homepages from the current-day university websites. In addition, we propose a novel technique for "learning a conforming pair of classifiers" using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make "similar" predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in the absence of a validation set. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
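The idea of "learning a conforming pair of classifiers" can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact loss, so the code below assumes two logistic-regression models (one per view), a squared difference between their predictions on unlabeled pairs as the disagreement term, and an ordinary log-loss on a few labeled examples. The function names, the weight `lam`, and the synthetic two-feature "URL" and "content" views are all hypothetical stand-ins chosen for illustration.

```python
import math
import random


def sigmoid(z):
    # Clip the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Probability of the positive ('homepage') class under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def disagreement(w_url, w_txt, unlabeled):
    """Mean squared difference between the two views' predictions (assumed loss)."""
    total = sum((predict(w_url, u) - predict(w_txt, t)) ** 2 for u, t in unlabeled)
    return total / len(unlabeled)


def train_conforming_pair(labeled, unlabeled, lam=1.0, lr=0.5,
                          steps=300, batch_size=8, seed=0):
    """Mini-batch gradient descent on log-loss + lam * disagreement."""
    rng = random.Random(seed)
    w_url = [0.0] * len(labeled[0][0])
    w_txt = [0.0] * len(labeled[0][1])
    for _ in range(steps):
        batch_l = rng.sample(labeled, min(batch_size, len(labeled)))
        batch_u = rng.sample(unlabeled, min(batch_size, len(unlabeled)))
        g_url = [0.0] * len(w_url)
        g_txt = [0.0] * len(w_txt)
        # Supervised part: the log-loss gradient for each view is (p - y) * x.
        for u, t, y in batch_l:
            e_url = (predict(w_url, u) - y) / len(batch_l)
            e_txt = (predict(w_txt, t) - y) / len(batch_l)
            for j, uj in enumerate(u):
                g_url[j] += e_url * uj
            for j, tj in enumerate(t):
                g_txt[j] += e_txt * tj
        # Unsupervised part: gradient of lam * (p_url - p_txt)^2 on unlabeled pairs.
        for u, t in batch_u:
            p_url, p_txt = predict(w_url, u), predict(w_txt, t)
            d = 2.0 * lam * (p_url - p_txt) / len(batch_u)
            for j, uj in enumerate(u):
                g_url[j] += d * p_url * (1.0 - p_url) * uj
            for j, tj in enumerate(t):
                g_txt[j] -= d * p_txt * (1.0 - p_txt) * tj
        w_url = [w - lr * g for w, g in zip(w_url, g_url)]
        w_txt = [w - lr * g for w, g in zip(w_txt, g_txt)]
    return w_url, w_txt


def make_example(rng, y):
    # Synthetic stand-in for real features: [bias, one noisy signal] per view.
    s = 1.0 if y == 1 else -1.0
    return [1.0, s + rng.gauss(0, 0.3)], [1.0, s + rng.gauss(0, 0.3)], y


rng = random.Random(42)
labeled = [make_example(rng, i % 2) for i in range(10)]
held_out = [make_example(rng, i % 2) for i in range(200)]
unlabeled = [(u, t) for u, t, _ in held_out]  # labels are hidden from training

w_url, w_txt = train_conforming_pair(labeled, unlabeled)
acc_url = sum((predict(w_url, u) > 0.5) == (y == 1) for u, _, y in held_out) / len(held_out)
acc_txt = sum((predict(w_txt, t) > 0.5) == (y == 1) for _, t, y in held_out) / len(held_out)
print("URL-view accuracy:", acc_url)
print("content-view accuracy:", acc_txt)
print("disagreement on unlabeled data:", disagreement(w_url, w_txt, unlabeled))
```

In this sketch the disagreement value can be monitored without any labels, which mirrors the abstract's point that such a loss remains usable when no validation set is available.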

Original language: English (US)
Title of host publication: WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web
Pages: 471-481
Number of pages: 11
State: Published - 2013
Event: 22nd International Conference on World Wide Web, WWW 2013 - Rio de Janeiro, Brazil
Duration: May 13 2013 to May 17 2013

Publication series

Name: WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web

Other

Other: 22nd International Conference on World Wide Web, WWW 2013
Country/Territory: Brazil
City: Rio de Janeiro
Period: 5/13/13 to 5/17/13

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
