Extracting researcher metadata with labeled features

Sujatha Das Gollapalli, Yanjun Qi, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

3 Scopus citations


Professional homepages of researchers contain metadata that provides crucial evidence for several digital library tasks, such as academic network extraction, record linkage, and expertise search. Because the values of certain metadata fields (e.g., affiliation) are inherently diverse, supervised algorithms require a large number of labeled examples to identify values for these fields accurately. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with a few fully labeled examples. We study two types of labeled features: (1) dictionary features, which provide unigram hints related to specific metadata fields, and (2) proximity features, which capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach, together with labeled features, provides significant improvements in tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improved by 9%.
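
As a rough illustration of what a dictionary-style labeled feature might look like, the sketch below maps a few unigrams to expert-specified distributions over metadata fields and uses them to vote on a tag for a line of homepage text. Everything in it is an assumption made for illustration: the words, the probabilities, the field names, and the helpers `score_line` and `tag_line` are hypothetical and do not come from the paper. The voting rule is a deliberate simplification, not the semi-supervised training procedure or the second-stage proximity features that the abstract describes.

```python
from collections import defaultdict

# Hypothetical labeled features: each unigram maps to an expert-specified
# distribution over metadata fields. Words and probabilities are illustrative.
LABELED_FEATURES = {
    "university": {"affiliation": 0.9, "other": 0.1},
    "department": {"affiliation": 0.8, "other": 0.2},
    "professor": {"position": 0.9, "other": 0.1},
    "email": {"email": 0.9, "other": 0.1},
}

# Hypothetical field inventory; the paper's actual tag set is not given here.
FIELDS = ("affiliation", "position", "email", "other")


def score_line(line):
    """Average the expert distributions of the dictionary words found in a line."""
    totals = defaultdict(float)
    hits = 0
    for token in line.lower().split():
        token = token.strip(".,:;()")
        dist = LABELED_FEATURES.get(token)
        if dist is None:
            continue  # token carries no expert hint
        hits += 1
        for field, prob in dist.items():
            totals[field] += prob
    if hits == 0:
        # No dictionary word matched: fall back to the catch-all field.
        return {"other": 1.0}
    return {field: totals[field] / hits for field in FIELDS if totals[field] > 0}


def tag_line(line):
    """Return the field with the highest averaged score (simple voting rule)."""
    scores = score_line(line)
    return max(scores, key=scores.get)
```

For instance, `tag_line("Department of Computer Science, Example University")` matches two dictionary words that both favor the affiliation field. In a feature-labeling setup, distributions like these would typically guide model training alongside the few labeled homepages rather than being applied directly as rules.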

Original language: English (US)
Title of host publication: SIAM International Conference on Data Mining 2014, SDM 2014
Editors: Mohammed Zaki, Zoran Obradovic, Pang-Ning Tan, Arindam Banerjee, Chandrika Kamath, Srinivasan Parthasarathy
Publisher: Society for Industrial and Applied Mathematics Publications
Number of pages: 9
ISBN (Electronic): 9781510811515
State: Published - 2014
Event: 14th SIAM International Conference on Data Mining, SDM 2014 - Philadelphia, United States
Duration: Apr 24 2014 to Apr 26 2014

Publication series

Name: SIAM International Conference on Data Mining 2014, SDM 2014


Other: 14th SIAM International Conference on Data Mining, SDM 2014
Country/Territory: United States

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Software

