TY - GEN
T1 - Extracting researcher metadata with labeled features
AU - Das Gollapalli, Sujatha
AU - Qi, Yanjun
AU - Mitra, Prasenjit
AU - Giles, C. Lee
N1 - Publisher Copyright: Copyright © SIAM.
PY - 2014
Y1 - 2014
N2 - Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage and expertise search. Due to inherent diversity in values for certain metadata fields (e.g., affiliation), supervised algorithms require a large number of labeled examples for accurately identifying values for these fields. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with few fully-labeled examples. We study two types of labeled features: (1) Dictionary features provide unigram hints related to specific metadata fields, whereas (2) Proximity features capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach along with labeled features provides significant improvements in the tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improves by 9%.
AB - Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage and expertise search. Due to inherent diversity in values for certain metadata fields (e.g., affiliation), supervised algorithms require a large number of labeled examples for accurately identifying values for these fields. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with few fully-labeled examples. We study two types of labeled features: (1) Dictionary features provide unigram hints related to specific metadata fields, whereas (2) Proximity features capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach along with labeled features provides significant improvements in the tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improves by 9%.
UR - http://www.scopus.com/inward/record.url?scp=84959872994&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959872994&partnerID=8YFLogxK
U2 - 10.1137/1.9781611973440.85
DO - 10.1137/1.9781611973440.85
M3 - Conference contribution
AN - SCOPUS:84959872994
T3 - SIAM International Conference on Data Mining 2014, SDM 2014
SP - 740
EP - 748
BT - SIAM International Conference on Data Mining 2014, SDM 2014
A2 - Zaki, Mohammed
A2 - Obradovic, Zoran
A2 - Tan, Pang-Ning
A2 - Banerjee, Arindam
A2 - Kamath, Chandrika
A2 - Parthasarathy, Srinivasan
PB - Society for Industrial and Applied Mathematics Publications
T2 - 14th SIAM International Conference on Data Mining, SDM 2014
Y2 - 24 April 2014 through 26 April 2014
ER -