TY - GEN
T1 - On identifying academic homepages for digital libraries
AU - Gollapalli, Sujatha Das
AU - Giles, C. Lee
AU - Mitra, Prasenjit
AU - Caragea, Cornelia
N1 - Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2011
Y1 - 2011
AB - Academic homepages are rich sources of information on scientific research and researchers. Most researchers provide information about themselves and links to their research publications on their homepages. In this study, we address the following questions related to academic homepages: (1) How many academic homepages are there on the web? (2) Can we accurately discriminate between academic homepages and other webpages? and (3) What information can be extracted about researchers from their homepages? To address the first question, we use mark-recapture techniques commonly employed in biometrics to estimate animal population sizes. Our results indicate that academic homepages comprise a small fraction of the Web, making automatic methods for discriminating them crucial. We study the performance of content-based features for classifying webpages. We propose the use of topic models for identifying content-based features for classification and show that a small set of LDA-based features outperforms term features selected using traditional techniques such as aggregate term frequencies or mutual information. Finally, we address the extraction of name and research-interest information from an academic homepage. Term-topic associations obtained from topic models are used to design a novel, unsupervised technique to identify short segments corresponding to the research interests that researchers specify on their academic homepages. We show the efficacy of our proposed methods on all three tasks by experimentally evaluating them on multiple publicly available datasets.
UR - http://www.scopus.com/inward/record.url?scp=79960522068&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79960522068&partnerID=8YFLogxK
U2 - 10.1145/1998076.1998099
DO - 10.1145/1998076.1998099
M3 - Conference contribution
AN - SCOPUS:79960522068
SN - 9781450307444
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 123
EP - 132
BT - JCDL'11 - Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries
T2 - 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL'11
Y2 - 13 June 2011 through 17 June 2011
ER -