TY - GEN
T1 - Learning metadata from the evidence in an on-line citation matching scheme
AU - Councill, Isaac G.
AU - Li, Huajing
AU - Zhuang, Ziming
AU - Debnath, Sandip
AU - Bolelli, Levent
AU - Lee, Wang Chien
AU - Sivasubramaniam, Anand
AU - Giles, C. Lee
N1 - Funding Information:
I am much indebted to Dr. Patricia Mather and an anonymous reviewer for valuable comments on the manuscript. This work was partly supported by personal grant No. 95-04-1113a from the Russian Foundation of the Fundamental Researches.
PY - 2006
Y1 - 2006
AB - Citation matching, or the automatic grouping of bibliographic references that refer to the same document, is a data management problem faced by automatic digital libraries for scientific literature such as CiteSeer and Google Scholar. Although several solutions have been offered for citation matching in large bibliographic databases, these solutions typically require expensive batch clustering operations that must be run offline. Large digital libraries containing citation information can reduce maintenance costs and provide new services through efficient online processing of citation data, resolving document citation relationships as new records become available. Additionally, information found in citations can be used to supplement document metadata, requiring the generation of a canonical citation record from merging variant citation subfields into a unified "best guess" from which to draw information. Citation information must be merged with other information sources in order to provide a complete document record. This paper outlines a system and algorithms for online citation matching and canonical metadata generation. A Bayesian framework is employed to build the ideal citation record for a document that carries the added advantages of fusing information from disparate sources and increasing system resilience to erroneous data.
UR - http://www.scopus.com/inward/record.url?scp=34247205876&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34247205876&partnerID=8YFLogxK
U2 - 10.1145/1141753.1141817
DO - 10.1145/1141753.1141817
M3 - Conference contribution
AN - SCOPUS:34247205876
SN - 1595933549
SN - 9781595933546
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 276
EP - 285
BT - 6th ACM/IEEE-CS Joint Conference on Digital Libraries 2006
T2 - 6th ACM/IEEE-CS Joint Conference on Digital Libraries 2006: Opening Information Horizons, JCDL '06
Y2 - 11 June 2006 through 15 June 2006
ER -