TY - GEN
T1 - CiteSeerx
T2 - 36th European Conference on Information Retrieval, ECIR 2014
AU - Caragea, Cornelia
AU - Wu, Jian
AU - Ciobanu, Alina
AU - Williams, Kyle
AU - Fernández-Ramírez, Juan
AU - Chen, Hung Hsuan
AU - Wu, Zhaohui
AU - Giles, Lee
PY - 2014
Y1 - 2014
N2 - The CiteSeerx digital library stores and indexes research articles in Computer Science and related fields. Although its main purpose is to make it easier for researchers to search for scientific information, CiteSeerx has proven to be a powerful resource in many data mining, machine learning and information retrieval applications that use rich metadata, e.g., titles, abstracts, authors, venues, and reference lists. The metadata extraction in CiteSeerx is done using automated techniques. Although fairly accurate, these techniques still result in noisy metadata. Since the performance of models trained on these data depends heavily on the quality of the data, we propose an approach to CiteSeerx metadata cleaning that incorporates information from an external data source. The result is a subset of CiteSeerx that is substantially cleaner than the entire set. Our goal is to make the new dataset available to the research community to facilitate future work in Information Retrieval.
AB - The CiteSeerx digital library stores and indexes research articles in Computer Science and related fields. Although its main purpose is to make it easier for researchers to search for scientific information, CiteSeerx has proven to be a powerful resource in many data mining, machine learning and information retrieval applications that use rich metadata, e.g., titles, abstracts, authors, venues, and reference lists. The metadata extraction in CiteSeerx is done using automated techniques. Although fairly accurate, these techniques still result in noisy metadata. Since the performance of models trained on these data depends heavily on the quality of the data, we propose an approach to CiteSeerx metadata cleaning that incorporates information from an external data source. The result is a subset of CiteSeerx that is substantially cleaner than the entire set. Our goal is to make the new dataset available to the research community to facilitate future work in Information Retrieval.
UR - http://www.scopus.com/inward/record.url?scp=84899928992&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84899928992&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-06028-6_26
DO - 10.1007/978-3-319-06028-6_26
M3 - Conference contribution
AN - SCOPUS:84899928992
SN - 9783319060279
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 311
EP - 322
BT - Advances in Information Retrieval - 36th European Conference on IR Research, ECIR 2014, Proceedings
PB - Springer Verlag
Y2 - 13 April 2014 through 16 April 2014
ER -