TY - GEN
T1 - Web crawler middleware for search engine digital libraries
T2 - 12th ACM International Workshop on Web Information and Data Management, WIDM 2012 - Co-located with CIKM 2012
AU - Wu, Jian
AU - Teregowda, Pradeep
AU - Khabsa, Madian
AU - Carman, Stephen
AU - Jordan, Douglas
AU - Wandelmer, Jose San Pedro
AU - Lu, Xin
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2012
Y1 - 2012
N2 - Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata into the digital library CiteSeerX crawl repository and database. This middleware is designed to be extensible, as it provides a universal interface to the crawl database. It is designed to support input from multiple open-source crawlers and archival formats, e.g., ARC, WARC. It can also import files downloaded via FTP. To use this middleware for another crawler, the user only needs to write a new log parser that returns a resource object with the standard metadata attributes and tells the middleware how to access downloaded files. When importing documents, users can specify document MIME types and obtain text extracted from PDF/PostScript documents. The middleware can adaptively identify academic research papers based on document context features. We developed a web user interface where the user can submit importing jobs. The middleware package can also perform supplemental jobs related to the crawl database and repository. Though designed for the CiteSeerX search engine, we feel this design would be appropriate for many search engine web crawling systems.
AB - Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata into the digital library CiteSeerX crawl repository and database. This middleware is designed to be extensible, as it provides a universal interface to the crawl database. It is designed to support input from multiple open-source crawlers and archival formats, e.g., ARC, WARC. It can also import files downloaded via FTP. To use this middleware for another crawler, the user only needs to write a new log parser that returns a resource object with the standard metadata attributes and tells the middleware how to access downloaded files. When importing documents, users can specify document MIME types and obtain text extracted from PDF/PostScript documents. The middleware can adaptively identify academic research papers based on document context features. We developed a web user interface where the user can submit importing jobs. The middleware package can also perform supplemental jobs related to the crawl database and repository. Though designed for the CiteSeerX search engine, we feel this design would be appropriate for many search engine web crawling systems.
UR - http://www.scopus.com/inward/record.url?scp=84870493887&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84870493887&partnerID=8YFLogxK
U2 - 10.1145/2389936.2389949
DO - 10.1145/2389936.2389949
M3 - Conference contribution
AN - SCOPUS:84870493887
SN - 9781450317207
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 57
EP - 64
BT - WIDM'12 - Proceedings of the 12th ACM International Workshop on Web Information and Data Management, Co-located with CIKM 2012
Y2 - 2 November 2012 through 2 November 2012
ER -