TY - GEN
T1 - Searching online book documents and analyzing book citations
AU - Wu, Zhaohui
AU - Das, Sujatha
AU - Li, Zhenhui
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2013
Y1 - 2013
N2 - Academic search engines and digital libraries provide convenient online search and access facilities for scientific publications. However, most existing systems do not include books in their collections, although several books are freely available online. Academic books differ from papers in their length, contents, and structure. We argue that accounting for academic books is important in understanding and assessing scientific impact. We introduce an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents. To the best of our knowledge, no previous work presents a systematic study of building a search engine for books. We propose a hybrid approach for extracting the title and authors from a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge. For "table of contents" recognition, we propose rules that exploit multiple regularities in numbering and ordering. In addition, we study bibliography extraction and citation parsing for a large dataset of books. Finally, we use the multiple fields available in books to rank books in response to search queries. Our system can effectively extract metadata and contents from large collections of online books and provides efficient book search and retrieval facilities.
AB - Academic search engines and digital libraries provide convenient online search and access facilities for scientific publications. However, most existing systems do not include books in their collections, although several books are freely available online. Academic books differ from papers in their length, contents, and structure. We argue that accounting for academic books is important in understanding and assessing scientific impact. We introduce an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents. To the best of our knowledge, no previous work presents a systematic study of building a search engine for books. We propose a hybrid approach for extracting the title and authors from a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge. For "table of contents" recognition, we propose rules that exploit multiple regularities in numbering and ordering. In addition, we study bibliography extraction and citation parsing for a large dataset of books. Finally, we use the multiple fields available in books to rank books in response to search queries. Our system can effectively extract metadata and contents from large collections of online books and provides efficient book search and retrieval facilities.
UR - http://www.scopus.com/inward/record.url?scp=84887328647&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84887328647&partnerID=8YFLogxK
U2 - 10.1145/2494266.2494282
DO - 10.1145/2494266.2494282
M3 - Conference contribution
AN - SCOPUS:84887328647
SN - 9781450317894
T3 - DocEng 2013 - Proceedings of the 2013 ACM Symposium on Document Engineering
SP - 81
EP - 90
BT - DocEng 2013 - Proceedings of the 2013 ACM Symposium on Document Engineering
PB - Association for Computing Machinery
T2 - 2013 ACM Symposium on Document Engineering, DocEng 2013
Y2 - 10 September 2013 through 13 September 2013
ER -