Searching online book documents and analyzing book citations

Zhaohui Wu, Sujatha Das, Zhenhui Li, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Scopus citations


Academic search engines and digital libraries provide convenient online search and access facilities for scientific publications. However, most existing systems do not include books in their collections, even though several books are freely available online. Academic books differ from papers in length, content, and structure. We argue that accounting for academic books is important for understanding and assessing scientific impact. We introduce an open-book search engine that extracts and indexes metadata, contents, and bibliographies from online PDF book documents. To the best of our knowledge, no previous work presents a systematic study of building a search engine for books. We propose a hybrid approach for extracting the title and authors of a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge. For "table of contents" recognition, we propose rules that exploit multiple regularities in numbering and ordering. In addition, we study bibliography extraction and citation parsing for a large dataset of books. Finally, we use the multiple fields available in books to rank books in response to search queries. Our system can effectively extract metadata and contents from large collections of online books and provides efficient book search and retrieval facilities.
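The abstract does not spell out the table-of-contents rules, so the following is only an illustrative sketch of what "regularities in numbering and ordering" could look like in practice. The regular expression, the function names `parse_toc_line` and `looks_like_toc`, and the `min_entries` threshold are all assumptions made for this example, not the rules used in the paper.

```python
import re

# Hypothetical pattern (an assumption, not the paper's rule set):
# optional "Chapter", a dotted section number, a title, dot leaders
# or spaces, and a trailing page number.
TOC_LINE = re.compile(
    r"^\s*(?:chapter\s+)?(?P<num>\d+(?:\.\d+)*)\.?\s+"
    r"(?P<title>.+?)[\s.]*\s(?P<page>\d+)\s*$",
    re.IGNORECASE,
)


def parse_toc_line(line):
    """Return (section_number_tuple, title, page) or None if no match."""
    m = TOC_LINE.match(line)
    if m is None:
        return None
    number = tuple(int(part) for part in m.group("num").split("."))
    return number, m.group("title").strip(), int(m.group("page"))


def looks_like_toc(lines, min_entries=3):
    """Heuristic check: enough parsable entries, with section numbers
    strictly increasing and page numbers non-decreasing."""
    entries = [e for e in map(parse_toc_line, lines) if e is not None]
    if len(entries) < min_entries:
        return False
    numbers = [e[0] for e in entries]
    pages = [e[2] for e in entries]
    numbering_ok = all(a < b for a, b in zip(numbers, numbers[1:]))
    pages_ok = all(a <= b for a, b in zip(pages, pages[1:]))
    return numbering_ok and pages_ok
```

Comparing section numbers as integer tuples lets "1", "1.1", and "2" order naturally, which is one simple way to encode a numbering regularity. A real extractor would also need to handle Roman numerals, unnumbered front matter, and multi-line entries.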

Original language: English (US)
Title of host publication: DocEng 2013 - Proceedings of the 2013 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery
Number of pages: 10
ISBN (Print): 9781450317894
State: Published - 2013
Event: 2013 ACM Symposium on Document Engineering, DocEng 2013 - Florence, Italy
Duration: Sep 10 2013 → Sep 13 2013


All Science Journal Classification (ASJC) codes

  • Software

