An architecture for information extraction from figures in digital libraries

  • Sagnik Ray Choudhury
  • C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Scholarly documents contain multiple figures representing experimental findings. These figures are generated from data that is not reported anywhere else in the paper. We propose a modular architecture for analyzing such figures. Our architecture consists of the following modules: 1. an extractor for figures and associated metadata (figure captions and mentions) from PDF documents; 2. a search engine on the extracted figures and metadata; 3. an image processing module for automated data extraction from the figures; and 4. a natural language processing module to understand the semantics of the figure. We discuss the challenges in each step, and report an extractor algorithm to extract vector graphics from scholarly documents and a specification algorithm for figures. Our extractor algorithm improves the state of the art by more than 10%, and the specification process is very scalable, yet achieves 85% accuracy. We also describe a semi-automatic system for data extraction from figures which is integrated with our search engine to improve user experience.
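As a rough illustration of how the first two modules might fit together, the sketch below models an extracted figure record (caption plus mentions) and a toy keyword index over that metadata. It is only a minimal sketch under stated assumptions: the names `Figure` and `FigureSearchIndex` are hypothetical, the index is a plain in-memory inverted index, and none of this is taken from the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Figure:
    """Hypothetical record for one extracted figure and its metadata."""
    figure_id: str
    caption: str
    mentions: List[str] = field(default_factory=list)


class FigureSearchIndex:
    """Toy inverted index over captions and mentions (illustrative only)."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = {}

    def add(self, fig: Figure) -> None:
        # Index every whitespace-separated token of the caption and mentions.
        text = " ".join([fig.caption, *fig.mentions]).lower()
        for token in text.split():
            key = token.strip(".,;:()")
            if key:
                self._postings.setdefault(key, set()).add(fig.figure_id)

    def search(self, query: str) -> List[str]:
        # Return the ids of figures whose metadata contains all query tokens.
        tokens = query.lower().split()
        if not tokens:
            return []
        hits = [self._postings.get(t, set()) for t in tokens]
        return sorted(set.intersection(*hits))


index = FigureSearchIndex()
index.add(Figure("fig1", "Precision versus recall for the extractor.",
                 ["Figure 1 shows precision."]))
index.add(Figure("fig2", "Runtime versus document size."))
print(index.search("precision recall"))
```

The image processing and natural language processing modules are omitted here; in a design like this they would presumably consume the same `Figure` records.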

Original language: English (US)
Title of host publication: WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web
Publisher: Association for Computing Machinery, Inc
Pages: 667-672
Number of pages: 6
ISBN (Electronic): 9781450334730
State: Published - May 18 2015
Event: 24th International Conference on World Wide Web, WWW 2015 - Florence, Italy
Duration: May 18 2015 to May 22 2015

Publication series

Name: WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web

Other

Other: 24th International Conference on World Wide Web, WWW 2015
Country/Territory: Italy
City: Florence
Period: 5/18/15 to 5/22/15

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Software
