Efficient record-level wrapper induction

Shuyi Zheng, Ruihua Song, Ji Rong Wen, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

38 Scopus citations

Abstract

Web information is often presented in the form of records, e.g., a product record on a shopping website or a personal profile on a social utility website. Given a host webpage and related information needs, identifying relevant records as well as their internal semantic structures is critical to many online information systems. Wrapper induction is one of the most effective methods for such tasks. However, most traditional wrapper techniques have issues dealing with web records since they are designed to extract information from a page, not a record. We propose a record-level wrapper system. In our system, we use a novel "broom" structure to represent both records and generated wrappers. With such a representation, our system is able to effectively extract records and identify their internal semantics at the same time. We test our system on 16 real-life websites from four different domains. Experimental results demonstrate 99% extraction accuracy in terms of F1-value.

Original language: English (US)
Title of host publication: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Pages: 47-55
Number of pages: 9
State: Published - 2009
Event: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009 - Hong Kong, China
Duration: Nov 2 2009 to Nov 6 2009

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Other

Other: ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Country/Territory: China
City: Hong Kong
Period: 11/2/09 to 11/6/09

All Science Journal Classification (ASJC) codes

  • General Business, Management and Accounting
  • General Decision Sciences
