TY - GEN
T1 - Efficient record-level wrapper induction
AU - Zheng, Shuyi
AU - Song, Ruihua
AU - Wen, Ji Rong
AU - Giles, C. Lee
N1 - Copyright:
Copyright 2010 Elsevier B.V., All rights reserved.
PY - 2009
Y1 - 2009
N2 - Web information is often presented in the form of record, e.g., a product record on a shopping website or a personal profile on a social utility website. Given a host webpage and related information needs, how to identify relevant records as well as their internal semantic structures is critical to many online information systems. Wrapper induction is one of the most effective methods for such tasks. However, most traditional wrapper techniques have issues dealing with web records since they are designed to extract information from a page, not a record. We propose a record-level wrapper system. In our system, we use a novel ''broom'' structure to represent both records and generated wrappers. With such representation, our system is able to effectively extract records and identify their internal semantics at the same time. We test our system on 16 real-life websites from four different domains. Experimental results demonstrate 99\% extraction accuracy in terms of F1-Value.
AB - Web information is often presented in the form of record, e.g., a product record on a shopping website or a personal profile on a social utility website. Given a host webpage and related information needs, how to identify relevant records as well as their internal semantic structures is critical to many online information systems. Wrapper induction is one of the most effective methods for such tasks. However, most traditional wrapper techniques have issues dealing with web records since they are designed to extract information from a page, not a record. We propose a record-level wrapper system. In our system, we use a novel ''broom'' structure to represent both records and generated wrappers. With such representation, our system is able to effectively extract records and identify their internal semantics at the same time. We test our system on 16 real-life websites from four different domains. Experimental results demonstrate 99\% extraction accuracy in terms of F1-Value.
UR - http://www.scopus.com/inward/record.url?scp=74549208580&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=74549208580&partnerID=8YFLogxK
U2 - 10.1145/1645953.1645962
DO - 10.1145/1645953.1645962
M3 - Conference contribution
AN - SCOPUS:74549208580
SN - 9781605585123
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 47
EP - 55
BT - ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
T2 - ACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Y2 - 2 November 2009 through 6 November 2009
ER -