TY - GEN
T1 - ECON
T2 - 12th International Asia Pacific Web Conference, APWeb 2010
AU - Guo, Yan
AU - Tang, Huifeng
AU - Song, Linhai
AU - Wang, Yu
AU - Ding, Guodong
PY - 2010
Y1 - 2010
N2 - This paper presents a simple but effective approach, named ECON, for fully automatic extraction of content from Web news pages. ECON represents a Web news page as a DOM tree and leverages the substantial features of that tree. ECON first finds a snippet-node, which wraps part of the news content, and then backtracks from the snippet-node until it finds a summary-node, which wraps the entire news content. During backtracking, ECON removes noise. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements for scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also easy to implement.
AB - This paper presents a simple but effective approach, named ECON, for fully automatic extraction of content from Web news pages. ECON represents a Web news page as a DOM tree and leverages the substantial features of that tree. ECON first finds a snippet-node, which wraps part of the news content, and then backtracks from the snippet-node until it finds a summary-node, which wraps the entire news content. During backtracking, ECON removes noise. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements for scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also easy to implement.
UR - http://www.scopus.com/inward/record.url?scp=77954300056&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954300056&partnerID=8YFLogxK
U2 - 10.1109/APWeb.2010.11
DO - 10.1109/APWeb.2010.11
M3 - Conference contribution
AN - SCOPUS:77954300056
SN - 9780769540122
T3 - Advances in Web Technologies and Applications - Proceedings of the 12th Asia-Pacific Web Conference, APWeb 2010
SP - 314
EP - 320
BT - Advances in Web Technologies and Applications - Proceedings of the 12th Asia-Pacific Web Conference, APWeb 2010
Y2 - 6 April 2010 through 8 April 2010
ER -