TY - GEN
T1 - Identifying content blocks from Web documents
AU - Debnath, Sandip
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2005
Y1 - 2005
AB - Intelligent information processing systems, such as digital libraries or search engines, index web-pages according to their informative content. However, web-pages also contain non-informative content, e.g., navigation sidebars, advertisements, and copyright notices. It is important to separate the informative "primary content blocks" from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor, are proposed to identify the "primary content blocks" based on their features. Neither algorithm requires supervised learning, yet both identify the "primary content blocks" with high precision and recall. Operating on several thousand web-pages obtained from 15 different websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.
UR - http://www.scopus.com/inward/record.url?scp=26944496810&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=26944496810&partnerID=8YFLogxK
U2 - 10.1007/11425274_30
DO - 10.1007/11425274_30
M3 - Conference contribution
AN - SCOPUS:26944496810
SN - 3540258787
SN - 9783540258780
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 285
EP - 293
BT - Foundations of Intelligent Systems - 15th International Symposium, ISMIS 2005, Proceedings
PB - Springer Verlag
T2 - 15th International Symposium on Methodologies for Intelligent Systems, ISMIS 2005
Y2 - 25 May 2005 through 28 May 2005
ER -