Identifying content blocks from Web documents

Sandip Debnath, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

37 Scopus citations

Abstract

Intelligent information processing systems, such as digital libraries and search engines, index web-pages according to their informative content. However, web-pages also contain many non-informative blocks, e.g., navigation sidebars, advertisements, and copyright notices. It is therefore important to separate the informative "primary content blocks" from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor, are proposed to identify the "primary content blocks" based on their features. Neither algorithm requires supervised learning, yet both identify the "primary content blocks" with high precision and recall. On several thousand web-pages obtained from 15 different websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.
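The abstract gives no algorithmic detail, so the following is only an illustrative sketch of the general idea of choosing blocks by a feature and scoring the result with precision and recall. It is not the paper's FeatureExtractor or K-FeatureExtractor. The `Block` class, the feature names, the `pick_primary_blocks` helper, and the 50% share threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical page block described by simple feature counts."""
    name: str
    text_words: int
    links: int
    images: int


def pick_primary_blocks(blocks, feature="text_words", min_share=0.5):
    """Return names of blocks holding at least `min_share` of the page's
    total for the chosen feature (an assumed rule, not the paper's)."""
    total = sum(getattr(b, feature) for b in blocks)
    if total == 0:
        return set()
    return {b.name for b in blocks if getattr(b, feature) / total >= min_share}


def precision_recall(predicted, actual):
    """Standard precision and recall over sets of block names."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Made-up page: one article block surrounded by non-informative blocks.
page = [
    Block("nav", text_words=20, links=30, images=0),
    Block("article", text_words=600, links=5, images=2),
    Block("ads", text_words=30, links=10, images=6),
    Block("footer", text_words=15, links=8, images=0),
]

predicted = pick_primary_blocks(page)  # select by text share
precision, recall = precision_recall(predicted, {"article"})
```

Switching `feature` to `"links"` or `"images"` would select blocks dominated by a different kind of content, which is the sense in which this selection is feature-driven; the data and threshold here are invented for illustration only.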

Original language: English (US)
Title of host publication: Foundations of Intelligent Systems - 15th International Symposium, ISMIS 2005, Proceedings
Publisher: Springer Verlag
Pages: 285-293
Number of pages: 9
ISBN (Print): 3540258787, 9783540258780
State: Published - 2005
Event: 15th International Symposium on Methodologies for Intelligent Systems, ISMIS 2005 - Saratoga Springs, NY, United States
Duration: May 25 2005 to May 28 2005

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3488 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 15th International Symposium on Methodologies for Intelligent Systems, ISMIS 2005
Country/Territory: United States
City: Saratoga Springs, NY
Period: 5/25/05 to 5/28/05

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
