Abstract
Search engines crawl and index webpages based on their informative content. However, webpages, especially dynamically generated ones, contain items that are not part of the "primary content", e.g., navigation sidebars, advertisements, and copyright notices. Most end-users seek the primary content and largely ignore the non-informative content. A tool that helps an end-user or application search and process information from webpages automatically must therefore separate the "primary content blocks" from the other blocks. This paper proposes two new algorithms, ContentExtractor and FeatureExtractor. They identify primary content blocks by i) discarding blocks that occur a large number of times across webpages and ii) selecting blocks with desired features, respectively. The algorithms identify primary content blocks with high precision and recall; they also reduce the storage requirements of search engines, yielding smaller indexes, faster search times, and better user satisfaction. On several thousand webpages obtained from 11 news websites, our algorithms significantly outperform the Entropy-based algorithm proposed by Lin and Ho [7] in both accuracy and run-time.
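The abstract only sketches the two heuristics, so the following is a minimal illustration of the ContentExtractor idea: blocks that recur verbatim across many pages of a site (navigation bars, copyright notices) are treated as non-content, while rarely repeated blocks are kept as primary content. The tag-based block splitting, the `extract_blocks`/`primary_blocks` names, and the 0.5 frequency threshold are assumptions for illustration, not the paper's exact block partitioning or similarity measure.

```python
# Sketch of the ContentExtractor heuristic from the abstract: a block that
# appears on a large fraction of a site's pages is presumed boilerplate;
# blocks below the frequency threshold are kept as primary content.
import re
from collections import Counter

def extract_blocks(html: str) -> list[str]:
    """Crudely split a page into candidate blocks on common container tags
    (an assumed stand-in for the paper's block partitioning)."""
    parts = re.split(r"</?(?:table|div|td|tr)[^>]*>", html, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

def primary_blocks(pages: list[str], max_freq: float = 0.5) -> list[list[str]]:
    """For each page, keep only blocks occurring in < max_freq of all pages."""
    page_blocks = [extract_blocks(p) for p in pages]
    counts = Counter()
    for blocks in page_blocks:
        counts.update(set(blocks))  # count each distinct block once per page
    n = len(pages)
    return [[b for b in blocks if counts[b] / n < max_freq]
            for blocks in page_blocks]
```

A hard 0.5 cutoff is the simplest choice; ranking blocks by their cross-page frequency and winnowing the most repeated ones would be closer in spirit to the comparison across many pages of a site that the abstract describes.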
| Original language | English (US) |
|---|---|
| Title of host publication | Applied Computing 2005 - Proceedings of the 20th Annual ACM Symposium on Applied Computing |
| Pages | 1722-1726 |
| Number of pages | 5 |
| Volume | 2 |
| DOIs | |
| State | Published - 2005 |
| Event | 20th Annual ACM Symposium on Applied Computing - Santa Fe, NM, United States. Duration: Mar 13 2005 → Mar 17 2005 |
| Other | 20th Annual ACM Symposium on Applied Computing |
|---|---|
| Country/Territory | United States |
| City | Santa Fe, NM |
| Period | 3/13/05 → 3/17/05 |
All Science Journal Classification (ASJC) codes
- Software