Analysis of lexical signatures for finding lost or related documents

Seung Taek Park, David M. Pennock, C. Lee Giles, Robert Krovetz

Research output: Contribution to journal › Conference article › peer-review

16 Scopus citations

Abstract

A lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's [14] original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but behaves almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based methods and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures.
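To make the DF, TF, TFIDF, and hybrid selection rules concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the signature length `k`, the split `n_df` used by the hybrid rule, the plain `tf * log(N / df)` weighting, and the function names are all hypothetical choices that the abstract does not specify. PW is not sketched, since the abstract gives no details of it.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real tokenization)."""
    return text.lower().split()


def document_frequencies(corpus):
    """Map each term to the number of corpus documents that contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def signature(doc, corpus, method="tfidf", k=5):
    """Pick up to k terms of doc under one illustrative scoring rule.

    Assumes doc is itself part of corpus, so every term has df >= 1.
    """
    tf = Counter(tokenize(doc))
    df = document_frequencies(corpus)
    n = len(corpus)

    def score(term):
        if method == "tf":       # most frequent terms in the document
            return tf[term]
        if method == "df":       # rarest terms across the corpus
            return -df[term]
        if method == "tfidf":    # frequent in the document, rare in the corpus
            return tf[term] * math.log(n / max(df[term], 1))
        raise ValueError(f"unknown method: {method}")

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]


def hybrid_signature(doc, corpus, n_df=2, k=5, rest="tfidf"):
    """Take n_df terms by the DF rule, then fill up to k terms by TF or TFIDF."""
    rare = signature(doc, corpus, "df", n_df)
    others = [t for t in signature(doc, corpus, rest, k + n_df) if t not in rare]
    return rare + others[: k - n_df]


if __name__ == "__main__":
    corpus = [
        "apple banana apple cherry",
        "banana cherry date",
        "apple apple apple elderberry banana",
    ]
    for method in ("tf", "df", "tfidf"):
        print(method, signature(corpus[2], corpus, method, k=2))
    print("hybrid", hybrid_signature(corpus[2], corpus, n_df=1, k=2, rest="tf"))
```

In this toy corpus the DF rule favors the term that appears in the fewest documents, while the TF rule favors the term repeated most often in the page itself, which mirrors the trade-off the abstract describes between unique identification and finding related documents.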

Original language: English (US)
Pages (from-to): 11-18
Number of pages: 8
Journal: SIGIR Forum (ACM Special Interest Group on Information Retrieval)
State: Published - 2002
Event: Proceedings of the Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - Tampere, Finland
Duration: Aug 11 2002 to Aug 15 2002

All Science Journal Classification (ASJC) codes

  • Management Information Systems
  • Hardware and Architecture
