Methods for Sampling Pages Uniformly from the World Wide Web

Paat Rusmevichientong, David M. Pennock, Steve Lawrence, C. Lee Giles

Research output: Contribution to conferencePaperpeer-review

66 Scopus citations


We present two new algorithms for generating uniformly random samples of pages from the World Wide Web, building upon recent work by Henzinger et al. (Henzinger et al. 2000) and Bar-Yossef et al. (Bar-Yossef et al. 2000). Both algorithms are based on a weighted random-walk methodology. The first algorithm (DIRECTED-SAMPLE) operates on arbitrary directed graphs, and so is naturally applicable to the web. We show that, in the limit, this algorithm generates samples that are uniformly random. The second algorithm (UNDIRECTED-SAMPLE) operates on undirected graphs, thus requiring a mechanism for obtaining inbound links to web pages (e.g., access to a search engine). With this additional knowledge of inbound links, the algorithm can arrive at a uniform distribution faster than DIRECTEDSAMPLE, and we derive explicit bounds on the time to convergence. In addition, we evaluate the two algorithms on simulated web data, showing that both yield reliably uniform samples of pages. We also compare our results with those of previous algorithms, and discuss the theoretical relationships among the various proposed methods.

Original languageEnglish (US)
Number of pages8
StatePublished - 2001
Event2001 AAAI Fall Symposium - North Falmouth, United States
Duration: Nov 2 2001Nov 4 2001


Conference2001 AAAI Fall Symposium
Country/TerritoryUnited States
CityNorth Falmouth

All Science Journal Classification (ASJC) codes

  • General Engineering

Cite this