Exploring web scale language models for search query processing

Jian Huang, Jianfeng Gao, Jiangbo Miao, Xiaolong Li, Kuansan Wang, Fritz Behr, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceedingConference contribution

48 Scopus citations

Abstract

It has been widely observed that search queries are composed in a very different style from that of the body or the title of a document. Many techniques explicitly accounting for this language style discrepancy have shown promising results for information retrieval, yet a large scale analysis on the extent of the language differences has been lacking. In this paper, we present an extensive study on this issue by examining the language model properties of search queries and the three text streams associated with each web document: the body, the title, and the anchor text. Our information theoretical analysis shows that queries seem to be composed in a way most similar to how authors summarize documents in anchor texts or titles, offering a quantitative explanation to the observations in past work. We apply these web scale n-gram language models to three search query processing (SQP) tasks: query spelling correction, query bracketing and long query segmentation. By controlling the size and the order of different language models, we find that the perplexity metric to be a good accuracy indicator for these query processing tasks. We show that using smoothed language models yields significant accuracy gains for query bracketing for instance, compared to using web counts as in the literature. We also demonstrate that applying web-scale language models can have marked accuracy advantage over smaller ones.

Original languageEnglish (US)
Title of host publicationProceedings of the 19th International Conference on World Wide Web, WWW '10
Pages451-460
Number of pages10
DOIs
StatePublished - 2010
Event19th International World Wide Web Conference, WWW2010 - Raleigh, NC, United States
Duration: Apr 26 2010Apr 30 2010

Publication series

NameProceedings of the 19th International Conference on World Wide Web, WWW '10

Other

Other19th International World Wide Web Conference, WWW2010
Country/TerritoryUnited States
CityRaleigh, NC
Period4/26/104/30/10

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Computer Science Applications

Fingerprint

Dive into the research topics of 'Exploring web scale language models for search query processing'. Together they form a unique fingerprint.

Cite this