SimSeerX: A similar document search engine

Kyle Williams, Jian Wu, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Scopus citations

Abstract

The need to find similar documents occurs in many settings, such as in plagiarism detection or research paper recommendation. Manually constructing queries to find similar documents may be overly complex, thus motivating the use of whole documents as queries. This paper introduces SimSeerX, a search engine for similar document retrieval that receives whole documents as queries and returns a ranked list of similar documents. Key to the design of SimSeerX is that it is able to work with multiple similarity functions and document collections. We present the architecture and interface of SimSeerX, show its applicability with 3 different similarity functions and demonstrate its scalability on a collection of 3.5 million academic documents.

Original language: English (US)
Title of host publication: DocEng 2014 - Proceedings of the 2014 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery, Inc
Pages: 143-146
Number of pages: 4
ISBN (Electronic): 9781450329491
State: Published - 2014
Event: 2014 ACM Symposium on Document Engineering, DocEng 2014 - Fort Collins, United States
Duration: Sep 16 2014 to Sep 19 2014

Publication series

Name: DocEng 2014 - Proceedings of the 2014 ACM Symposium on Document Engineering

Other

Other: 2014 ACM Symposium on Document Engineering, DocEng 2014
Country/Territory: United States
City: Fort Collins
Period: 9/16/14 to 9/19/14

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Computer Science Applications
