Mining, indexing, and searching for textual chemical molecule information on the web

Bingjun Sun, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

20 Scopus citations

Abstract

Current search engines do not support user searches for chemical entities (chemical names and formulae) beyond simple keyword searches. Usually a chemical molecule can be represented in multiple textual ways. A simple keyword search would retrieve only the exact match and not the others. We show how to build a search engine that enables searches for chemical entities and demonstrate empirically that it improves the relevance of returned documents. Our search engine first extracts chemical entities from text, performs novel indexing suitable for chemical names and formulae, and supports different query models that a scientist may require. We propose a model of hierarchical conditional random fields for chemical formula tagging that considers long-term dependencies at the sentence level. To support substring searches of chemical names, a search engine must index substrings of chemical names. Indexing all possible subsequences is not feasible in practice. We propose an algorithm for independent frequent subsequence mining to discover sub-terms of chemical names with their probabilities. We then propose an unsupervised hierarchical text segmentation (HTS) method to represent a sequence with a tree structure based on discovered independent frequent subsequences, so that the sub-terms on the HTS tree can be indexed. Query models with corresponding ranking functions are introduced for chemical name searches. Experiments show that our approaches to chemical entity tagging perform well. Furthermore, we show that index pruning can reduce the index size and query time without changing the returned ranked results significantly. Finally, experiments show that our approaches outperform traditional methods for document search with ambiguous chemical terms.
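
To make the indexing problem in the abstract concrete, the following is a minimal Python sketch, not the authors' algorithm. It counts all substrings of a few chemical names (showing the quadratic growth that makes full substring indexing impractical) and then applies a simplified, greedy stand-in for independent frequent subsequence mining: longer candidates are considered first, and an occurrence of a shorter candidate is counted only if it does not lie entirely inside an already accepted longer occurrence. The function names, the `min_freq` and `min_len` parameters, and the greedy rule are assumptions made for illustration; the sketch also omits the probabilities the abstract mentions.

```python
from collections import Counter


def substring_counts(names, min_len=2):
    """Count every occurrence of every substring of length >= min_len."""
    counts = Counter()
    for name in names:
        n = len(name)
        for i in range(n):
            for j in range(i + min_len, n + 1):
                counts[name[i:j]] += 1
    return counts


def independent_frequent_substrings(names, min_freq=2, min_len=2):
    """Simplified, hypothetical stand-in for independent frequent mining.

    Candidates are visited longest first. An occurrence counts only if it
    is not fully contained in an occurrence of an accepted longer sub-term.
    """
    counts = substring_counts(names, min_len)
    candidates = sorted(
        (s for s, c in counts.items() if c >= min_freq),
        key=lambda s: (-len(s), s),
    )
    claimed = [[] for _ in names]  # accepted (start, end) spans per name
    result = {}
    for sub in candidates:
        free = []
        for idx, name in enumerate(names):
            start = name.find(sub)
            while start != -1:
                end = start + len(sub)
                covered = any(a <= start and end <= b for a, b in claimed[idx])
                if not covered:
                    free.append((idx, start, end))
                start = name.find(sub, start + 1)
        if len(free) >= min_freq:
            result[sub] = len(free)
            for idx, start, end in free:
                claimed[idx].append((start, end))
    return result


names = ["methylbenzene", "ethylbenzene", "methylamine"]
# A 13-character name alone has 13 * 14 / 2 = 91 substring occurrences.
print(sum(substring_counts(["methylbenzene"], min_len=1).values()))  # 91
print(independent_frequent_substrings(names, min_freq=2, min_len=4))
# {'ethylbenzene': 2, 'methyl': 2}
```

In this toy run, "ethyl" occurs three times but is never reported, because each occurrence sits inside an accepted longer sub-term. That is the intuition behind keeping only independently frequent sub-terms: the index stays far smaller than one built over all substrings.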

Original language: English (US)
Title of host publication: Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08
Pages: 735-744
Number of pages: 10
State: Published - 2008
Event: 17th International Conference on World Wide Web 2008, WWW'08 - Beijing, China
Duration: Apr 21 2008 to Apr 25 2008

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
