A top-down method for mining most specific frequent patterns in biological sequence data

Martin Ester, Xiang Zhang

Research output: Chapter in Book/Report/Conference proceedingConference contribution

17 Scopus citations

Abstract

The emergence of automated high-throughput sequencing technologies has resulted in a huge increase of the amount of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences which subsume a large number of more general patterns. In the biological domain, a wealth of knowledge on the relationships between the symbols of the underlying alphabets (in particular, amino acids) of the sequences has been acquired, which can be represented in concept graphs. Using such concept graphs, much longer frequent patterns can be discovered which are more meaningful from a biological point of view. In this paper, we introduce the problem of mining most specific frequent patterns in biological data in the presence of concept graphs. While the well-known methods for frequent sequence mining typically follow the paradigm of bottom-up pattern generation, we present a novel top-down method (ToMMS) for mining such patterns. ToMMS (1) always generates more specific patterns before more general ones and (2) performs only minimal generalizations of infrequent candidate sequences. Due to these properties, the number of patterns generated and tested is minimized. Our experimental results demonstrate that ToMMS clearly outperforms state-of-the-art methods from the bioinformatics community as well as from the data mining community for reasonably low minimum support thresholds.

Original languageEnglish (US)
Title of host publicationProceedings of the Fourth SIAM International Conference on Data Mining
EditorsM.W. Berry, U. Dayal, C. Kamath, D. Skillicorn
Pages90-101
Number of pages12
StatePublished - 2004
EventProceedings of the Fourth SIAM International Conference on Data Mining - Lake Buena Vista, FL, United States
Duration: Apr 22 2004Apr 24 2004

Other

OtherProceedings of the Fourth SIAM International Conference on Data Mining
Country/TerritoryUnited States
CityLake Buena Vista, FL
Period4/22/044/24/04

All Science Journal Classification (ASJC) codes

  • General Mathematics

Fingerprint

Dive into the research topics of 'A top-down method for mining most specific frequent patterns in biological sequence data'. Together they form a unique fingerprint.

Cite this