CRD: Fast co-clustering on large datasets utilizing sampling-based matrix decomposition

Feng Pan, Xiang Zhang, Wei Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

28 Scopus citations

Abstract

The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications, such as text data mining, microarray analysis, and recommendation system analysis. Compared with the classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m × n), where m and n are the numbers of rows and columns in the data matrix respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets cannot be held entirely in main memory during co-clustering, which violates the assumption made by the previous algorithms. In this paper, we propose a general framework for fast co-clustering of large datasets, CRD. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. Also, CRD does not require the whole data matrix to be in main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy but with much less computational cost.
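The abstract does not say which sampling scheme CRD uses, so the following is only a minimal sketch of the general idea behind sampling-based matrix decomposition, not the authors' method. It assumes length-squared sampling (rows and columns are picked with probability proportional to their squared Euclidean norm), and the function names `length_squared_sample` and `sample_submatrices` are hypothetical. The sketch stops at extracting the sampled column and row submatrices and does not attempt the co-clustering step or reproduce the complexity claims.

```python
import random


def length_squared_sample(matrix, k, axis, seed=0):
    """Pick k row (axis=0) or column (axis=1) indices, with replacement.

    Assumption: sampling probability is proportional to the squared
    Euclidean norm of each row/column. The abstract does not specify this.
    """
    rng = random.Random(seed)
    if axis == 0:
        weights = [sum(v * v for v in row) for row in matrix]
    else:
        n_cols = len(matrix[0])
        weights = [sum(row[j] ** 2 for row in matrix) for j in range(n_cols)]
    return rng.choices(range(len(weights)), weights=weights, k=k)


def sample_submatrices(matrix, n_rows, n_cols, seed=0):
    """Return sampled columns C (m x n_cols), sampled rows R (n_rows x n),
    and the chosen row and column indices."""
    row_idx = length_squared_sample(matrix, n_rows, axis=0, seed=seed)
    col_idx = length_squared_sample(matrix, n_cols, axis=1, seed=seed + 1)
    C = [[row[j] for j in col_idx] for row in matrix]
    R = [list(matrix[i]) for i in row_idx]
    return C, R, row_idx, col_idx


# Small illustrative matrix (made-up values); column 1 is all zeros,
# so it has zero sampling weight.
matrix = [
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
]
C, R, row_idx, col_idx = sample_submatrices(matrix, n_rows=2, n_cols=2)
```

Only the small matrices `C` and `R` need to be kept after sampling, which is the kind of property that lets such methods avoid holding the full data matrix in memory.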

Original language: English (US)
Title of host publication: SIGMOD 2008
Subtitle of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data 2008
Pages: 173-184
Number of pages: 12
State: Published - 2008
Event: 2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08 - Vancouver, BC, Canada
Duration: Jun 9 2008 to Jun 12 2008

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Other

Other: 2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08
Country/Territory: Canada
City: Vancouver, BC
Period: 6/9/08 to 6/12/08

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
