TY - GEN
T1 - CRD
T2 - 2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08
AU - Pan, Feng
AU - Zhang, Xiang
AU - Wang, Wei
PY - 2008
Y1 - 2008
N2 - The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications, such as text data mining, microarray analysis, and recommendation system analysis. Compared with classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m × n), where m and n are the numbers of rows and columns in the data matrix, respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets cannot be held entirely in main memory during co-clustering, which violates the assumption made by the previous algorithms. In this paper, we propose a general framework for fast co-clustering of large datasets, CRD. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. Also, CRD does not require the whole data matrix to be in main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy at a much lower computational cost.
AB - The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications, such as text data mining, microarray analysis, and recommendation system analysis. Compared with classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m × n), where m and n are the numbers of rows and columns in the data matrix, respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets cannot be held entirely in main memory during co-clustering, which violates the assumption made by the previous algorithms. In this paper, we propose a general framework for fast co-clustering of large datasets, CRD. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. Also, CRD does not require the whole data matrix to be in main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy at a much lower computational cost.
UR - http://www.scopus.com/inward/record.url?scp=57149147732&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=57149147732&partnerID=8YFLogxK
U2 - 10.1145/1376616.1376637
DO - 10.1145/1376616.1376637
M3 - Conference contribution
AN - SCOPUS:57149147732
SN - 9781605581026
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 173
EP - 184
BT - SIGMOD 2008
Y2 - 9 June 2008 through 12 June 2008
ER -