TY - GEN
T1 - A general framework for fast co-clustering on large datasets using matrix decomposition
AU - Pan, Feng
AU - Zhang, Xiang
AU - Wang, Wei
PY - 2008
Y1 - 2008
AB - Simultaneously clustering the columns and rows (co-clustering) of a large data matrix is an important problem with wide applications, such as document mining, microarray analysis, and recommendation systems. Several co-clustering algorithms have been shown to be effective in discovering hidden clustering structures in the data matrix. For a data matrix of m rows and n columns, the time complexity of these methods is usually on the order of m × n (if not higher). This limits their applicability to data matrices involving a large number of columns and rows. Moreover, an implicit assumption made by existing co-clustering methods is that the whole data matrix needs to be held in main memory. In this paper, we propose a general framework, CRD, for co-clustering large datasets utilizing recently developed sampling-based matrix decomposition methods. The time complexity of our approach is linear in m and n, and it does not require the whole data matrix to be in main memory. Experimental results show that CRD achieves accuracy competitive with existing co-clustering methods at much lower computational cost.
UR - http://www.scopus.com/inward/record.url?scp=52649158129&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=52649158129&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2008.4497548
DO - 10.1109/ICDE.2008.4497548
M3 - Conference contribution
C2 - 20419039
AN - SCOPUS:52649158129
SN - 9781424418374
T3 - Proceedings - International Conference on Data Engineering
SP - 1337
EP - 1339
BT - Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE'08
T2 - 2008 IEEE 24th International Conference on Data Engineering, ICDE'08
Y2 - 7 April 2008 through 12 April 2008
ER -