TY - JOUR
T1 - On constructing an optimal consensus clustering from multiple clusterings
AU - Berman, Piotr
AU - DasGupta, Bhaskar
AU - Kao, Ming Yang
AU - Wang, Jie
N1 - Funding Information:
* Corresponding author. E-mail addresses: [email protected] (P. Berman), [email protected] (B. DasGupta), [email protected] (M.-Y. Kao), [email protected] (J. Wang). 1 Supported by NSF grant CCR-0208821. 2 Supported in part by NSF grants IIS-0346973, IIS-0612044 and DBI-0543365. 3 Supported in part by NSF grant EIA-0112934. 4 Supported in part by NSF under grant CCR-0296037 and CCF-04080261, and by NSF of China under grant 60273062.
PY - 2007/11/15
Y1 - 2007/11/15
N2 - Computing a suitable measure of consensus among several clusterings on the same data is an important problem that arises in several areas such as computational biology and data mining. In this paper, we formalize a set-theoretic model for computing such a similarity measure. Roughly speaking, in this model we have k > 1 partitions (clusters) of the same data set each containing the same number of sets and the goal is to align the sets in each partition to minimize a similarity measure. For k = 2, a polynomial-time solution was proposed by Gusfield (Information Processing Letters 82 (2002) 159-164). In this paper, we show that the problem is MAX-SNP-hard for k = 3 even if each partition in each cluster contains no more than 2 elements and provide a 2 - frac(2, k)-approximation algorithm for the problem for any k.
AB - Computing a suitable measure of consensus among several clusterings on the same data is an important problem that arises in several areas such as computational biology and data mining. In this paper, we formalize a set-theoretic model for computing such a similarity measure. Roughly speaking, in this model we have k > 1 partitions (clusters) of the same data set each containing the same number of sets and the goal is to align the sets in each partition to minimize a similarity measure. For k = 2, a polynomial-time solution was proposed by Gusfield (Information Processing Letters 82 (2002) 159-164). In this paper, we show that the problem is MAX-SNP-hard for k = 3 even if each partition in each cluster contains no more than 2 elements and provide a 2 - frac(2, k)-approximation algorithm for the problem for any k.
UR - http://www.scopus.com/inward/record.url?scp=34548032601&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34548032601&partnerID=8YFLogxK
U2 - 10.1016/j.ipl.2007.06.008
DO - 10.1016/j.ipl.2007.06.008
M3 - Article
AN - SCOPUS:34548032601
SN - 0020-0190
VL - 104
SP - 137
EP - 145
JO - Information Processing Letters
JF - Information Processing Letters
IS - 4
ER -