CAREER: A Theoretical Exploration of Efficient and Accurate Clustering Algorithms

Project: Research project

Project Details

Description

Clustering, a fundamental technique in data analysis and machine learning, involves grouping data points to ensure higher similarities within a group (cluster) than across clusters. Since its inception in the early 20th century, clustering has proven highly valuable in diverse fields such as biology, economics, marketing, statistics, computer science, and social network analysis. This CAREER project aims to create a unified framework along with various tools and techniques to design optimal approximation algorithms for a broad range of NP-hard clustering problems. Despite significant progress in developing efficient approximations for computationally hard clustering problems, the analysis of the large-scale and complex datasets remains challenging resulting in suboptimal accuracy and limiting practical applications in science and engineering. The algorithmic development and analysis are expected to have a direct impact on the fields of data science and bioinformatics. The educational components of the project include a 3-day summer workshop for high school students, training and mentoring of graduate students, development of new courses, and the fostering of student participation from underrepresented groups in theoretical computer science.The project comprises two primary thrusts. The first thrust focuses on addressing clustering challenges in strings with special emphasis on centroid-based clustering problems such as k-median and k-center. This study will encompass various metric spaces including edit distance, Ulam, and Kendall tau. The objective is to formulate methodologies that transcend the limitations set by the triangle inequality, yielding polynomial-time algorithms with approximations arbitrarily close to one. The second thrust focuses on issues related to hierarchical clustering. Here the objective is to evaluate how well the techniques and frameworks originally devised for addressing string clustering problems can be adapted for aggregating hierarchical clusters. Additionally, the project will develop new methods for constructing hierarchical clusters specifically tailored to address challenges related to large datasets. The new tools and combinatorial methods developed in this project are expected to enhance algorithms for clustering problems and should find applicability in various other domains including communication complexity, streaming algorithms, and distributed systems.This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
StatusActive
Effective start/end date5/1/244/30/29

Funding

  • National Science Foundation: $647,681.00

Fingerprint

Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.