Project Details
Description
Many modern scientific fields, such as astrophysics, bioinformatics, finance, forensics, and social science, generate massive amounts of data that are both high-dimensional and non-standard. For example, the data may have structure, such as graphs, functions, strings, and sets, without being Euclidean. To analyze these data sets and address various statistical applications arising in these fields, efficient learning and inference procedures that can handle high-dimensional and non-standard data are needed. The functional analytic paradigm involving reproducing kernels, also known as the kernel method, provides a unified framework to handle such data and has been applied to a variety of non-parametric statistical problems with great empirical success by the machine learning community. However, its theoretical understanding in terms of statistical optimality has been limited, and computationally it scales poorly to large data sets. The key focus of this project is to explore various foundational research questions associated with the kernel method to achieve a statistically optimal and computationally efficient paradigm that can handle high-dimensional non-standard data. This research will significantly impact scientific development in all areas of science and engineering that intersect with statistics, and will be integrated with the PI's educational activities of mentoring students, developing new courses and forging new collaborations. Methods and code developed under this project will be made publicly available for ready use.
The core idea behind the kernel method is to map the observed data (which may be high-dimensional and non-standard) to a function space called the reproducing kernel Hilbert space (RKHS) and to apply standard methods developed for Euclidean data to the mapped data. Ironically, the RKHS usually has higher dimension than the observed data (it may even be infinite-dimensional), and it is characterized by a kernel function called the reproducing kernel. The main advantage of the kernel method is its ability to explore nonlinear relationships in data by simply exploring linear relationships between the mapped elements in the RKHS through the kernel function. Despite its superior empirical performance, the statistical theory of learning algorithms based on the kernel method is not well understood except in a few cases such as classification, non-parametric least squares regression, principal component analysis and goodness-of-fit testing. In this project, the PI will explore various foundational research questions associated with the kernel method and associated learning algorithms to address this gap. The project consists of four related research themes that overall seek to deepen the mathematical understanding of the kernel method so as to exploit its full power in constructing inference procedures that can efficiently handle non-standard data. The project will also shed light on the advantages and limitations of the kernel method over other non-parametric methods in the literature. The aims are to (i) develop statistical optimality results for kernel-based hypothesis tests and non-linear canonical correlation analysis, (ii) develop computational vs. statistical trade-off analysis for various kernel learning procedures using approximation schemes, such as the Nyström method, random features and their variations, that speed up these procedures, (iii) develop new methodologies with concrete mathematical guarantees using the kernel method for learning and inference on functions and probability distributions, with applications in functional data analysis, and (iv) generalize the kernel method using multi-scale kernels to obtain wavelet-like representations and investigate its statistical and computational behavior in various learning procedures. Overall, the project will develop a comprehensive mathematical theory for computationally efficient kernel-based learning algorithms with applications in statistical learning.
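As an illustration of the random features idea named in aim (ii), the minimal sketch below approximates a Gaussian kernel by an inner product of finite-dimensional random cosine features. It is not taken from the project: the Gaussian kernel, the bandwidth, the number of features, the sample points and all function names are assumptions chosen for the example, and it uses only the Python standard library.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Exact Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def make_random_features(dim, num_features, bandwidth=1.0, seed=0):
    """Build a random cosine feature map (hypothetical helper for this sketch).

    Frequencies are drawn from N(0, I / bandwidth^2), the spectral density of
    the Gaussian kernel above, and phases from Uniform[0, 2*pi]. The inner
    product of two feature vectors is then an unbiased estimate of k(x, y).
    """
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return feature_map


def dot(u, v):
    """Plain Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Two illustrative points in R^3 (values chosen arbitrarily).
x = [0.3, -1.2, 0.5]
y = [0.1, -0.7, 1.0]

phi = make_random_features(dim=3, num_features=5000, seed=0)
exact = gaussian_kernel(x, y)
approx = dot(phi(x), phi(y))  # linear operation on the mapped data

print(f"exact kernel value:   {exact:.4f}")
print(f"random-feature value: {approx:.4f}")
```

Increasing `num_features` tightens the approximation at the price of more computation, which is one simple instance of the computational vs. statistical trade-off that the aim refers to.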
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Status | Active |
---|---|
Effective start/end date | 6/1/20 → 5/31/25 |
Funding
- National Science Foundation: $236,067.00