This project develops a new set of tools, called Information Matrix Analysis (IMA), to explore the structure of multivariate data. With modern data-gathering devices and vast data storage space, researchers can easily collect high-dimensional data, such as biotech data, financial data, satellite imagery, and hyperspectral imagery. Analysis of such high-dimensional data poses great challenges for statisticians due to the so-called 'curse of dimensionality'. The IMA methodology tackles this problem by finding a small number of linear combinations of the original variables that carry most of the information in the original multivariate data. The IMA is based on the eigenanalysis of an information matrix that is defined separately for each type of problem. The eigenvectors of the proposed information matrices provide the linear combinations of the variables that best summarize the useful features of the data because of their high information content. By combining the new method with the random projection method, one can also apply IMA to ultra-high-dimensional data. Alternatively, one can do variable selection with IMA. The new data analysis tool under development in this project is timely, given the data revolution, and broadly applicable. All of the IMA methodology springs from a common foundation, and so is easily transported to tackle new problems. This project will significantly enhance the availability of statistical tools and software for statistical modeling and exploration of multivariate data. The new method will benefit a broad range of scientists and researchers who want to analyze high-dimensional data in various fields, including medical studies, prevention studies, public health, and the social sciences.
This project will accelerate geometric understanding of Fisherian information matrices. The core of this project consists of the application of IMA to three very different problem areas: building models, assessing models, and comparing populations. Suppose one wants to build a model to analyze the relationship between a response variable and multivariate covariates. The IMA of a defined covariate information matrix can be applied to find a smaller number of linear projections of the original covariates to simplify model building. The new projected variables have a natural interpretation: they carry most of the information, as measured by the defined covariate information, about the relationship between the covariates and the response. The IMA can also be generalized to find linear combinations of the data that provide the best discrimination between two densities. In the applications, one density will be the true unknown density, to be estimated nonparametrically, and the other density will be some model for the data, which could be parametric or semiparametric. Alternatively, the two densities could represent two distinct populations one wishes to compare. When the two populations are multivariate normal with the same covariance matrix, the IMA provides the same linear projection as the Fisher linear discriminant direction. However, the IMA can move beyond linear discriminant analysis to multiple linear discriminants. The IMA will be further applied to find linear projections of the data that are useful for assessing the fit of a proposed model, whether parametric, semiparametric, or nonparametric. This project will investigate the applications of IMA to popular graphical models and independent component analysis models. It is expected that IMA will have many more application areas.
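The equal-covariance normal case mentioned above can be checked numerically. The sketch below is illustrative only: the abstract does not define the information matrix, so the code assumes, purely for illustration, that it is the expected outer product of the gradient of the log density ratio log(f1/f0). For two normal densities with a common covariance matrix this gradient is the constant vector Σ⁻¹(μ1 − μ0), so the assumed matrix has rank one and its leading eigenvector is the Fisher linear discriminant direction. The function names and the numeric values are hypothetical.

```python
import numpy as np

def fisher_direction(mu0, mu1, cov):
    """Fisher linear discriminant direction Sigma^{-1} (mu1 - mu0), unit length."""
    w = np.linalg.solve(cov, mu1 - mu0)
    return w / np.linalg.norm(w)

def gradient_outer_product_direction(mu0, mu1, cov):
    """Leading eigenvector of E[g g^T], with g the gradient of log(f1/f0).

    ASSUMPTION: this matrix is a stand-in chosen for illustration; it is not
    taken from the abstract. For N(mu0, cov) versus N(mu1, cov) the gradient
    g = cov^{-1} (mu1 - mu0) does not depend on x, so the expectation is g g^T.
    """
    g = np.linalg.solve(cov, mu1 - mu0)
    M = np.outer(g, g)
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

# Hypothetical means and a positive definite common covariance matrix.
mu0 = np.array([0.0, 0.0, 0.0])
mu1 = np.array([1.0, 2.0, 0.5])
cov = np.array([[2.0, 0.5, 0.0],
                [0.5, 1.0, 0.3],
                [0.0, 0.3, 1.5]])

w_fisher = fisher_direction(mu0, mu1, cov)
w_eigen = gradient_outer_product_direction(mu0, mu1, cov)

# Eigenvectors are defined only up to sign, so compare absolute alignment.
alignment = abs(float(w_fisher @ w_eigen))
print(alignment)
```

An alignment of one (up to floating-point error) means the two directions coincide up to sign. With unequal covariances or non-normal densities the gradient would vary with x, and the matrix could have more than one informative eigenvector, which is in the spirit of the "multiple linear discriminants" described above.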
For high-dimensional data, the project will develop the use of the random projection method to first reduce the dimension of the original variables and restrict the space of possible linear combinations. The IMA can then be applied to a much lower-dimensional data set. Alternatively, one can do variable selection with IMA by imposing sparsity on the informative projections.
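The two-stage idea, random projection followed by eigenanalysis in the reduced space, can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's method: the Gaussian projection matrix, the dimensions, and the helper names are hypothetical, and because the abstract does not define the information matrix, the sample covariance of the projected data is used as a loudly labeled placeholder (which makes this particular sketch equivalent to principal components on the projected data).

```python
import numpy as np

def random_projection(X, k, rng):
    """Project n x p data onto k random Gaussian directions (k much smaller than p)."""
    p = X.shape[1]
    R = rng.standard_normal((p, k)) / np.sqrt(k)  # p x k projection matrix
    return X @ R, R

def leading_directions(M, d):
    """Top-d eigenpairs of a symmetric matrix M, largest eigenvalue first."""
    vals, vecs = np.linalg.eigh(M)        # ascending eigenvalues
    order = np.argsort(vals)[::-1][:d]    # indices of the d largest
    return vals[order], vecs[:, order]

rng = np.random.default_rng(0)
n, p, k, d = 200, 1000, 20, 2             # hypothetical sizes
X = rng.standard_normal((n, p))           # synthetic data for illustration

# Stage 1: reduce p = 1000 variables to k = 20 random linear combinations.
Z, R = random_projection(X, k, rng)

# Stage 2: eigenanalysis in the reduced space.
# PLACEHOLDER: the sample covariance stands in for the information matrix,
# which the abstract does not specify.
M = np.cov(Z, rowvar=False)               # k x k symmetric matrix
vals, B_low = leading_directions(M, d)    # k x d directions in reduced space

# Map the directions back to the original p-dimensional coordinates.
B_full = R @ B_low                        # p x d
print(Z.shape, B_low.shape, B_full.shape)
```

Restricting the search to the column space of `R` is what keeps the eigenproblem small: it is solved for a k × k matrix rather than a p × p one, at the cost of only considering linear combinations that lie in that random subspace.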
Effective start/end date: 8/15/14 → 7/31/19
- National Science Foundation: $240,000.00