Project Details
Description
The main goal of this project is to further develop interdisciplinary theory and methodology applicable to the analysis of high-dimensional tabular data, with a special focus on statistical disclosure limitation. Tabular data are a staple product for disseminating data derived from confidential microdata, fueling social science research and informing policy decisions. As tabular data accumulate in the public domain and record linkage methodologies improve, the threat to confidentiality and privacy grows. This project explores two practical questions: (1) What social science data, releasable from a table with small counts, will maintain confidentiality? and (2) Will the released data be useful for statistical inference? The methodological aspects of this research deal with complete and incomplete characterizations of probability distributions for k-way contingency tables using marginals, conditionals, and odds ratios, drawing on tools from log-linear models, probability, directed acyclic graphs, and algebraic geometry. Complete specifications are associated with unique identification of the full joint distribution, i.e., full disclosure, with the maximum utility. Given observed partial information based on an arbitrary collection of conditionals and marginals, the bounds and distributions on the cell entries tell us what values the cells can take, and thus can be used for risk and utility assessment. The bounds can be calculated via linear and integer programming. Tools from algebraic geometry can be used for the calculation of bounds and induction of distributions. This project evaluates the applicability and effectiveness of the current methodology for high-dimensional discrete data; it will study the disclosure risk and data utility of conditional and marginal releases in large, often sparse, tables. This research improves the current methodology by studying how rounding probability values, when reporting tables of rates, affects the sharpness of the bounds.
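As a small illustration of the cell bounds described above, the simplest case is a two-way table of counts for which only the row and column totals are released. The classical Fréchet bounds then give, for each cell, the smallest and largest count consistent with those totals. The sketch below is not the project's methodology, which covers arbitrary collections of marginals and conditionals and relies on linear and integer programming; the function name `frechet_bounds` and the example totals are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def frechet_bounds(
    row_totals: List[int], col_totals: List[int]
) -> List[List[Tuple[int, int]]]:
    """Return (lower, upper) bounds for every cell of a two-way count table.

    Only the one-way margins are assumed to be released. For cell (i, j):
        max(0, r_i + c_j - n) <= n_ij <= min(r_i, c_j)
    where n is the grand total.
    """
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must share the same grand total")
    return [
        [(max(0, r + c - n), min(r, c)) for c in col_totals]
        for r in row_totals
    ]


# Hypothetical released margins for a 2x2 table (illustrative numbers only).
bounds = frechet_bounds([3, 7], [4, 6])
for row in bounds:
    print(row)
# [(0, 3), (0, 3)]
# [(1, 4), (3, 6)]
```

A narrow interval signals higher disclosure risk for that cell, since an intruder can nearly pin down the count; a wide interval suggests the release reveals less about it.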
The project will start developing a broadly applicable theory for assessing distributions over the space of tables given observed partial information based on arbitrary collections of marginals and conditionals. New results in this area will advance the current frontier of statistical and computational theory by introducing and confirming new statistical models.
Until recently, nothing was known about the effects on confidentiality of releasing tables of rates. Releasing conditional distributions for high-dimensional contingency tables could be useful for social science researchers conducting causal inference while still maintaining confidentiality. The results of this research will further the connections between statistical disclosure limitation, discrete multivariate statistical theory, and computational algebraic geometry. The project increases awareness of data privacy issues in both the statistics and social science research communities, promotes research on statistical disclosure limitation, and recruits young scholars to study new statistical methods applicable to the social and behavioral sciences. This research provides government agencies and public health researchers with new tools for evaluating the safety and utility of high-dimensional tabular data releases. These tools and data releases complement current dissemination practices for public use data files, and offer more flexibility in sharing data and research results while maintaining confidentiality. This award was supported as part of the fiscal year 2005 Mathematical Sciences priority area special competition on Mathematical Social and Behavioral Sciences (MSBS).
Status | Finished |
---|---|
Effective start/end date | 10/1/05 → 9/30/10 |
Funding
- National Science Foundation: $260,000.00