TY - GEN
T1 - On estimating the swapping rate for categorical data
AU - Kifer, Daniel
N1 - Publisher Copyright: © 2015 ACM.
PY - 2015/8/10
Y1 - 2015/8/10
AB - When analyzing data, it is important to account for all sources of noise. Public use datasets, such as those provided by the Census Bureau, often undergo additional perturbations designed to protect confidentiality. This source of noise is generally ignored in data analysis because crucial parameters and details about its implementation are withheld. In this paper, we consider the problem of inferring such parameters from the data. Specifically, we target data swapping, a perturbation technique commonly used by the U.S. Census Bureau and which, barring practical breakthroughs in disclosure control, will be used in the foreseeable future. The vanilla version of data swapping selects pairs of records and exchanges some of their attribute values. The number of swapped records is kept secret even though it is needed for data analysis and investigations into the confidentiality protection of individual records. We propose algorithms for estimating the number of swapped records in categorical data, even when the true data distribution is unknown.
UR - http://www.scopus.com/inward/record.url?scp=84954119764&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84954119764&partnerID=8YFLogxK
U2 - 10.1145/2783258.2783369
DO - 10.1145/2783258.2783369
M3 - Conference contribution
AN - SCOPUS:84954119764
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 557
EP - 566
BT - KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015
Y2 - 10 August 2015 through 13 August 2015
ER -