Identification Risks Evaluation of Partially Synthetic Data with the IdentificationRiskCalculation R Package
Ryan Hornby(a), Jingchen Hu(b),(*)
Transactions on Data Privacy 14:1 (2021) 37 - 52
(a) Vassar College, Box 2785, 124 Raymond Ave, Poughkeepsie, NY 12604, United States.
(b) Vassar College, Box 27, 124 Raymond Ave, Poughkeepsie, NY 12604, United States.
e-mail: rhornby@vassar.edu; jihu@vassar.edu
We extend a general approach to evaluating the identification risk of synthesized variables in partially synthetic data. For multiple continuous synthesized variables, we introduce the use of a radius r in the construction of the identification risk probability of each target record, and illustrate it with working examples. We create the IdentificationRiskCalculation R package to aid researchers and data disseminators in performing these identification risk evaluations. We demonstrate our methods through the R package with applications to a data sample from the Consumer Expenditure Surveys, and discuss the impacts on risk and data utility of 1) the choice of radius r, 2) the choice of synthesized variables, and 3) the choice of the number of synthetic datasets. We give recommendations for statistical agencies on synthesizing continuous variables and evaluating their identification risk.
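To make the role of the radius r concrete, the following is a minimal sketch in Python, not the API of the IdentificationRiskCalculation package. It rests on assumptions that the abstract does not state: a synthetic record counts as a candidate match for a target when every synthesized variable lies within a relative radius r of the target's true value, and the intruder picks uniformly among the candidates. The function names and the toy data are hypothetical.

```python
def within_radius(true_rec, synth_rec, r):
    """Assumed match rule: every variable satisfies |s - y| <= r * |y|."""
    return all(abs(s - y) <= r * abs(y) for y, s in zip(true_rec, synth_rec))


def identification_risks(true_data, synth_data, r):
    """Per-record risk under the assumed rule, for one synthetic dataset.

    Record i in true_data corresponds to record i in synth_data.
    The risk is 1 / (number of candidates) if the target's own synthetic
    record is among the candidates, and 0 otherwise.
    """
    risks = []
    for i, y in enumerate(true_data):
        # Indices of synthetic records that fall inside the radius around y
        candidates = [j for j, s in enumerate(synth_data) if within_radius(y, s, r)]
        # Uniform guess among candidates; no risk if the true match is excluded
        risks.append(1.0 / len(candidates) if i in candidates else 0.0)
    return risks


# Hypothetical toy data: two continuous synthesized variables per record
true_data = [(100.0, 50.0), (105.0, 52.0), (300.0, 10.0)]
synth_data = [(102.0, 51.0), (98.0, 49.0), (400.0, 10.5)]

print(identification_risks(true_data, synth_data, r=0.1))  # [0.5, 0.5, 0.0]
print(identification_risks(true_data, synth_data, r=0.5))  # [0.5, 0.5, 1.0]
```

In this toy example the third record has no candidates at r = 0.1 but becomes a unique match at r = 0.5, which is one way the choice of radius can change the estimated risk. With several synthetic datasets, one plausible summary is to average these per-record risks across datasets, though that step is likewise an assumption here.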