The goal of this document is to describe the differential privacy approach taken by Synthetic Data Showcase (SDS) to aggregate and synthesize data.
The approach is based on Differentially Private Marginals.

To produce synthetic data, SDS: (i) generates differentially private marginals (also called differentially private aggregates in this document), then (ii) derives synthetic records from these aggregates in a way that retains the same differential privacy guarantees under the post-processing property.
|--------------|  (i)  |---------------|  (ii)  |-------------------|
| Tabular data | ----> | DP aggregates | -----> | DP synthetic data |
|--------------|       |---------------|        |-------------------|
Aggregate counts or marginals are the counts of $k$-tuples in the data representing certain combinations of attributes.

Let the reporting length $R$ be the maximum combination length for which counts are computed. For a given tabular data input, the set of all non-empty $k$-tuples with $k \leq R$, along with their counts, constitutes the aggregate data.
For example, given the following dataset and a reporting length of 3:

A | B | C |
---|---|---|
a1 | b1 | c1 |
a1 | b2 | c1 |
a2 | | c2 |
a2 | b2 | c1 |
a1 | b2 | |
Then the reported aggregate counts are:

- $count(A = a1) = 3$
- $count(A = a2) = 2$
- $count(B = b1) = 1$
- $count(B = b2) = 3$
- $count(C = c1) = 3$
- $count(C = c2) = 1$
- $count(A = a1, B = b1) = 1$
- $count(A = a1, B = b2) = 2$
- $count(A = a1, C = c1) = 2$
- $count(A = a2, B = b2) = 1$
- $count(A = a2, C = c1) = 1$
- $count(A = a2, C = c2) = 1$
- $count(B = b1, C = c1) = 1$
- $count(B = b2, C = c1) = 2$
- $count(A = a1, B = b1, C = c1) = 1$
- $count(A = a1, B = b2, C = c1) = 1$
- $count(A = a2, B = b2, C = c1) = 1$
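To make the definition concrete, here is a minimal Python sketch of how such aggregate counts could be computed from tabular records. This is an illustration only, not the SDS implementation; the record layout and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def aggregate_counts(records, reporting_length):
    """Count every non-empty attribute combination up to the reporting length."""
    counts = Counter()
    for record in records:
        # keep only non-empty attribute values, e.g. ("A=a1", "B=b1", "C=c1")
        attributes = [attr for attr in record if attr is not None]
        for k in range(1, reporting_length + 1):
            for combo in combinations(attributes, k):
                counts[combo] += 1
    return counts

# the example dataset above, with None marking empty cells
records = [
    ("A=a1", "B=b1", "C=c1"),
    ("A=a1", "B=b2", "C=c1"),
    ("A=a2", None,   "C=c2"),
    ("A=a2", "B=b2", "C=c1"),
    ("A=a1", "B=b2", None),
]

counts = aggregate_counts(records, reporting_length=3)
print(counts[("A=a1",)])                 # 3
print(counts[("A=a1", "B=b2", "C=c1")])  # 1
```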
In the context of the aggregates, the sensitivity of a record is defined as the maximum number of $k$-tuples that can be generated from the record.

Let $M_{k_j}$ denote the set of $k$-tuples that can be generated from record $j$, so the sensitivity of record $j$ for combination length $k$ is $\Delta_{k,j} = |M_{k_j}|$.

The sensitivity can be interpreted as the maximum number of contributions a record can make to the aggregate data, i.e., the maximum number of non-empty combinations with length $k$ that can be generated from the record.

In this way, the overall sensitivity for combination length $k$ is $\Delta_k = \max_j |M_{k_j}|$.
For example, given the following dataset and $k = 2$:

ID | A | B | C |
---|---|---|---|
1 | a1 | b1 | c1 |
2 | a1 | b2 | c1 |
3 | a2 | | c2 |
Then:
- $\Delta_{k=2,j=1} = \binom{3}{2} = |\{(A = a1, B = b1), (A = a1, C = c1), (B = b1, C = c1)\}| = 3$
- $\Delta_{k=2,j=2} = \binom{3}{2} = |\{(A = a1, B = b2), (A = a1, C = c1), (B = b2, C = c1)\}| = 3$
- $\Delta_{k=2,j=3} = \binom{2}{2} = |\{(A = a2, C = c2)\}| = 1$
- $\Delta_{k=2} = \max_j |M_{2_j}| = \max(3, 3, 1) = 3$
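For illustration, a small sketch of how these sensitivities could be computed (again, not the SDS implementation; names are hypothetical):

```python
from itertools import combinations

def record_sensitivity(record, k):
    """Number of k-tuples record j can generate, i.e. |M_{k_j}| = C(#attributes, k)."""
    attributes = [attr for attr in record if attr is not None]
    return len(list(combinations(attributes, k)))

def overall_sensitivity(records, k):
    """Delta_k: the maximum number of k-tuples any single record can generate."""
    return max(record_sensitivity(record, k) for record in records)

records = [
    ("A=a1", "B=b1", "C=c1"),
    ("A=a1", "B=b2", "C=c1"),
    ("A=a2", None,   "C=c2"),
]

print([record_sensitivity(r, 2) for r in records])  # [3, 3, 1]
print(overall_sensitivity(records, 2))              # 3
```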
To ensure differential privacy guarantees for the aggregate data, noise needs to be added to the reported counts. Furthermore, based on the dataset's domain, there might be some attribute combinations that do not exist in the original dataset. For example:
ID | A | B | C |
---|---|---|---|
1 | a1 | b1 | c1 |
2 | a1 | b2 | c1 |
3 | a2 | | c2 |
To illustrate the process of adding noise to the aggregate data, let's consider the example above with a reporting length of 2 and the following data domain:

- Column A: a1, a2
- Column B: b1, b2
- Column C: c1, c2
From this domain, we can infer the possible attribute combinations and their sensitive counts:
- $k = 1$:
  - $count(A = a1) = 2$
  - $count(A = a2) = 1$
  - $count(B = b1) = 1$
  - $count(B = b2) = 1$
  - $count(C = c1) = 2$
  - $count(C = c2) = 1$
- $k = 2$:
  - $count(A = a1, B = b1) = 1$
  - $count(A = a1, B = b2) = 1$
  - $count(A = a1, C = c1) = 2$
  - $count(A = a1, C = c2) = 0$
  - $count(A = a2, B = b1) = 0$
  - $count(A = a2, B = b2) = 0$
  - $count(A = a2, C = c1) = 0$
  - $count(A = a2, C = c2) = 1$
  - $count(B = b1, C = c1) = 1$
  - $count(B = b1, C = c2) = 0$
  - $count(B = b2, C = c1) = 1$
  - $count(B = b2, C = c2) = 0$
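As an illustration, the possible combinations above can be enumerated directly from the domain; a minimal sketch, assuming attributes are represented as "Column=value" strings:

```python
from itertools import combinations, product

# hypothetical domain: column name -> possible attribute values
domain = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}

def possible_combinations(domain, k):
    """All k-tuples formed by picking k distinct columns and one value from each."""
    result = []
    for columns in combinations(sorted(domain), k):
        for values in product(*(domain[column] for column in columns)):
            result.append(tuple(f"{c}={v}" for c, v in zip(columns, values)))
    return result

print(len(possible_combinations(domain, 1)))  # 6 possible single attributes
print(len(possible_combinations(domain, 2)))  # 12 possible 2-tuples
```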
This means that noise needs to be added to the sensitive counts, and if a noisy count does not exceed a given threshold, the corresponding combination is suppressed from the reported aggregate data. Combinations that do not exist in the original data (zero counts) can still end up being reported if their noisy counts exceed the threshold; these are called fabricated combinations.

Since the algorithm is iterative, we repeat this process starting with $k = 1$ and continue up to and including $k = R$, where $R$ is the reporting length. We start by selecting the combinations whose noisy counts survive the threshold for $k = 1$; then, in the next iteration, only $k$-tuples that include the surviving $(k-1)$-tuples from the previous iteration are considered for sampling.
According to Differentially Private Marginals, the noise added to each reported $k$-tuple count is sampled from a Gaussian distribution whose scale depends on the noise parameter $\sigma_k$ for that combination length and on the sensitivity $\Delta_k$.

Notice that the noise level is directly related to the overall sensitivity $\Delta_k$: the higher the sensitivity, the more noise needs to be added to achieve the same privacy guarantee.

In order to decrease the noise, we can use a differentially private percentile technique to select, as the "allowed sensitivity", the $Q^{th}$ percentile of the records' sensitivities instead of their true maximum.
To lower the sensitivity for a given combination length $k$, we need to randomly drop $k$-tuples from records to make sure the record's sensitivity will not exceed the "allowed sensitivity" computed with DP.
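A minimal sketch of what randomly dropping $k$-tuples to respect an allowed sensitivity could look like (illustrative only; the actual SDS sampling strategy may differ):

```python
import random
from itertools import combinations

def sample_k_tuples(record, k, allowed_sensitivity, rng=random):
    """Randomly keep at most `allowed_sensitivity` k-tuples from a record,
    so its contribution to the aggregate counts stays bounded."""
    attributes = [attr for attr in record if attr is not None]
    k_tuples = list(combinations(attributes, k))
    if len(k_tuples) <= allowed_sensitivity:
        return k_tuples
    return rng.sample(k_tuples, allowed_sensitivity)

record = ("A=a1", "B=b1", "C=c1", "D=d1")
# only 3 of the 6 possible 2-tuples are kept for this record
print(sample_k_tuples(record, 2, allowed_sensitivity=3))
```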
From Differentially Private Marginals, to satisfy $(\varepsilon, \delta)$-differential privacy, the noise parameters $\sigma_k$ and the sensitivities $\Delta_k$ must satisfy the following condition:

(EQ1)

Assuming the total privacy budget to be $\varepsilon$ and the delta parameter to be $\delta$, then:

(EQ2)
If a value for $\delta$ is not directly provided, it can be inferred from the protected number of records computed with DP.
Besides, based on EQ1 and EQ2 we can:

- Call $\rho=(\sqrt{\varepsilon_M + \ln(2/\delta)} - \sqrt{\ln(2/\delta)})^2$
- Define $Q_{p}$ as the proportion of the privacy budget dedicated to finding the $Q^{th}$ percentiles
- Define $N_{p}$ as the proportion of the total privacy budget dedicated to finding the protected number of records
- Call $\varepsilon_N = N_{p} \cdot \varepsilon$ and $\varepsilon_M = \varepsilon - \varepsilon_N$
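Putting these definitions together, here is a small sketch of the budget split. The formula for $\rho$ is the one stated above; the function name and the example parameter values are arbitrary illustrations.

```python
import math

def split_budget(epsilon, delta, number_of_records_proportion):
    """Split the total (epsilon, delta) budget following the definitions above."""
    epsilon_n = number_of_records_proportion * epsilon  # budget for the protected number of records
    epsilon_m = epsilon - epsilon_n                     # budget left for the aggregate data
    rho = (math.sqrt(epsilon_m + math.log(2 / delta)) - math.sqrt(math.log(2 / delta))) ** 2
    return epsilon_n, epsilon_m, rho

# arbitrary example values
epsilon_n, epsilon_m, rho = split_budget(epsilon=4.0, delta=1e-6, number_of_records_proportion=0.005)
print(epsilon_n, epsilon_m, rho)
```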
From the above assumptions we know that: (i) $\varepsilon_N$ is the budget spent to estimate the protected number of records with DP; (ii) a proportion $Q_p$ of the remaining budget is dedicated to finding the $Q^{th}$ percentiles (the allowed sensitivities), which directly tells us how much budget is left for adding noise to the counts; and (iii) that remaining budget still needs to be distributed across the combination lengths to determine each $\sigma_k$.

To solve (iii) and find the individual noise parameters, SDS lets the user specify sigma proportions $p_1, ..., p_R$ such that:

- $\sigma_1 = p_1 \cdot \sigma$
- $\sigma_2 = p_2 \cdot \sigma$
- ...
- $\sigma_R = p_R \cdot \sigma$

Thus, replacing each $\sigma_k$ by $p_k \cdot \sigma$ in (EQ1) leaves a single unknown $\sigma$, which can then be solved for directly.
To summarize, to control the allocation of the privacy budget, SDS exposes the following parameters:

- Total privacy budget $= \varepsilon$
- Delta $= \delta$, which can also be inferred from the protected number of records
- Percentile epsilon proportion $= Q_p$, where $0 < Q_p < 1$
- Number of records epsilon proportion $= N_p$, where $0 < N_p < 1$
- Sigma proportions $= [p_1, p_2, ..., p_k]$, where $p_k > 0$
To illustrate the sigma proportions, let's assume a reporting length of 3. Then, for example:

- $[p_1=1.0, p_2=1.0, p_3=1.0]$: this evenly splits the privacy budget across the combination lengths, resulting in $\sigma_1 = \sigma_2 = \sigma_3 = \sigma$
- $[p_1=1.0, p_2=\frac{1}{2}, p_3=\frac{1}{3}]$: this prioritizes smaller errors for longer combination lengths, resulting in $\sigma_1 = \sigma$; $\sigma_2 = \frac{\sigma}{2}$; $\sigma_3 = \frac{\sigma}{3}$
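A tiny sketch of how the proportions map to noise parameters, using an arbitrary base $\sigma$ for illustration:

```python
def sigmas_from_proportions(base_sigma, proportions):
    """sigma_k = p_k * sigma for each combination length k."""
    return [p * base_sigma for p in proportions]

print(sigmas_from_proportions(2.0, [1.0, 1.0, 1.0]))      # [2.0, 2.0, 2.0]
print(sigmas_from_proportions(2.0, [1.0, 1 / 2, 1 / 3]))  # [2.0, 1.0, 0.666...]
```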
As explained above, the thresholds determine which noisy counts get included in the reported aggregate data.

As described in Differentially Private Marginals, the threshold for the $k$-tuples can be derived from the noise parameter $\sigma_k$ and the allowed sensitivity. Note that filtering noisy counts in this way is post-processing, so it does not weaken the differential privacy guarantees.

SDS allows fixed threshold values to be specified per combination length. For example, if the fixed thresholds for combination lengths 2 and 3 are set to 10 and 20, respectively, then:

- only $2$-tuples with noisy count $> 10$ will be included in the aggregate data
- only $3$-tuples with noisy count $> 20$ will be included in the aggregate data
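A minimal sketch of applying such a fixed threshold to noisy counts (illustrative values; not the SDS implementation):

```python
def retain_above_threshold(noisy_counts, threshold):
    """Keep only the combinations whose noisy count exceeds the threshold."""
    return {combo: count for combo, count in noisy_counts.items() if count > threshold}

# illustrative noisy 2-tuple counts
noisy_two_tuples = {
    ("A=a1", "B=b1"): 12.3,  # kept
    ("A=a1", "C=c2"): 4.7,   # suppressed (or never reported, if it came from a zero count)
}
print(retain_above_threshold(noisy_two_tuples, threshold=10))
```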
Since fabricated combinations will have their counts sampled from noise alone (their original count is zero), higher thresholds filter out more fabrication, at the cost of also suppressing more legitimate combinations.

SDS also allows adaptive thresholds to be specified through multipliers $\eta_k$, with $0 < \eta_k \leq 1$, one per combination length greater than 1, that control how much fabrication is filtered. Notice that when $\eta_k = 1.0$, the fabrication for length $k$ is left at its maximum (no extra filtering), while smaller values filter more fabrication.

To illustrate the adaptive thresholds, let's assume a reporting length of 4. Then, for example:

- $[\eta_2 = 1.0, \eta_3 = 1.0, \eta_4 = 1.0]$: leaves the fabrication at its maximum
- $[\eta_2 = 0.01, \eta_3 = 1.0, \eta_4 = 1.0]$: tries to minimize fabrication for the $2$-tuples
- $[\eta_2 = 0.1, \eta_3 = 0.55, \eta_4 = 1.0]$: linearly decreases the amount of fabrication that is being filtered as the combination length grows
As mentioned above, notice the tradeoff: the more fabrication we filter, the more attribute combinations will be suppressed from the reported aggregate data.
Notice that filtered $2$-tuples will propagate to $k$-tuples with $k > 2$: if a given $2$-tuple is filtered, the $k$-tuples with $k > 2$ containing it will not be reported.
Since attribute combination counts will have noise added to them, and some combinations will be fabricated while others are suppressed, a $k$-tuple might end up being reported with a higher count than one of its sub-combinations. For example:

- $count(A = a1, B = b1) = 25$
- $count(A = a1, B = b1, C = c1) = 30$
- $count(A = a1, B = b1, C = c2) = 40$

In the original data, this does not happen: for any given combination, its count is always greater than or equal to the count of any longer combination that contains it.

To retain this property in the aggregate data while also preserving DP, SDS will normalize the reported noisy counts to follow this rule:

- $count(A = a1, B = b1) = 25$
- $count(A = a1, B = b1, C = c1) = 25^*$
- $count(A = a1, B = b1, C = c2) = 25^*$

\* Notice this decision is based on another noisy count, so the same DP guarantees are kept.
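A minimal sketch of this normalization rule, assuming the aggregate data is stored as one dictionary of noisy counts per combination length (illustrative only; the actual SDS implementation may differ):

```python
from itertools import combinations

def normalize(aggregate_data, reporting_length):
    """Cap every k-tuple's noisy count so it never exceeds the noisy count
    of any of its reported sub-combinations."""
    for k in range(2, reporting_length + 1):
        for combo, count in aggregate_data.get(k, {}).items():
            sub_counts = [
                aggregate_data[k - 1][sub]
                for sub in combinations(combo, k - 1)
                if sub in aggregate_data.get(k - 1, {})
            ]
            if sub_counts:
                aggregate_data[k][combo] = min(count, min(sub_counts))
    return aggregate_data

# the example above: noisy counts keyed by combination length
aggregate_data = {
    2: {("A=a1", "B=b1"): 25},
    3: {("A=a1", "B=b1", "C=c1"): 30, ("A=a1", "B=b1", "C=c2"): 40},
}
print(normalize(aggregate_data, 3)[3])  # both 3-tuples are capped at 25
```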
High-level pseudo-code to compute the aggregate counts with DP, given the following parameters:

- epsilon $\gets \varepsilon$
- delta $\gets \delta$
- reporting_length $\in [1,\infty)$
- percentile $\in [1,100]$
- percentile_epsilon_proportion $\in (0, 1)$
- sigma_proportions $\gets [p_1, ..., p_k]$
- threshold_type $\gets fixed | adaptive$
- thresholds $\gets [t_1, ..., t_k]$
- input_data
This is just pseudo-code that will not run if you copy and paste it.
aggregate_data = {}

# find the gaussian scales for each combination length k
# and the privacy budget dedicated to the percentile with DP
sigmas, percentile_epsilon = split_privacy_budget(
    reporting_length,
    epsilon,
    delta,
    percentile_epsilon_proportion,
    sigma_proportions
)

# loop from k = 1...reporting_length
for k in range(1, reporting_length + 1):
    if k == 1:
        # find the set of single attributes from the data
        #
        # all the attributes start with a count of 0
        k_tuples = find_single_attributes(input_data)
    else:
        # only surviving tuples from the previous iteration
        # will be used
        #
        # all the tuples will start with a count of 0
        k_tuples = find_k_tuples_based_on_previous_iteration(
            input_data,
            k,
            aggregate_data[k - 1]
        )

    # find all valid k-tuples for each record
    #
    # valid k-tuples are the ones that have survived
    # until this iteration
    k_tuples_by_record = find_valid_k_tuples_by_record(
        input_data,
        k_tuples,
        k
    )

    # compute the allowed sensitivity using DP-percentile
    allowed_sensitivity = compute_allowed_sensitivity_with_dp(
        k_tuples_by_record,
        percentile,
        percentile_epsilon / k
    )

    # randomly pick tuples for each record and increment
    # their counts until we reach the allowed sensitivity
    #
    # this will update the k_tuple counts from 0 to counts
    # that ensure the allowed sensitivity is not exceeded
    k_tuples = increment_counts_based_on_allowed_sensitivity(
        k_tuples,
        k_tuples_by_record,
        k,
        allowed_sensitivity
    )

    # add gaussian noise to the current counts
    k_tuples = add_gaussian_noise(
        k_tuples,
        sigmas[k],
        allowed_sensitivity
    )

    # compute rhos based on the threshold types
    # and provided threshold values
    rhos = compute_rhos(
        threshold_type,
        thresholds,
        sigmas[k],
        allowed_sensitivity
    )

    # retain counts based on rhos[k]
    k_tuples = retain_k_tuples_based_on_rho(k_tuples, rhos[k])

    # store the surviving k-tuples for this combination length
    aggregate_data[k] = k_tuples

aggregate_data = normalize(aggregate_data)
SDS synthesizes data directly from the differentially private aggregates, without querying the sensitive data. This way, the generated synthetic data will preserve the same guarantees present in the aggregates computed with differential privacy.
The synthesis mode used to synthesize data from the DP aggregates is called aggregate-seeded. The idea behind it is to sample attributes and build records trying to match not only the single-attribute count distributions in the aggregate data, but also all the reported $k$-tuple counts. For example, given the following reported aggregate counts:

- $count(A = a1) = 3$
- $count(A = a2) = 2$
- $count(B = b1) = 1$
- $count(B = b2) = 3$
- $count(C = c1) = 3$
- $count(C = c2) = 1$
- $count(A = a1, B = b1) = 1$
- $count(A = a1, B = b2) = 2$
- $count(A = a1, C = c1) = 2$
- $count(A = a2, B = b2) = 1$
- $count(A = a2, C = c1) = 1$
- $count(A = a2, C = c2) = 1$
- $count(B = b1, C = c1) = 1$
- $count(B = b2, C = c1) = 2$
- $count(A = a1, B = b1, C = c1) = 1$
- $count(A = a1, B = b2, C = c1) = 1$
- $count(A = a2, B = b2, C = c1) = 1$
Then the aggregate-seeded synthesis will attempt to reproduce these distributions in the output dataset.
The following sections explain in more detail how aggregate-seeded synthesis works.
The general concept behind this synthesis can be expressed by the following algorithm:
- reporting_length $\leftarrow$ the same used to generate the aggregate data
- aggregate_data
This is just pseudo-code that will not run if you copy and paste it.
# this will get the 1-tuple counts from the aggregate_data
#
# therefore, this will contain only the single attribute counts
single_attributes_available = count_single_attributes(aggregate_data)

# stores the resulting synthetic records
synthesized_records = []

# this keeps track of the already synthesized k-tuple counts
already_synthesized_counts = {}

# we keep going, while we have not used all the available
# single attributes
while len(single_attributes_available) > 0:
    synthetic_record = []

    while True:
        # sample the next attribute
        attr = sample_next_attr(
            aggregate_data,
            synthetic_record,
            single_attributes_available,
            already_synthesized_counts
        )

        # if we could sample one more attribute
        # add it to the result
        if attr is not None:
            synthetic_record.append(attr)

            # now we decrement the used attribute count
            single_attributes_available[attr] -= 1

            # if the count reaches 0, this attribute
            # will not be available for sampling anymore
            if single_attributes_available[attr] == 0:
                del single_attributes_available[attr]
        else:
            break

    # update the already synthesized k-tuple counts
    for k in range(1, reporting_length + 1):
        for comb in combinations(synthetic_record, k):
            already_synthesized_counts[comb] = already_synthesized_counts.get(comb, 0) + 1

    synthesized_records.append(synthetic_record)
Notice that the overall algorithm described above includes a function called `sample_next_attr`. The goal of this function is to sample a new attribute to be added to the currently synthesized record.

This function works by randomly sampling attributes from the list of available single attributes. Despite having a level of randomness, it will try to follow the same distribution of counts present in the aggregate data.
Generally speaking, it performs sampling based on weights, so let's consider the following scenario:
Attribute | Weight |
---|---|
a1 | 1 |
b1 | 5 |
c1 | 10 |
Even though all of the 3 attributes above have a chance of being sampled, $c1$ is 10 times more likely to be picked than $a1$, and twice as likely as $b1$.
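A minimal sketch of such weighted sampling using Python's standard library (illustrative only):

```python
import random
from collections import Counter

attributes = ["a1", "b1", "c1"]
weights = [1, 5, 10]

# sample many times to see the weights reflected in the frequencies
samples = random.choices(attributes, weights=weights, k=10_000)
print(Counter(samples))  # roughly a 1 : 5 : 10 ratio
```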
Also, when sampling, it is important to ensure that (a minimal sketch of these checks follows this list):

- Attributes already present in the currently synthesized record cannot be sampled again
- Only attributes from columns that are not yet present in the synthetic record can be sampled (e.g., if $a1$ from column A has already been sampled, $a2$, also from column A, cannot be)
- Attributes that, when added to the currently synthesized record, would create attribute combinations that do not exist in the DP aggregate data cannot be sampled (e.g., if the currently synthesized record is $(A = a1, B = b1, C = c2)$ and we try to sample $D = d2$, then if, for instance, $(B = b1, D = d2)$ is not reported in the aggregate data, sampling $D = d2$ is not an option)
  - This guarantees that the synthesis will not fabricate new attribute combinations. However, combinations fabricated during aggregation with DP are reported in the aggregate data and might therefore also appear in the synthetic data.
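The checks above could look roughly like the following sketch, assuming attributes are represented as "Column=value" strings and the aggregate data maps sorted combinations to counts (illustrative only, not the SDS implementation):

```python
from itertools import combinations

def column_of(attr):
    """Hypothetical representation: attributes are "Column=value" strings."""
    return attr.split("=")[0]

def can_sample(attr_candidate, synthetic_record, aggregate_data, reporting_length):
    """Return True if the candidate respects the sampling constraints above."""
    # the candidate's column must not be used yet
    # (this also prevents re-sampling the exact same attribute)
    if column_of(attr_candidate) in {column_of(a) for a in synthetic_record}:
        return False
    # adding the candidate must not create combinations absent from the aggregate data
    candidate_record = synthetic_record + [attr_candidate]
    for k in range(1, min(len(candidate_record), reporting_length) + 1):
        for combo in combinations(sorted(candidate_record), k):
            if attr_candidate in combo and combo not in aggregate_data:
                return False
    return True

# aggregate data keyed by sorted attribute combinations
aggregate_data = {("A=a1",): 3, ("B=b1",): 1, ("A=a1", "B=b1"): 1}
print(can_sample("B=b1", ["A=a1"], aggregate_data, reporting_length=2))  # True
print(can_sample("B=b2", ["A=a1"], aggregate_data, reporting_length=2))  # False
```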
On the other hand, we also need to calculate the weights used for sampling. Adding a new attribute to the current synthetic record creates a new record whose number of attributes is either: (i) less than or equal to the reporting length, or (ii) greater than the reporting length.
Let's say the currently synthesized record is $(A = a1, B = b1)$, the reporting length is 3, and the attribute candidates for sampling are $C = c1$, $C = c2$, and $D = d1$. Then:

- $C = c1 \rightarrow (A = a1, B = b1, C = c1)$; $count(A = a1, B = b1, C = c1) = 1$
- $C = c2 \rightarrow (A = a1, B = b1, C = c2)$; $count(A = a1, B = b1, C = c2) = 5$
- $D = d1 \rightarrow (A = a1, B = b1, D = d1)$; $(A = a1, B = b1, D = d1)$ does not exist in the aggregate data

This means that $C = c1$ will be sampled with a weight of 1, $C = c2$ with a weight of 5, and $D = d1$ cannot be sampled at all.
For as long as the size of the synthesized record plus the new attribute candidate for sampling does not exceed the reporting length, the weight can be a direct lookup in the aggregate data. However, if it does exceed the reporting length, we can proceed by doing the following:
- synthetic_record $\leftarrow$ the record already synthesized so far
- attr_candidate $\leftarrow$ the attribute candidate we want to calculate the weight for
- weight_selection_percentile $\leftarrow$ the percentile we want from all the weight candidates (default 95)
- aggregate_data
This is just pseudo-code that will not run if you copy and paste it.
# store all weight candidates for the percentile technique
weight_candidates = []

# current record with the new attr candidate added to it
current_candidate = synthetic_record + [attr_candidate]

# for every combination length from 1 up to and including the
# reporting length
for k in range(1, reporting_length + 1):
    # get all the k-length sub-combinations that include the
    # attribute we want to get the weight for
    for sub_combination in combinations(current_candidate, k):
        if attr_candidate in sub_combination:
            if sub_combination in aggregate_data:
                weight_candidates.append(aggregate_data[sub_combination])
            else:
                # adding the attribute would create
                # a combination that does not exist in
                # the aggregate data - we will not do that
                return None

if len(weight_candidates) == 0:
    return None

# calculate the percentile that will represent the
# weight for the attribute candidate
return percentile(weight_candidates, weight_selection_percentile)
Notice that the aggregate counts are already DP, so we don't need differential privacy to compute the percentile of weight candidates here.
So far, every time the aggregate data is queried, either to directly get a weight for sampling, or to build the list of weight candidates for the percentile technique, the initial aggregate counts are used irrespective of the attributes already accounted for in the synthetic data.
Sometimes, depending on the dataset, it might be desirable to use the count of the already-synthesized attributes to help guide the sampling process.
If we look back at the overall algorithm, we already keep track of the already-synthesized attribute combination counts in `already_synthesized_counts`. So, when we perform the lookup in the aggregate data, instead of using the raw count, we could use `aggregate_data[sub_combination] - already_synthesized_counts[sub_combination]`. In this way, attributes that have already been sampled will have a lower chance of being sampled again.
This is controlled by the `use_synthetic_counts` flag in SDS.
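A minimal sketch of that adjusted lookup (illustrative only; the clamping at zero is an assumption made for this sketch, not something taken from the SDS documentation):

```python
def combination_weight(sub_combination, aggregate_data, already_synthesized_counts, use_synthetic_counts):
    """Weight contributed by a sub-combination during sampling."""
    count = aggregate_data.get(sub_combination, 0)
    if use_synthetic_counts:
        # subtract what has already been synthesized
        # (clamping at zero is an assumption made for this sketch)
        count = max(count - already_synthesized_counts.get(sub_combination, 0), 0)
    return count

aggregate_data = {("A=a1", "B=b1"): 10}
already_synthesized_counts = {("A=a1", "B=b1"): 7}
print(combination_weight(("A=a1", "B=b1"), aggregate_data, already_synthesized_counts, True))   # 3
print(combination_weight(("A=a1", "B=b1"), aggregate_data, already_synthesized_counts, False))  # 10
```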