CS 208 HW 3: The Centralized Curator Model

Instructions: Submit a single PDF file containing your solutions, plots, and analyses. Make sure to thoroughly explain your process and results for each problem. Also include your documented code and a link to a public repository with your code (such as GitHub/GitLab). Make sure to list all collaborators and references.

  1. Tails, Trimming, and Winsorization: In all of the parts below, the dataset is x ∈ {0,1,…,D}^n. In all of the implementation parts, you should write code that takes as input D ∈ ℕ, n ∈ ℕ, x ∈ {0,1,…,D}^n, and ε > 0.
    • Prove that the following algorithm for estimating a Trimmed mean is ε-DP and implement it in code:

\[
M(x) \;=\; \frac{1}{0.9n} \sum_{i \,:\, P_{.05} \le x_i \le P_{.95}} x_i \;+\; \mathrm{Lap}\!\left(\frac{D}{0.9\,n\,\varepsilon}\right),
\]

where P.05 and P.95 are the 5th and 95th percentiles of the dataset. That is, we are applying the Laplace mechanism after removing the bottom and top 5% of the dataset. (Hint: Think about Lipschitz constants.)
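As a rough illustration, here is a minimal Python sketch of the mechanism as reconstructed above. The normalization 0.9n and the noise scale D/(0.9nε) are assumptions to be verified by your sensitivity argument, not the official answer:

```python
import numpy as np

def dp_trimmed_mean(x, D, epsilon):
    """Sketch of an epsilon-DP trimmed mean: average the middle ~90% of
    the data and add Laplace noise with an assumed scale D/(0.9*n*epsilon)."""
    x = np.clip(np.asarray(x, dtype=float), 0, D)  # enforce the domain {0,...,D}
    n = len(x)
    x_sorted = np.sort(x)
    lo, hi = int(np.ceil(0.05 * n)), int(np.floor(0.95 * n))
    trimmed_sum = x_sorted[lo:hi].sum()            # drop bottom and top ~5%
    scale = D / (0.9 * n * epsilon)                # assumed sensitivity / epsilon
    return trimmed_sum / (0.9 * n) + np.random.laplace(scale=scale)
```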

  • Prove that for large enough n, the analogous algorithm for the Winsorized mean is not ε-DP:

\[
M(x) \;=\; \frac{1}{n} \sum_{i=1}^{n} \big[x_i\big]_{P_{.05}}^{P_{.95}} \;+\; \mathrm{Lap}\!\left(\frac{P_{.95}-P_{.05}}{n\,\varepsilon}\right),
\]

where [x]_{P.05}^{P.95} denotes clamping x to the interval [P.05, P.95], as defined in Problem Set 2. In Winsorization, we clamp points rather than dropping them. (In class on 3/11, we incorrectly referred to dropping points as Winsorization.) Again, it may be useful to first think in terms of Lipschitz constants.

  • In class, we saw how to use the exponential mechanism to release an estimate of the median, P.5. Describe and implement a version of the exponential mechanism that releases an estimate of the tth percentile Pt of a dataset x ∈ {0,…,D}^n for any desired t ∈ [0,100]. (A direct implementation of the exponential mechanism would require explicitly calculating weights for each of the D + 1 possible outputs, which can be too slow for large values of D such as in the parts below. One way to solve this is to bin the elements into fixed, coarser intervals. Alternatively, you can sample more quickly from the output distribution of the exponential mechanism by noting that if you sort the elements of the dataset x_{i_1} ≤ x_{i_2} ≤ ··· ≤ x_{i_n}, then all elements of each interval between x_{i_j} and x_{i_{j+1}} have the same weight, so you can sample by choosing an interval with probability proportional to the sum of weights within it and then sampling uniformly from that interval. Feel free to use either solution below.)
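A Python sketch of the interval-sampling approach, assuming the rank-based utility u(x, y) = −|#{i : x_i < y} − (t/100)·n| (sensitivity 1) carried over from the median example; tie and boundary conventions are design choices you should state and may differ from yours:

```python
import numpy as np

def dp_percentile(x, D, epsilon, t):
    """Epsilon-DP estimate of the t-th percentile via the exponential
    mechanism, sampling interval-by-interval instead of enumerating
    all D + 1 candidate outputs."""
    x = np.sort(np.clip(np.asarray(x, dtype=float), 0, D))
    n = len(x)
    # Between consecutive sorted points every candidate output has the
    # same utility, so work with the n + 1 intervals they induce.
    edges = np.concatenate(([0.0], x, [float(D)]))
    k = np.arange(n + 1)                              # number of points below interval k
    utility = -np.abs(k - (t / 100.0) * n)            # rank-based utility, sensitivity 1
    log_weight = epsilon * utility / 2.0              # exponential mechanism log-weight
    with np.errstate(divide="ignore"):                # empty intervals -> log(0) = -inf
        log_mass = log_weight + np.log(np.diff(edges))
    log_mass -= log_mass.max()                        # stabilize before exponentiating
    probs = np.exp(log_mass)
    probs /= probs.sum()
    j = np.random.choice(n + 1, p=probs)
    return np.random.uniform(edges[j], edges[j + 1])  # round if an integer output is required
```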
  • Implement the following ε-DP algorithm for estimating a Trimmed mean of a dataset: use your algorithm from Part 1c to get ε/3-DP estimates P̂.05 and P̂.95 of the 5th and 95th percentiles, drop all datapoints that lie outside the range [P̂.05, P̂.95], and then use the Laplace mechanism to compute an (ε/3)-DP mean of the trimmed data. That is, your code should compute

\[
M(x) \;=\; \frac{1}{0.9n} \sum_{i \,:\, \hat P_{.05} \le x_i \le \hat P_{.95}} x_i \;+\; \mathrm{Lap}\!\left(\frac{\hat P_{.95}-\hat P_{.05}}{0.9\,n\,(\varepsilon/3)}\right).
\]
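A sketch of this pipeline in Python, reusing the dp_percentile sketch from Part 1c; the normalization and noise scale mirror the display equation above and should be checked against your own sensitivity analysis:

```python
import numpy as np

def dp_trimmed_mean_estimated_percentiles(x, D, epsilon):
    """Epsilon-DP trimmed mean: epsilon/3 per percentile estimate plus
    epsilon/3 for the Laplace mean of the retained points."""
    x = np.clip(np.asarray(x, dtype=float), 0, D)
    n = len(x)
    p05 = dp_percentile(x, D, epsilon / 3.0, t=5)
    p95 = dp_percentile(x, D, epsilon / 3.0, t=95)
    lo, hi = min(p05, p95), max(p05, p95)            # noisy estimates can cross
    kept = x[(x >= lo) & (x <= hi)]
    scale = (hi - lo) / (0.9 * n * (epsilon / 3.0))  # assumed scale; verify it
    return kept.sum() / (0.9 * n) + np.random.laplace(scale=scale)
```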

  • Determine whether or not the following analogue for a Winsorized mean is ε-DP: use Part 1c to get ε/3-DP estimates P̂.05 and P̂.95 of the 5th and 95th percentiles, and output

\[
M(x) \;=\; \frac{1}{n} \sum_{i=1}^{n} \big[x_i\big]_{\hat P_{.05}}^{\hat P_{.95}} \;+\; \mathrm{Lap}\!\left(\frac{\hat P_{.95}-\hat P_{.05}}{n\,(\varepsilon/3)}\right).
\]

You do not need to formally prove your answer, but you should at least provide an informal explanation.

  • The provided csv dataset contains the 5% PUMS Census file for Massachusetts. For ε = 1 and D = 1,000,000, compare the RMSE between DP means and the actual means for each PUMA in Massachusetts,[1] for DP means calculated using (i) the ordinary Laplace mechanism for a mean (remembering to clamp your data to the range!) and (ii) the algorithm from Part 1d. Also show box-and-whisker plots of the DP released means for each PUMA by these algorithms, noting the true means. You should probably order these by mean income, or perhaps skew of income, or anything else you think reveals an interesting pattern. Give an intuitive explanation of the kinds of datasets on which algorithm (i) is likely to perform better than algorithm (ii) and vice versa. Describe any modifications you might propose that would increase the utility (at the same level of privacy) for data similar to this income example.
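One possible evaluation harness, reusing the sketches above. The filename is not reproduced here, and the "income"/"puma" column names are placeholders; substitute whatever the provided csv actually uses:

```python
import numpy as np
import pandas as pd

def dp_clamped_mean(x, D, epsilon):
    """Algorithm (i): clamp to the public range [0, D], then add Laplace
    noise scaled to the clamped mean's sensitivity D/n."""
    x = np.clip(np.asarray(x, dtype=float), 0, D)
    return x.mean() + np.random.laplace(scale=D / (len(x) * epsilon))

def rmse_by_puma(df, D=1_000_000, epsilon=1.0, reps=100):
    """Per-PUMA RMSE of (i) the clamped Laplace mean versus (ii) the
    Part 1d algorithm, averaged over `reps` noise draws."""
    rows = []
    for puma, grp in df.groupby("puma"):      # placeholder column name
        x = grp["income"].to_numpy()          # placeholder column name
        true_mean = x.mean()
        est_i = np.array([dp_clamped_mean(x, D, epsilon) for _ in range(reps)])
        est_ii = np.array([dp_trimmed_mean_estimated_percentiles(x, D, epsilon)
                           for _ in range(reps)])
        rows.append({"puma": puma,
                     "rmse_laplace": np.sqrt(np.mean((est_i - true_mean) ** 2)),
                     "rmse_trimmed": np.sqrt(np.mean((est_ii - true_mean) ** 2))})
    return pd.DataFrame(rows)
```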
  2. Composition: Suppose you have a global privacy budget of ε = 1 (and are willing to tolerate δ = 10⁻⁹) and you want to release k count queries (i.e., sums of Boolean predicates[2]) using the Laplace mechanism with an individual privacy loss of ε0. By basic composition, you can set ε0 = ε/k. Using the advanced composition theorem, you can set ε0 = ε/√(2k ln(1/δ)). We have provided you with code from PSI for the "optimal" composition theorem for differential privacy, which calculates the largest value of ε0 that ensures global (ε, δ)-DP as a function of ε, δ, and k.[3] For each of these choices, plot (on the same graph) the standard deviation of the Laplace noise added to each query as a function of k, and find the smallest values of k where the advanced and optimal composition theorems strictly improve upon the basic composition theorem.
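A sketch of the basic-versus-advanced comparison in Python (the "optimal" curve would come from the provided PSI update_parameters code, which is not reproduced here). Lap(b) has standard deviation √2·b, and a count query has sensitivity 1, so each query gets noise with standard deviation √2/ε0:

```python
import numpy as np
import matplotlib.pyplot as plt

eps, delta = 1.0, 1e-9
ks = np.arange(1, 501)

eps0_basic = eps / ks                                  # basic composition
eps0_adv = eps / np.sqrt(2 * ks * np.log(1 / delta))   # advanced composition

sd_basic = np.sqrt(2) / eps0_basic                     # sd of Lap(1/eps0) noise
sd_adv = np.sqrt(2) / eps0_adv

plt.plot(ks, sd_basic, label="basic")
plt.plot(ks, sd_adv, label="advanced")
# Add the "optimal" curve here using the provided PSI code.
plt.xlabel("number of queries k")
plt.ylabel("sd of per-query Laplace noise")
plt.legend()
plt.show()

better = ks[sd_adv < sd_basic]                         # where advanced strictly wins
print(better[0] if better.size else "no improvement in range")
```

Under the simplified bound above, advanced composition first beats basic once k > 2 ln(1/δ) ≈ 41.4, i.e., at k = 42.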

  3. Synthetic Data: Expanding the template from class, and again using the csv dataset from Problem 1, create a DP three-way histogram[4] release of income, education, and age. You do not need to graph this histogram; just compute the release for each binned combination of the variables. From this, you should be able to generate synthetic data for these three variables. Run a linear regression as a post-process on your synthetic data, predicting income from education and age[5] using the equation:

\[
\text{Income}_i = \beta_0 + \beta_1\,\text{Education}_i + \beta_2\,\text{Age}_i + \nu_i, \qquad \nu_i \sim N(0, \sigma^2) \tag{1}
\]

Let β̂ denote the coefficients estimated on the full sensitive data, and β̃ the DP release we generate. The mean-squared error of the DP release β̃ can be decomposed into the contributions of bias and variance as:

\[
\mathrm{MSE}(\tilde\beta) = \mathrm{bias}(\tilde\beta)^2 + \mathrm{var}(\tilde\beta) = \big(\mathbb{E}[\tilde\beta] - \beta\big)^2 + \mathbb{E}\big[(\tilde\beta - \mathbb{E}[\tilde\beta])^2\big] \tag{2}
\]

For this calculation, we are taking the (sensitive) regression coefficients β̂ computed on the entire dataset as the true values of β. Show the contributions to MSE of the bias and variance of the DP regression coefficients.[6]

As a baseline for deciding whether these squared-bias and variance terms are large, we can compute the MSE due simply to sampling: bootstrap new datasets with replacement, compute new (sensitive) regression estimates β̂ on each bootstrapped dataset, and compute MSE(β̂). How do the bias and variance terms due to creating DP releases compare to this numerical estimate of the error introduced by sampling?
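A compact Python sketch of the whole pipeline under assumed column handling: bin the three variables on fixed grids (for a truly DP release the bin edges must come from public bounds, not from the data, as flagged in the comments), release a Laplace-noised histogram, sample synthetic rows from it, and run the OLS post-process. Repeating this many times and averaging gives the bias and variance terms (footnote [6]):

```python
import numpy as np

def dp_synthetic_regression(income, educ, age, epsilon, n_bins=20):
    """DP three-way histogram -> synthetic data -> OLS post-process,
    returning (beta0, beta1, beta2) for income ~ educ + age."""
    data = np.column_stack([income, educ, age]).astype(float)
    # CAUTION: for real DP these edges must be fixed/public, not data-derived.
    edges = [np.linspace(col.min(), col.max(), n_bins + 1) for col in data.T]
    idx = np.stack([np.digitize(data[:, j], edges[j][1:-1]) for j in range(3)], axis=1)
    counts = np.zeros((n_bins,) * 3)
    np.add.at(counts, tuple(idx.T), 1)
    # Cells are disjoint, so Laplace(2/epsilon) noise on every cell count
    # gives an epsilon-DP histogram (under change-one adjacency).
    noisy = np.clip(counts + np.random.laplace(scale=2.0 / epsilon, size=counts.shape), 0, None)
    probs = noisy.flatten() / noisy.sum()
    # Synthetic rows: sample a cell, represent it by its bin midpoints.
    mids = [(e[:-1] + e[1:]) / 2 for e in edges]
    cells = np.random.choice(noisy.size, size=len(data), p=probs)
    ijk = np.unravel_index(cells, counts.shape)
    synth = np.column_stack([mids[j][ijk[j]] for j in range(3)])
    # OLS post-process (post-processing costs no additional privacy).
    X = np.column_stack([np.ones(len(synth)), synth[:, 1], synth[:, 2]])
    beta, *_ = np.linalg.lstsq(X, synth[:, 0], rcond=None)
    return beta
```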

  4. BONUS: Using your developed understanding of differential privacy, and the use case described in the Gaboardi et al. PSI paper, reexamine the deployed instance of the PSI budgeting tool, available at http://psiprivacy.org. Provide any feedback that you think would make the interface easier for the intended non-expert “data owner” user to budget a DP release, or that would otherwise improve the system. (Note: Insightful, considered feedback will receive a 1/2 point bonus, and feedback that strikes us as revelatory or particularly intriguing will receive a 1 point bonus and a note of thanks in a future paper draft.)
  5. Final Project: By April 9, submit a couple of pages giving a detailed description of what your final project will look like. You should be able to clearly state your research questions, briefly articulate how your project relates to what has been done in the past, describe the approach you are taking, give your timeline for completing various aspects of the project, and discuss your fallback plan in case you don’t obtain the results that you’re hoping to obtain.

[1] You can assume that the N in each PUMA is public information.

[2] A Boolean predicate is a function that returns a 0 or a 1. An example of a count query might be the sum of bits for all college students.

[3] See the function update_parameters in /examples/wk5_centralized/psiExamples.r or psiExamples.ipynb.

[4] That is, a histogram representation counting the occurrences of each possible combination of the three binned variables.

[5] You will likely find that log(income) has a more linear relationship with your other two variables, so feel free to shift from income to log(income) if you prefer. However, you will need to decide how to treat zero values in income; one option is to clip the lower bound of income to some small positive value.

[6] To numerically compute the expectations, simply repeat your simulation many times and average.
