Private Data Release via Learning Thresholds

Private Data Release via Learning Thresholds
M. Hardt and G. Rothblum and R. Servedio.
In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2012, pp. 168-187.

Abstract:

This work considers {\em computationally efficient} privacy-preserving data release. We study the task of analyzing a database containing sensitive information about individual participants. Given a set of statistical queries on the data, we want to release approximate answers to the queries while also guaranteeing {\em differential privacy}---protecting each participant's sensitive data.

Our focus is on computationally efficient data release algorithms; we seek algorithms whose running time is polynomial, or at least sub-exponential, in the data dimensionality. Our primary contribution is a {\em computationally efficient} reduction from differentially private data release for a class of counting queries, to learning thresholded sums of predicates from a related class.

We instantiate this general reduction with a variety of algorithms for learning thresholds. These instantiations yield several new results for differentially private data release. As two examples, taking $\zo^d$ to be the data domain (of dimension $d$), we obtain differentially private algorithms for:

Releasing all $k$-way conjunction counting queries (or $k$-way contingency tables). For any given~$k$, the resulting data release algorithm has bounded error as long as the database is of size at least $d^{O\big(\sqrt{k\log(k\log d)}\big)}$ (ignoring the dependence on other parameters). The running time is polynomial in the database size. The best sub-exponential time algorithms known prior to our work required a database of size $\tilde{O}(d^{k/2})$ [Dwork McSherry Nissim and Smith 2006].

Releasing a $(1 - \gamma)$-fraction of all $2^d$ parity counting queries. For any $\gamma \geq \poly(1/d)$, the algorithm has bounded error as long as the database is of size at least $\poly(d)$ (again ignoring the dependence on other parameters). The running time is polynomial in the database size.

Several other instantiations yield further results for privacy-preserving data release. Of the two results highlighted above, the first learning algorithm uses techniques for representing thresholded sums of predicates as low-degree polynomial threshold functions. The second learning algorithm is based on Jackson's Harmonic Sieve algorithm [Jackson 1997]. It utilizes Fourier analysis of the database viewed as a function mapping queries to answers.

Link to full version on ArXiV

Back to main papers page