Recap: PEPR 2020 — Privacy-Preserving Data Analysis

Overview

This post is the second in a seven-post series recapping the PEPR 2020 conference. If you're wondering what PEPR is, or you want to see the other PEPR conference recaps, check out this post!

The three PEPR 2020 talks on Privacy-Preserving Data Analysis are:

  1. Building and Deploying a Privacy-Preserving Data Analysis Platform
  2. A Differentially Private Data Analytics API at Scale
  3. Improving Usability of Differential Privacy at Scale

Building and Deploying a Privacy-Preserving Data Analysis Platform

In Building and Deploying a Privacy-Preserving Data Analysis Platform, Frederick Jansen presents lessons learned from deploying Secure Multiparty Computation (MPC) to answer questions about wage disparity.
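To make the idea concrete, here is a minimal sketch of additive secret sharing, one common MPC building block. This is my own illustration of the general technique, not necessarily the protocol used in the wage disparity work, and the names and values are made up.

```python
import random

PRIME = 2**61 - 1  # arithmetic is done modulo a large prime

def share(secret, num_parties):
    """Split a secret into additive shares that sum to the secret mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(num_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine shares to recover the (aggregate) value."""
    return sum(shares) % PRIME

# Three employers each secret-share their total payroll; the aggregator
# only ever sees sums of shares, never any individual employer's value.
payrolls = [1_200_000, 850_000, 2_300_000]
all_shares = [share(p, num_parties=3) for p in payrolls]
column_sums = [sum(col) % PRIME for col in zip(*all_shares)]
print(reconstruct(column_sums))  # 4350000, the aggregate payroll
```

The point of the sketch is that each employer's number looks like random noise on its own; only the aggregate ever becomes visible.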

Frederick suggests that one of the biggest barriers to implementing MPC is communicating the technical capabilities to stakeholders and developing a sense of trust. There was uncertainty about whether organizations could participate at all, due to various legislation and contractual obligations. The speaker shared a few analogies, which worked to varying degrees, for explaining MPC at an approachable level to less technical stakeholders.

While existing business-to-business drop-in infrastructure for MPC exists, it was neither accessible nor scalable for a large number of non-technical businesses with limited support. Data validation and correction proved to be a difficult problem, which the speaker partly attributed to poor planning and bad assumptions. There are also several open issues in quantifying privacy risk in MPC, e.g., how many participants are needed and which algorithms are appropriate?

A Differentially Private Data Analytics API at Scale

In A Differentially Private Data Analytics API at Scale, Ryan Rogers presents LinkedIn's new differentially private analytics platform built to support external marketing partners. To start, Ryan provides a brief introduction to differential privacy (DP) and its main models, i.e., local and global DP.
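As a quick illustration of the global model, here is the classic Laplace mechanism applied to a simple count query. This is my own sketch of textbook DP, not code from the talk.

```python
import numpy as np

def dp_count(values, epsilon):
    """Release a count with epsilon-DP via the Laplace mechanism.

    A count query has sensitivity 1 (adding or removing one user changes
    the count by at most 1), so the noise scale is 1 / epsilon.
    """
    true_count = len(values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: noisy count of members who viewed an article, epsilon = 0.5
views = ["member_a", "member_b", "member_c"]
print(dp_count(views, epsilon=0.5))
```

Smaller epsilon means more noise and stronger privacy; larger epsilon means less noise and weaker privacy.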

LinkedIn's Audience Engagement API provides differentially private results for top-k queries. They built their solution on top of existing top-k solvers to answer questions like "what are the top-10 articles viewed by all data scientists?" They apply DP to the aggregated results, e.g., histograms, and release them to marketers (a simplified sketch of the noisy top-k idea follows the list below). Rogers mentions two problems that need to be addressed:

  1. How much can a single user affect the outcome of these queries?
  2. How many queries can the marketer ask?
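Here is a deliberately simplified sketch of a noisy top-k release: add noise scaled to the per-user contribution bound, then keep the k largest noisy counts. The function name, the plain Laplace noise, and the example data are my own simplifications; the mechanisms described in the talk are more refined.

```python
import numpy as np

def noisy_top_k(counts, k, epsilon, max_items_per_user):
    """Return the k items with the largest noisy counts.

    Each user may contribute to at most `max_items_per_user` buckets, so
    the noise is scaled to that bound (a simplification of the talk's
    actual mechanisms).
    """
    scale = max_items_per_user / epsilon
    noisy = {item: c + np.random.laplace(0.0, scale) for item, c in counts.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

article_views = {"article_a": 120, "article_b": 95, "article_c": 40, "article_d": 12}
print(noisy_top_k(article_views, k=2, epsilon=1.0, max_items_per_user=3))
```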

To address the first question, Ryan uses the concept of sensitivity: a measure of how much a single user's data can change the result of a query. Whether a user may contribute to one or to many buckets for a particular query affects the sensitivity, and different algorithms are used for each case. To address the second, they implemented a Privacy Budget Management system that determines whether a given query may proceed based on 1) the cost of the query and 2) the remaining privacy budget for that marketer.
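A toy sketch of what that kind of budget accounting might look like is below; the class and method names are my own, and LinkedIn's actual system is certainly more sophisticated.

```python
class PrivacyBudgetManager:
    """Toy privacy budget ledger (an illustrative sketch, not LinkedIn's system).

    Tracks epsilon spent per caller and rejects queries whose cost would
    exceed the remaining budget.
    """

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = {}  # caller id -> epsilon spent so far

    def try_spend(self, caller: str, query_cost: float) -> bool:
        remaining = self.total_epsilon - self.spent.get(caller, 0.0)
        if query_cost > remaining:
            return False  # refuse the query; budget exhausted
        self.spent[caller] = self.spent.get(caller, 0.0) + query_cost
        return True

budget = PrivacyBudgetManager(total_epsilon=2.0)
print(budget.try_spend("marketer_42", 0.5))  # True: budget remains
print(budget.try_spend("marketer_42", 1.8))  # False: would exceed budget
```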

Improving Usability of Differential Privacy at Scale

In Improving Usability of Differential Privacy at Scale, Miguel Guevara and Milinda Perera continue the conversation on Differential Privacy (DP). DP requires several parameters, including epsilon, delta, and clamping bounds, which are difficult to choose and must be determined case by case by end users; this difficulty is the primary motivator of the work.

Miguel and Milinda present a self-service dashboard that lets users experiment with various DP parameters and see how they affect the utility and privacy trade-offs of the results. The dashboard lets you specify parameters like epsilon and delta, pick the noise type, e.g., Laplace vs. Gaussian, and choose between dataset-level and partition-level privacy. Users can also specify the sensitivity, or max contributions per partition, along with non-DP parameters like filtering the results to a specific range.
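To give a feel for how those parameters interact, here is a rough sketch of how epsilon, delta, and sensitivity translate into noise magnitude under the standard Laplace and (classic analytic) Gaussian mechanisms. The parameter names are mine, and this is textbook DP rather than the tool's internals.

```python
import math

def laplace_stddev(epsilon: float, l1_sensitivity: float) -> float:
    # Laplace noise has scale b = sensitivity / epsilon and stddev b * sqrt(2).
    return (l1_sensitivity / epsilon) * math.sqrt(2)

def gaussian_stddev(epsilon: float, delta: float, l2_sensitivity: float) -> float:
    # Classic analytic bound for the Gaussian mechanism (valid for epsilon < 1):
    # sigma >= sqrt(2 * ln(1.25 / delta)) * sensitivity / epsilon.
    return math.sqrt(2 * math.log(1.25 / delta)) * l2_sensitivity / epsilon

print(laplace_stddev(epsilon=1.0, l1_sensitivity=1.0))               # ~1.41
print(gaussian_stddev(epsilon=0.5, delta=1e-5, l2_sensitivity=1.0))  # ~9.7
```

Tightening epsilon or delta, or raising the sensitivity, increases the noise, which is exactly the utility/privacy trade-off the dashboard lets users explore.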

In addition to the above, users are shown Anonymization Stats like the threshold (partitions with contributions under this value will be dropped), the noise standard deviation, how many anonymized partitions are retained, and how accurate the anonymized result is compared to the raw result. Various histograms let users visualize the effects of parameter changes, e.g., which distributions were dropped, how the noise is distributed across partitions, and how many partitions have been thresholded.
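As a rough sketch of where stats like these come from, the snippet below adds noise to each partition's count, drops partitions whose noisy count falls below the threshold, and reports a few simple numbers. The function, data, and parameter values are illustrative and not taken from the talk.

```python
import numpy as np

def threshold_partitions(partition_counts, epsilon, threshold):
    """Add Laplace noise to each partition count, drop partitions whose
    noisy count falls below the threshold, and return the retained noisy
    counts plus a couple of simple anonymization stats."""
    noisy = {p: c + np.random.laplace(0.0, 1.0 / epsilon)
             for p, c in partition_counts.items()}
    kept = {p: c for p, c in noisy.items() if c >= threshold}
    stats = {
        "noise_stddev": (1.0 / epsilon) * np.sqrt(2),  # stddev of Laplace(1/eps)
        "partitions_retained": len(kept),
        "partitions_dropped": len(noisy) - len(kept),
    }
    return kept, stats

counts = {"page_a": 150, "page_b": 7, "page_c": 42}
kept, stats = threshold_partitions(counts, epsilon=1.0, threshold=20)
print(kept, stats)
```

Small partitions (like "page_b" above) are the ones most likely to be thresholded away, which is why the dashboard surfaces how many partitions survive.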

Wrapping Up

I hope these posts have piqued your interest in PEPR 2020 and future iterations of the conference. Don't forget to check out the other Conference Recaps for PEPR 2020 as well!

If you liked this post (or have ideas on how to improve it), I'd love to know as always. Cheers!