Recap: PEPR 2021 — Architectures For Privacy
Overview
This is the sixth post in a seven-part series recapping the PEPR 2021 conference. If you're wondering what PEPR is, or want to see the other PEPR conference recaps, check out this post!
The three PEPR 2021 talks on Architectures for Privacy are:
- No Threat, No Sweat: Privacy Threat Modeling in Practice – a Machine Learning Use Case
- Lightweight Purpose Justification Service for Embedded Accountability
- Deleting the Undeletable
No Threat, No Sweat: Privacy Threat Modeling in Practice – a Machine Learning Use Case
In No Threat, No Sweat: Privacy Threat Modeling in Practice – a Machine Learning Use Case, Kim Wuyts and Isabel Barberá discuss privacy threat modeling with the LINDDUN framework and an extension for considering privacy risks in machine learning use cases.
LINDDUN is a privacy threat modeling methodology and a mnemonic for the threats it helps you identify: Linkability, Identifiability, Non-Repudiation, Detectability, Disclosure of Information, Unawareness, and Non-Compliance. Its goal is to help you systematically elicit and manage privacy-related threats. To do so, it begins with a model of the system (e.g., a dataflow diagram) and elicits threats by mapping each system element to the threat categories that apply to it. The process concludes by assessing, prioritizing, and mitigating these threats.
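To make the element-to-threat mapping step a bit more concrete, here's a rough sketch of my own (not from the talk, and far less nuanced than LINDDUN's official mapping tables) of how dataflow diagram elements might be paired with the threat categories to assess:

```python
# Hypothetical sketch: pair dataflow-diagram (DFD) elements with the LINDDUN
# threat categories to consider for each. The mapping below is illustrative
# only; the real LINDDUN mapping tables are more detailed.
system_elements = {
    "shopper (external entity)": [
        "Linkability", "Identifiability", "Unawareness"],
    "analytics service (process)": [
        "Linkability", "Identifiability", "Non-Repudiation", "Detectability",
        "Disclosure of Information", "Non-Compliance"],
    "events database (data store)": [
        "Linkability", "Identifiability", "Non-Repudiation", "Detectability",
        "Disclosure of Information", "Non-Compliance"],
}

def elicit_threats(elements):
    """Produce a worklist of (element, threat category) pairs to assess."""
    return [(element, category)
            for element, categories in elements.items()
            for category in categories]

# Each pair becomes a candidate threat to assess, prioritize, and mitigate.
for element, category in elicit_threats(system_elements):
    print(f"Assess: {category} at {element}")
```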
To make LINDDUN more practical, lightweight, and hands-on, Kim introduces LINDDUN GO. While LINDDUN defines a general framework and approach, LINDDUN GO is a set of physical cards meant to guide and inspire a cross-disciplinary privacy threat modeling exercise. Each LINDDUN GO card includes information about how the threat originates, its impact or consequences, questions to check whether the card applies to your system, and the system elements where the threat typically occurs.
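As a rough illustration of what a card captures (my own sketch, not an official LINDDUN GO artifact), you could imagine each card as a small record like this:

```python
from dataclasses import dataclass, field

@dataclass
class LinddunGoCard:
    """Hypothetical representation of a LINDDUN GO threat card."""
    threat: str                       # short name of the threat
    origination: str                  # where or how the threat originates
    consequences: str                 # impact if the threat is realized
    applicability_questions: list[str] = field(default_factory=list)
    affected_elements: list[str] = field(default_factory=list)

card = LinddunGoCard(
    threat="Linkability of user actions",
    origination="The same identifier is reused across data flows",
    consequences="Separate actions can be tied back to one individual",
    applicability_questions=["Do we reuse identifiers across services?"],
    affected_elements=["data flow", "data store"],
)
```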
For a more nuanced explanation of LINDDUN GO and what it provides, check out the talk!
In the second half of the talk, Isabel shares her experiences of applying LINDDUN to machine learning (ML) use cases. Isabel quickly identified that many of the threat categories described in LINDDUN did not apply to ML and that some critical aspects, like data quality and bias, were missing. To compensate for this, Isabel introduces four new threat categories to LINDDUN: Technical ML, Ethics, Accessibility, and Security ML. Isabel also considered privacy threats at all stages in the development lifecycle of ML models (i.e., design, input, modeling, and output).
After defining these extensions, Isabel describes her experience of applying LINDDUN in practice to showcase what worked well and what didn't. Based on this, Isabel identified a few open process-related questions:
- Should they add impact and likelihood to their risk determinations?
- How should the team prioritize risks?
- Should risk owners be specified on a per-threat or per-sprint basis?
- How should threats, statuses, and owners be recorded and tracked?
In terms of lessons learned, Isabel shared that having a knowledgeable facilitator for LINDDUN is critical. To address this, Isabel is training a set of data scientists and architects to serve as privacy champions. She also suggests that LINDDUN GO sessions should be limited to a maximum of two hours and that facilitators should ensure all participants have opportunities to contribute—especially in online formats.
Overall, LINDDUN GO helped Isabel improve the overall process and maturity of eliciting, documenting, and mitigating privacy risks. LINDDUN GO helps prevent endless discussions between stakeholders and brings focus to the discussion and the subsequent decision-making process. Ultimately, this process improves collaboration, reduces rework, and produces actionable artifacts for Data Protection Impact Assessments.
Lightweight Purpose Justification Service for Embedded Accountability
In Lightweight Purpose Justification Service for Embedded Accountability, Arnav Jagasia and Yeong Wee discuss how they implemented an accountability feature to help satisfy purpose limitation and purpose justification for data platforms.
Purpose limitation is the idea that data should be collected for specific, explicit, and legitimate purposes only—data should not be processed in a way that is incompatible with those purposes. Ensuring effective adherence to purpose limitation on data platforms is difficult because:
- Data platforms are designed to be flexible and support diverse use cases.
- Access controls are generally static, set at a point-in-time, and do not dynamically update as use cases evolve.
- Legal reviews and privacy impact assessments tend to be one-offs and are disconnected from everyday use of the data platform.
- Analyzing audit logs can be expensive and organizations may lack the budget or expertise to perform these audits.
Each of the processes described above is a crucial component of an effective governance program for data platforms, but each has its limitations.
Arnav and Yeong suggest that data platforms lack robust tooling that reminds users of purpose limitation principles—data protection policies and guidelines are not integrated into the tooling. As a result, users are more likely to use data in a way that is incompatible with the original purposes of the processing.
In other words, there is a gap between the data governance function and the platform user function. This makes it difficult to determine whether purpose limitation principles are upheld on a day-to-day basis across diverse data, users, and teams.
To bridge this gap, Arnav and Yeong showcase a generalizable approach that allows users to justify their actions in context while giving data governance teams additional insight into whether users are adhering to data privacy principles. It leverages user checkpointing that prompts users to specify the purpose for their action. This checkpoint does not necessarily serve as a gating function, but it does allow data governance users to review what platform users are doing, why they are doing it, and the context under which those actions were performed.
This approach is proactive and iterative, allowing governance users to continually improve accountability through ongoing, mutually reinforcing relationships and feedback.
Data platforms present users with a unified platform of standalone, but interoperable, back-end services like those found in microservice architectures. In a microservice-based architecture, each microservice generally exposes its own API. A data governance user may choose to label a few of these endpoints as sensitive and ask data platform users to submit a purpose justification before accessing or processing the underlying data. Each of these endpoints must in turn report justification details in a common format so they can be stored centrally.
Palantir's Purpose Justification Framework is a lightweight service deployed alongside other microservices. Its purpose is to configure purpose justification checkpoints, store user justifications, and present these justifications for review. The framework has the following goals:
- Governance Users can configure justification checkpoints for sensitive actions.
- Platform Users performing sensitive actions see the configured checkpoint and associated language and then submit a user justification.
- Governance Users can review sensitive actions, along with the justification and context around those actions.
Additionally, the integration should be a low lift for developers and should not slow down development velocity. To accomplish this, Palantir provides a shared front-end library to be used by all microservices—the framework can be easily added to any existing API endpoint.
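As a sketch of what a low-lift integration could look like (hypothetical code, not Palantir's actual API), a sensitive endpoint might be wrapped so that it requires a justification and records it in a common format for governance review. Unlike the real checkpoint described above, this toy version does gate the action until a justification is supplied:

```python
import datetime
import functools

# Hypothetical central store of justification records; in practice this would
# be the purpose justification service, not an in-memory list.
JUSTIFICATION_LOG = []

# Checkpoints configured by governance users: endpoint name -> prompt text.
CHECKPOINTS = {
    "export_customer_data": "Explain why you need to export this dataset.",
}

class MissingJustification(Exception):
    pass

def purpose_checkpoint(endpoint_name):
    """Mark an endpoint as sensitive: require and record a purpose justification."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, user, justification=None, **kwargs):
            if endpoint_name in CHECKPOINTS:
                if not justification:
                    # Surface the configured prompt to the platform user.
                    raise MissingJustification(CHECKPOINTS[endpoint_name])
                # Record the justification in a common format for later review.
                JUSTIFICATION_LOG.append({
                    "endpoint": endpoint_name,
                    "user": user,
                    "justification": justification,
                    "timestamp": datetime.datetime.utcnow().isoformat(),
                })
            return handler(*args, user=user, **kwargs)
        return wrapper
    return decorator

@purpose_checkpoint("export_customer_data")
def export_customer_data(dataset_id, user):
    return f"exported {dataset_id} for {user}"

# A platform user performing the sensitive action supplies a justification,
# which governance users can later review alongside the action's context.
export_customer_data("orders_2021", user="alice",
                     justification="Quarterly fraud investigation, ticket FR-123")
```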
Providing a centralized and extensive list of purpose justifications raises interesting employee and workplace privacy challenges. Governance users should only be able to see the data in the governance platform if they would otherwise be able to access the underlying data exposed by a microservice. Arnav and Yeong suggest that data minimization and granular access controls can reduce the risk of governance users inadvertently reviewing purpose justifications and data they are not permitted to view.
In terms of lessons learned, governance users were able to gain greater awareness of and context into user activity on data platforms. They were also able to identify gaps in accountability procedures and improve processes and training. Platform users were periodically reminded of organizational policies and were given clear escalation paths if they were unsure whether their actions adhered to organizational policy. Finally, developers received a lightweight approach that was easy to integrate and maintain, with low code complexity.
Having all these users engage in the same governance workflows in a consistent language via an embedded framework provides a governance process that interoperates and scales.
Deleting the Undeletable
In Deleting the Undeletable, Behrooz Shafiee shares how Shopify transformed its data analytics platform to facilitate data deletion requests. In addition to the PEPR talk, there's an in-depth blog post discussing this topic as well!
Shopify collects and processes countless analytical events on a day-to-day basis. These events are published via Kafka from a variety of clients and then persisted to a data warehouse where they can be used by data analysts. However, these events frequently contained personal information (PI), and deleting all the events about a given user quickly became difficult. Behrooz begins the talk by highlighting a number of privacy problems introduced by the previous design of Shopify's data warehouse, such as:
- The warehouse was immutable by design (deleting/modifying data is hard)
- Events had no guaranteed structure or schema (what is PI or not?)
- Events lacked privacy context (what data belonged to what user?)
- Modifications or deletions to tables caused cascading failures
- Privacy solutions must scale to high volumes of data and data analysts
The missing privacy context and undefined schemas caused the majority of problems for Shopify—to address this, they introduced a schema system. Whenever a developer or analyst wants to create a new Kafka event, they must define a schema. The schema includes (1) boilerplate information (e.g., the name, description, and version of the schema), (2) a privacy context, which specifies the data controller and data subject, and (3) the type of PI contained within the event along with privacy handlers that specify what to do with that PI (e.g., tokenize).
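The recap doesn't show the exact schema format, but a hypothetical event schema along these lines might look as follows (the structure and field names are my own, not Shopify's):

```python
# Hypothetical schema definition for a new Kafka event, mirroring the three
# parts described above. Illustrative only.
checkout_completed_schema = {
    # (1) Boilerplate information
    "name": "checkout_completed",
    "description": "Emitted when a shopper completes a checkout.",
    "version": 3,

    # (2) Privacy context: who controls the data and who it is about
    "privacy_context": {
        "data_controller": "merchant",
        "data_subject": "shopper",
    },

    # (3) Fields, the type of PI they contain, and how to handle that PI
    "fields": {
        "order_id": {"pi_type": None, "handler": None},
        "shopper_email": {"pi_type": "email", "handler": "tokenize"},
        "shipping_city": {"pi_type": "location", "handler": "generalize"},
        "card_last_four": {"pi_type": "payment", "handler": "redact"},
    },
}
```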
Shopify decided not to permit PI in its data warehouse. To achieve this, Shopify relies on two pseudonymization techniques: obfuscation (masking, redacting, and generalizing data) and tokenization. When data is tokenized, values like email addresses are replaced with a consistent random token, and the mapping between the original value and the token is stored in a separate service. Once incoming data has been pseudonymized, Shopify's data warehouse is left with three categories of data: non-personal data, obfuscated data, and tokenized data.
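Here's a minimal sketch of the tokenization idea (my own simplification, not Shopify's implementation): the PI value is swapped for a consistent random token, and the mapping lives only in a separate service:

```python
import secrets

class TokenVault:
    """Toy stand-in for the separate service that stores PI-to-token mappings."""

    def __init__(self):
        self._token_by_value = {}   # e.g. "alice@example.com" -> "tok_ab12..."
        self._value_by_token = {}

    def tokenize(self, value: str) -> str:
        # Return the existing token so the same value always maps to the same
        # token (consistent); otherwise mint a new random one.
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

vault = TokenVault()
event = {"shopper_email": "alice@example.com", "order_total": 42}
event["shopper_email"] = vault.tokenize(event["shopper_email"])
# The warehouse only ever sees {"shopper_email": "tok_...", "order_total": 42}.
```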
If Shopify receives a data deletion request for the first two categories (non-personal or obfuscated data), there's not much to do—there's no personal data left. However, tokenized data is considered identifiable as long as the mapping between the PI and the token exists. The remainder of the talk details how deletion of tokenized data can be effectively achieved for different data controllers and data subjects.
Shopify is primarily concerned with two use cases: (1) a user or shopper who wants to delete their data from a specific vendor or all vendors, and (2) a vendor who wants to leave Shopify and delete all of their customer data.
To achieve this, Shopify generates a new token for each identifier (e.g., a customer email) on a per-vendor basis. If Alice buys a product from Allbirds and Gymshark, a consistent random identifier will be created for each vendor (e.g., Token123 and Token456, respectively). However, if Alice buys a second product from Allbirds, it will re-use the Token123 identifier—this ensures that data analysts can attribute purchases at the same vendor to the same customer without divulging the underlying PI.
So how does this help with deletion requests?
If Alice wants Gymshark to delete her data, the mapping between her email and Token456 can be deleted. Since this token does not rely on Alice's PI (e.g., her email address), it's unlikely the tokenized data could be re-attributed to her. This easily extends to deleting Alice's data for all vendors by specifying a wildcard for the data controller. The same principle works for merchants wishing to leave the platform.
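Extending the sketch above, keying the token mapping by (vendor, identifier) makes both kinds of deletion a matter of dropping entries from that separate service; again, this is a hypothetical simplification rather than Shopify's actual design:

```python
import secrets

class PerVendorTokenVault:
    """Toy vault keyed by (vendor, identifier) pairs."""

    def __init__(self):
        self._tokens = {}  # (vendor, identifier) -> token

    def tokenize(self, vendor: str, identifier: str) -> str:
        key = (vendor, identifier)
        if key not in self._tokens:
            self._tokens[key] = "tok_" + secrets.token_hex(8)
        return self._tokens[key]

    def delete_subject(self, identifier: str, vendor: str | None = None) -> None:
        """Drop mappings for one vendor, or for all vendors when vendor is None."""
        self._tokens = {
            (v, i): tok for (v, i), tok in self._tokens.items()
            if not (i == identifier and (vendor is None or v == vendor))
        }

vault = PerVendorTokenVault()
vault.tokenize("allbirds", "alice@example.com")   # e.g. Token123
vault.tokenize("gymshark", "alice@example.com")   # e.g. Token456
vault.tokenize("allbirds", "alice@example.com")   # reuses Token123

# Alice asks Gymshark to delete her data: only the Gymshark mapping is removed.
vault.delete_subject("alice@example.com", vendor="gymshark")

# Alice asks for deletion everywhere: wildcard across data controllers.
vault.delete_subject("alice@example.com")
```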
It's important to note that at no point did Shopify need to search or perform any lookups in the data warehouse. Instead, all of these deletions occurred within the service that maintains mappings between tokens and PI.
To wrap things up, Behrooz shares 10 critical lessons learned:
- Presenting solutions vs. achieving adoption are different problems
- Make the right thing to do the default option
- Prove that your solution is scalable and accurate in a language/context that is understandable by your end-users
- Tooling should bring value in its own right vs. introducing new barriers
- Organizational alignment and consistent messaging are crucial
- A dedicated taskforce helps the project survive multi-year efforts
- Spend time on tooling, documentation, and support to ease onboarding pain
- Unstructured data is evil and causes usability and maintainability issues
- Pseudonymization helps track and handle personal data in large datasets
- Technical challenges are not the same as organizational challenges
Wrapping Up
I hope these posts have piqued your interest in PEPR 2021 and future iterations of the conference. Don't forget to check out the other Conference Recaps for PEPR 2021 as well!
If you liked this post (or have ideas on how to improve it), I'd love to know as always. Cheers!