
Recap: PEPR 2021 — Data Deletion

Missed PEPR 2021 and want a recap of the Data Deletion talks? Read this.

Overview

This post is the fourth in a seven-post series recapping the PEPR 2021 conference. If you're wondering what PEPR is or want to see the other PEPR conference recaps, check out this post!

The three PEPR 2021 talks on Data Deletion are:

  1. Deletion Framework: How Facebook Upholds its Commitments Towards Data Deletion
  2. The Life-Changing Magic of Tidying Up Data
  3. “A Little Respect” – Erasure of Personal Data in Distributed Systems

Deletion Framework: How Facebook Upholds its Commitments Towards Data Deletion

In Deletion Framework: How Facebook Upholds its Commitments Towards Data Deletion, Benoît Reitz presents a bird's-eye view of the graph deletion framework used at Facebook. Benoît begins by presenting the overall system design, provides details on platform guarantees like scheduling and eventual completion, and concludes by sharing how the platform is monitored to help ensure correctness.

Note: There are a lot of visuals in this talk and I'd definitely recommend checking it out for a more in-depth explanation! There are some wonderful examples that I just wouldn't be able to do justice in this short conference recap.

Benoît describes a data definition language that is used to programmatically generate deletion logic based on product-specific schemas. Product teams can define where the data is stored, how to walk the deletion graph, and how to place deletion constraints on resources. Facebook's graph deletion uses a Depth-First Search and follows the general process below (sketched in code after the list):

  1. Read the Object, e.g., a User, Post, or Comment
  2. Write the Object to the "Persistent Stack"
  3. Write the associations to the "Restoration Log"
  4. Self-delete the Object
  5. Recursively delete all associated Objects
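
In rough Python pseudocode, that walk might look something like the sketch below. The store, persistent_stack, and restoration_log interfaces are my own stand-ins for illustration, not Facebook's actual framework APIs.

```python
# A sketch of the recursive walk described above. The store, persistent_stack,
# and restoration_log interfaces are hypothetical stand-ins, not Facebook's
# actual deletion framework APIs.

def delete_object(obj_id, store, persistent_stack, restoration_log):
    obj = store.read(obj_id)                      # 1. Read the Object
    if obj is None:
        return                                    # already deleted; keeps the walk idempotent

    persistent_stack.push(obj)                    # 2. Write the Object to the persistent stack
    associations = store.read_associations(obj_id)
    restoration_log.append(obj_id, associations)  # 3. Write the associations to the restoration log

    store.delete(obj_id)                          # 4. Self-delete the Object
    for child_id in associations:                 # 5. Recursively delete all associated Objects
        delete_object(child_id, store, persistent_stack, restoration_log)
```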

The four platform guarantees essential to the deletion framework are: scheduling, eventual completion, eventual completeness, and restoration.

Scheduling is essential for ephemerality features at Facebook like Stories and grace periods for user account deletion. The framework supports scheduling deletions years in the future and also allows product teams to specify custom Time to Live (TTL) logic, e.g., delete this post 9 days after the last comment.
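
As a toy illustration of what such a custom TTL rule could look like (the Post fields and scheduler here are hypothetical examples of mine, not part of Facebook's framework):

```python
from datetime import datetime, timedelta

# Hypothetical product-defined TTL rule: delete a post nine days after its
# last comment. The scheduler call below is an illustrative assumption.

def deletion_time(last_comment_at: datetime) -> datetime:
    return last_comment_at + timedelta(days=9)

# A scheduler could then enqueue the deletion for that future timestamp, e.g.:
# scheduler.schedule(delete_post, post_id, run_at=deletion_time(post.last_comment_at))
```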

Eventual completion ensures that every deletion that is started ultimately finishes. If a deletion is interrupted due to infrastructure failures or bugs, it is automatically retried when the error is transient, or escalated for manual review otherwise. Eventual completeness, on the other hand, ensures that orphaned data originating from bugs, race conditions, or misconfigurations is cleaned up retroactively. Finally, the framework supports restoration via the restoration log, ensuring that erroneously deleted objects can be restored to their proper state.
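
A very rough sketch of the retry-or-escalate idea behind eventual completion might look like this; the exception type and escalate() helper are illustrative assumptions on my part:

```python
import time

class TransientError(Exception):
    """Illustrative stand-in for a retryable infrastructure failure."""

def escalate(task, reason):
    # Placeholder for handing a deletion off to manual review.
    print(f"Escalating {task} for manual review: {reason}")

def run_to_completion(deletion_task, max_retries=5):
    """Retry transient failures; escalate anything else for manual review."""
    for attempt in range(max_retries):
        try:
            deletion_task()
            return True                      # deletion finished
        except TransientError:
            time.sleep(2 ** attempt)         # back off, then retry
        except Exception as exc:
            escalate(deletion_task, exc)     # non-transient bug: manual review
            return False
    escalate(deletion_task, "retry budget exhausted")
    return False
```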

All of these behaviors must also be monitored and auditable.

Benoît states that it's not sufficient to measure just the "happy path's" reliability; one should also measure how much falls through the gaps, as well as the effectiveness of any underlying safety nets. Facebook measures things like the success rate of deletions, how many deletions are rescheduled or not completed within a day, and how many retroactive remediations are needed to resolve orphaned data.

The Life-Changing Magic of Tidying Up Data

In The Life-Changing Magic of Tidying Up Data, Nandita Rao showcases four case studies that illustrate the benefits of applying data minimization throughout the data life cycle. Data growth and data sprawl have led to an increased risk of data breaches, and it's important to manage data through the stages of creation, storage, use, and deletion.

Each of Nandita's case studies focuses on one of the data lifecycle stages and yielded a few key takeaways:

  1. Data inventories created from surveys and spreadsheets are good starting points, but not sustainable. These should be automated as much as possible.
  2. Data governance serves as the foundation for privacy programs at scale.
  3. Privacy initiatives provide cross-functional benefits and require cross-functional buy-in.
  4. Data minimization should be used to guide strategic company decisions, not just for regulatory compliance.

For a detailed breakdown of each of the case studies, their problems, proposed solutions, and learnings, I really recommend you check out this talk!

Case Study 1 (Data Creation): The first case study focuses on a company that collects large volumes of data but would like to leverage privacy as a competitive advantage, comply with privacy regulations, and avoid ubiquitous profiling of users. Like many companies, they have a limited budget and no dedicated privacy resources.

To accomplish these goals, they created a catalog of privacy attributes to establish the initial scope of the company's privacy program. They also created a compliance mapping of all relevant industry-specific and international privacy requirements to better understand their legal obligations. To satisfy certain legal requirements, they developed models that prohibit the aggregation of certain attributes, avoiding user profiling. Their biggest remaining challenge is keeping the privacy catalog current, with periodic scans and privacy reviews for any data attributes that are added or changed.

Case Study 2 (Data Storage): This case study considers an automotive company with an established privacy program. The company wants to create a single source of truth for personal data, expand its data mappings, and automate GDPR data subject rights.

Through automated metadata and content scans, they were able to identify and classify personal data. This enabled them to create more robust data mappings that were tied to the data source itself. They also found that 20% of the personal data they were responsible for existed in unapproved locations like SharePoint and other drives. Once these inventories were created, they were able to concentrate their data protection efforts on key storage locations.

Case Study 3 (Data Use): The third case study highlights a company with ~50 distinct brands, each of which has hundreds of vendors. They wanted to gain visibility into these third-party relationships and the associated data flows, implement a consistent consent framework, and satisfy "Do Not Sell" requirements under CCPA.

To accomplish this, they monitored data in motion to validate what data was being shared with which third parties. They could then assess, at a granular, attribute level, whether these data flows adhered to their contractual agreements. They found occurrences of inappropriate data sharing in 5% of data flows, due to issues around terminated contracts, disabled accounts, the use of personal email accounts, or the sharing of more data than contractually approved. This automation allowed them to provide more robust records and monitoring to comply with regulations like CCPA and GDPR.

Case Study 4 (Data Deletion): Finally, the fourth case study considers a healthcare company experiencing substantial storage growth. They would like to reduce storage costs, limit HIPAA compliance risks, and mature their privacy program.

To identify what data should be deleted, they leveraged regular expressions, pattern matching, and context-based classifiers. They also created automated workflows to proactively delete data that had fallen out of compliance with data retention policies. This analysis allowed the company to save $2 million per year by deleting data that was redundant, outdated, abandoned, or otherwise unneeded. They were also able to leverage their newly created data inventory to improve the performance of out-of-the-box Data Loss Prevention rules.
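
To make the classification idea concrete, here's a minimal regex-based sketch; these patterns are simplified examples of mine, not the classifiers the company actually used:

```python
import re

# Simplified illustration of regex/pattern-based classification for finding
# data eligible for deletion. Real classifiers would be far more robust.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def classify(text: str) -> set[str]:
    """Return the categories of personal data detected in a text blob."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}

print(classify("Contact jane@example.com or 555-867-5309"))  # {'email', 'us_phone'}
```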

“A Little Respect” – Erasure of Personal Data in Distributed Systems

In “A Little Respect” – Erasure of Personal Data in Distributed Systems, Neville Samuell walks you through how you might approach data deletion in a mock distributed system.

Neville begins by introducing some key concepts, why you may want or need to delete personal data, and how you might approach these problems. The mock application includes various datastores, e.g., a SQL database, a data analytics warehouse, and a Redis cache. Each of these datastores has its own unique data deletion challenges.

When considering a SQL database, the first data deletion approach you might consider is deleting entire database rows associated with a particular user. However, if data has been copied and spread around, this may be ineffective. Deleting whole rows may also inadvertently create orphaned records that still contain personal data but lack direct references back to the original data subject, and it may destroy non-personal data required for other business purposes or regulatory needs, e.g., financial reporting.

Instead of deleting entire rows, you may want to consider a more granular approach. When done carefully, deleting specific fields related to a user may be more appropriate. After receiving a data deletion request, the system would identify all data related to an identifier, e.g., an email address, and then erase all personal data in the associated rows.

For this approach, the erasure method may differ based on the data type (strings, numbers, timestamps, etc.), the data category (e.g., emails or geographic locations), and application-specific constraints. The system might completely nullify fields like home addresses, cities, or zip codes, or pseudonymize identifiers like email addresses.
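
To contrast the two approaches, here's a toy sketch against an in-memory SQLite table; the schema, pseudonymization scheme, and erasure choices are my own illustrative assumptions, not Neville's implementation:

```python
import hashlib
import sqlite3

# Toy schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, home_address TEXT, city TEXT, plan TEXT)")
conn.execute("INSERT INTO users VALUES ('jane@example.com', '1 Main St', 'Springfield', 'pro')")

# Naive approach: delete the whole row. This risks orphaning related records
# and destroys non-personal data (like 'plan') needed for, e.g., reporting.
# conn.execute("DELETE FROM users WHERE email = ?", ("jane@example.com",))

# Field-level approach: nullify personal fields and pseudonymize the identifier,
# keeping the non-personal 'plan' column intact. A real system would use a
# salted or keyed pseudonymization scheme rather than a bare hash.
pseudonym = hashlib.sha256(b"jane@example.com").hexdigest()
conn.execute(
    "UPDATE users SET email = ?, home_address = NULL, city = NULL WHERE email = ?",
    (pseudonym, "jane@example.com"),
)
conn.commit()
print(conn.execute("SELECT * FROM users").fetchall())
```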

For a single SQL database, this seems reasonable, but how does this scale to a data analytics warehouse?

Data analytics warehouses often contain chains of related tables; they are denormalized, contain multiple duplicates of the same data, and are frequently partitioned by date rather than by user identifier. These attributes make deleting data in these types of datastores difficult, error-prone, and slow. The best way to delete data in data warehouses? Don't store it in the first place.

The approach for Redis is slightly different. Neville suggests that you may be able to leverage Redis's built-in cache expiration to achieve the desired deletion result. That is, if your cache expiration is set to less than 30 days (or your respective regulatory timeline), Redis can delete this data on a fixed timeline by itself.
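
Using the redis-py client, that might look something like the following sketch (the key names and 30-day window are just examples of mine):

```python
import redis  # requires the redis-py client and, for this snippet, a local Redis server

r = redis.Redis(host="localhost", port=6379)
THIRTY_DAYS = 30 * 24 * 60 * 60  # seconds

# 'ex' sets the key's time-to-live; once it elapses, Redis removes the key
# automatically, so cached personal data never outlives the deletion window.
r.set("session:user:1234", "cached-profile-data", ex=THIRTY_DAYS)
print(r.ttl("session:user:1234"))  # remaining lifetime in seconds
```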

Unfortunately, production systems are rarely this simple.

Production systems contain hundreds or thousands of datastores, and the deletion logic for each of these would need to be customized; we also haven't considered unstructured datastores. The best way to ensure effective data deletion practices is to design and build respectful systems rather than convenient systems.

While it may be convenient to capture diverse types of data in high volumes, it may not be respectful to your users. Instead, practice data minimization and be respectful of the personal data you are collecting. Avoid creating unnecessary copies. Consider maintaining metadata that outlines what categories of data are collected, why they are needed, where they are stored, and for how long.
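
As a minimal sketch of what such a metadata record could look like (the field names here are my own illustration, not a prescribed format):

```python
from dataclasses import dataclass

# One entry in a simple data inventory: what is collected, why, where it
# lives, and for how long it may be kept.
@dataclass
class DataInventoryEntry:
    category: str          # e.g., "email address"
    purpose: str           # why the data is needed
    storage_location: str  # which datastore holds it
    retention_days: int    # how long it may be kept

entry = DataInventoryEntry(
    category="email address",
    purpose="account login and transactional notifications",
    storage_location="users_db.accounts.email",
    retention_days=30,
)
```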

Wrapping Up

I hope these posts have piqued your interest in PEPR 2021 and future iterations of the conference. Don't forget to check out the other Conference Recaps for PEPR 2021 as well!

If you liked this post (or have ideas on how to improve it), I'd love to know as always. Cheers!