Recap: PEPR 2021 — Healthcare Privacy

Overview

This is the fifth post in a seven-post series recapping the PEPR 2021 conference. If you're wondering what PEPR is or want to see the other PEPR conference recaps, check out this post!

The three PEPR 2021 talks on Healthcare Privacy are:

  1. Towards More Informed Consent for Healthcare Data Sharing
  2. Building a Scalable, Machine Learning-Driven Anonymization Pipeline for Clinical Trial Transparency
  3. Building for Digital Health

In Towards More Informed Consent for Healthcare Data Sharing, Sarah Pearman, Ellie Young, and Lorrie Cranor assess the usability and understandability of a healthcare chatbot that leverages cloud services. People may want to use a chatbot for a variety of reasons, but in a healthcare context, there are special requirements and challenges when designing and building one.

If someone hurts their ankle hiking and decides they need an X-ray but are worried about the cost, they may turn to a healthcare chatbot for answers. In the scenario explored by Sarah et al., this particular chatbot utilizes Google Cloud services for data storage and natural language processing. Due to requirements under the Health Insurance Portability and Accountability Act (HIPAA), the chatbot must provide notice and collect the user's informed consent before processing their request.

The initial chatbot version presents a standalone HIPAA authorization document that must be read and accepted. However, ensuring that the user is informed and provides valid consent is hard.

When asking for the user's consent, we're interrupting their primary task of obtaining a cost estimate for an X-ray of their ankle. Before they can obtain this estimate, they must grok an in-depth legal document that is time-consuming, hard to read, and likely inconvenient to them. These factors and others make it difficult to collect informed user consent.

To address these privacy design challenges, Sarah et al. conduct a two-phase study to improve the usability and understandability of this healthcare chatbot. In Phase 1, a small sample of remote user studies was conducted to evaluate prototype versions. Phase 2 leveraged a large-scale, crowdsourced survey involving thousands of respondents. Sarah et al. hope to answer questions like:

  • What and where are the usability barriers for the healthcare chatbot?
  • Do users understand that their data will be processed by Google Cloud?
  • Do people understand how data protections may change after being shared?
  • How do people feel about the HIPAA consent flow?

To answer these questions in Phase 1, Sarah et al. iteratively created four different prototypes of the healthcare chatbot. Each of the prototypes adds, removes, or modifies several features to improve the overall usability and understandability for users. Some of the design considerations include keeping the user in a single modal view vs. opening additional tabs, creating more human-readable summaries of legal language, and removing mentions of HIPAA entirely to avoid giving a false sense of security.

For a more thorough breakdown of the prototypes and their differences, I really recommend you watch the talk! In Phase 2, thousands of survey respondents were presented with one version of the healthcare chatbot and then asked a variety of questions.

Before reviewing one of the prototypes, the majority of respondents (68%) incorrectly thought that HIPAA would prevent companies from sharing their health information. After viewing one of the more advanced prototypes, participants were able to understand the protections that HIPAA provided, and perhaps more importantly, the protections it doesn't provide.

Participants were initially comfortable with the idea of using a healthcare chatbot. However, after reviewing a prototype that revised the notice and consent language, participants were less confident in the privacy protections afforded to them. In general, people tend to drastically overestimate or underestimate the risks of sharing their healthcare data with a third party like Google Cloud.

Simplifying the consent flow and legal language improved usability and users' understanding of key privacy concepts. However, this only happens if participants actually read the disclosure and authorization forms.

The majority of people do not have an accurate understanding of what it means for data to be subject to HIPAA or what data protection practices they can expect. In fact, Sarah et al. recommend avoiding the use of the HIPAA acronym when possible, as it may give users a false sense of security.

Building a Scalable, Machine Learning-Driven Anonymization Pipeline for Clinical Trial Transparency

In Building a Scalable, Machine Learning-Driven Anonymization Pipeline for Clinical Trial Transparency, David Di Valentino and Muqun (Rachel) Li share the challenges they encountered when anonymizing unstructured clinical research documents using artificial intelligence and machine learning.

Clinical research data is sizeable and frequently includes 100,000+ pages of information on thousands of study participants. In addition to the size of the data, the unstructured nature of medical data introduces unique complexities for anonymization strategies.

Clinical trial transparency defines requirements around registering medical trials and providing access to anonymized, patient-level data to be used in subsequent analyses. Specific requirements for anonymizing medical data may differ from law to law. However, generally speaking, personal data becomes anonymized data when it's transformed and rendered non-identifiable. In the healthcare context, direct identifiers like names and patient numbers, and indirect identifiers like age, vitals, medical history, and other details must be altered.

One challenge when anonymizing clinical research data is that it is long-tailed. That is, there is a large set of circumstances and diagnoses unique to small subsets of individuals. This data is highly specific and identifiable, and therefore requires transformation.

David and Rachel share a few data transformation strategies, including generalization, suppression, and date-shifting algorithms that modify dates in healthcare records. These transformations should preserve data utility and quality while protecting privacy.
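To make those strategies concrete, here's a minimal sketch of what generalization, suppression, and date-shifting could look like for a single record. The field names, bucket sizes, and offset range are illustrative assumptions, not details of David and Rachel's actual pipeline.

```python
from datetime import date, timedelta
import random

def generalize_age(age: int, bucket: int = 10) -> str:
    """Generalize an exact age into a coarser range (e.g., 47 -> '40-49')."""
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket - 1}"

def suppress(_value: str) -> str:
    """Suppress a direct identifier entirely."""
    return "[REDACTED]"

def shift_date(d: date, offset_days: int) -> date:
    """Shift a date by a per-patient offset so intervals between visits are preserved."""
    return d + timedelta(days=offset_days)

# Illustrative record; a real pipeline would process thousands of lengthy documents.
record = {"name": "Jane Doe", "age": 47, "visit_date": date(2020, 3, 15)}
offset = random.randint(-180, 180)  # one consistent offset per patient

anonymized = {
    "name": suppress(record["name"]),
    "age": generalize_age(record["age"]),
    "visit_date": shift_date(record["visit_date"], offset),
}
print(anonymized)
```

Using a single offset per patient is what keeps shifted dates useful: the relative timeline within a record stays intact even though the absolute dates change.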

David and Rachel chose to leverage artificial intelligence and machine learning to satisfy the above requirements and hope to create tools that are usable by non-experts who lack a background in machine learning or data science.

They propose using Named Entity Recognition (NER) to identify and anonymize both direct and indirect identifiers. David and Rachel started by utilizing off-the-shelf NER models but quickly discovered that these performed poorly when applied to healthcare data, correctly classifying only 70-80% of attributes. To address these shortcomings, they turned to transfer learning, which allowed them to jump-start and fine-tune their models using domain-specific data. After retraining the baseline NER models, they were able to identify 97% of attributes correctly.
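To illustrate the general idea (this is not the authors' model or pipeline), a generic, off-the-shelf NER model from an open-source library can be run over free text to flag candidate identifiers for redaction. The model name, entity labels, and redaction logic below are assumptions made for this sketch.

```python
from transformers import pipeline

# A generic, publicly available NER model. The talk's models were further
# fine-tuned on domain-specific clinical text, which this sketch does not do.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Patient John Smith (ID 48291) was seen at Springfield General on 2020-03-15."

def redact_entities(text: str, entities) -> str:
    """Replace each detected entity span with a placeholder tag like [PER]."""
    out, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        out.append(text[cursor:ent["start"]])
        out.append(f"[{ent['entity_group']}]")
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)

print(redact_entities(text, ner(text)))
```

A generic model like this tends to catch names and organizations but miss domain-specific identifiers such as patient IDs or dates of service, which helps explain why an off-the-shelf baseline falls short and why fine-tuning on clinical text matters.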

A high accuracy model is important in this context as each miss by the model must be hand-corrected by an individual.

One important question that David and Rachel raise is "what quantifiable guarantees do we have that we are anonymizing data enough?" Their approach requires that an individual's data be statistically similar to that of at least 10 other study participants. For a detailed look into how they generate confidence intervals to provide upper bounds on re-identification risks, check out the talk!
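That "similar to at least 10 other participants" requirement is reminiscent of a k-anonymity-style check on quasi-identifiers. The sketch below shows the basic idea on a toy table; the column names and the simple class-size test are my assumptions and stand in for the talk's more rigorous confidence-interval approach.

```python
import pandas as pd

MIN_CLASS_SIZE = 11  # each participant plus at least 10 statistically similar others

# Toy quasi-identifiers after generalization (age bands, sex, region).
df = pd.DataFrame({
    "age_band": ["40-49", "40-49", "50-59", "40-49"],
    "sex":      ["F",     "F",     "M",     "F"],
    "region":   ["West",  "West",  "East",  "West"],
})

# Size of each equivalence class: rows sharing identical quasi-identifier values.
class_sizes = df.groupby(["age_band", "sex", "region"])["sex"].transform("size")

# Records in classes smaller than the threshold need further generalization
# or suppression before the data can be considered anonymized.
at_risk = df[class_sizes < MIN_CLASS_SIZE]
print(f"{len(at_risk)} of {len(df)} records fall below the size-{MIN_CLASS_SIZE} threshold")
```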

In terms of lessons learned, David and Rachel's findings suggest that machine learning and artificial intelligence can be used to anonymize clinical health data at scale. However, it's by no means perfect and still requires human verification of results. Additionally, a statistical, quantifiable approach is important to measurably balance re-identification risks and data utility.

Building for Digital Health

In Building for Digital Health, Aditi Joshi and Dr. Oliver Aalami share lessons learned from developing an open-source platform for digital health research and applications. CardinalKit provides a framework to enable scalable healthcare services, lower the costs of access and delivery, and deliver equivalent or better health outcomes through digital applications.

Every year, clinicians attempt to conduct medical research in the digital health space. Researchers must generate an idea, determine what platform to use, identify metrics, acquire funding, hire and manage engineers, design studies, receive approvals, and finally conduct their research. This process generally costs $200,000 to $425,000 or more, requires 15 months of development time, and takes 2 years to launch.

Dr. Aalami's goal with CardinalKit is to reduce application development time by 18 months, save developers $150,000 in development costs, and simplify and unify the process overall.

To ease the burden of conducting digital healthcare research, CardinalKit provides boilerplate infrastructure. Developers are provided an iOS mobile application with a backend hosted in Google Cloud Platform. The iOS application comes pre-loaded with various integrations, including ResearchKit, HealthKit, watchOS, and CoreMotion. Additionally, it provides Bluetooth connectivity for wearables, 2-factor authentication, and granular data sharing controls for research participants.

With CardinalKit, developers can be up and running in a few hours.

If medical researchers use CardinalKit, they can directly port their application to Stanford's IT infrastructure once their studies are approved. This allows Stanford to directly manage any instances and provide central, well-vetted security and access management controls by default. Stanford also maintains a separate Business Associate Agreement with Google Cloud that satisfies legal obligations under the Health Insurance Portability and Accountability Act (HIPAA).

While CardinalKit may not make an application HIPAA compliant by default, it can help make it HIPAA-ready. Developers can offload certain responsibilities to the framework, like 2-factor authentication and encryption requirements, so researchers can spend their time and energy elsewhere. For a detailed breakdown of the distribution of risks and responsibilities between researchers, CardinalKit, Stanford, etc. check out the talk!

In sum, CardinalKit provides boilerplate infrastructure that can substantially reduce the financial cost and time required to jump-start digital healthcare research. It helps ensure developers follow particular best practices, guidelines, and protocols to comply with legal obligations like HIPAA.

Wrapping Up

I hope these posts have piqued your interest in PEPR 2021 and future iterations of the conference. Don't forget to check out the other Conference Recaps for PEPR 2021 as well!

If you liked this post (or have ideas on how to improve it), I'd love to know as always. Cheers!