The ease and widespread use of big data are exposing problems with the Health Insurance Portability and Accountability Act (HIPAA) and with current methods of protecting the privacy of individuals in research, in ways that were not possible in previous decades.

“The solution HIPAA gives us for guaranteeing privacy of health information is unsatisfactory in a lot of ways, and that’s been demonstrated over and over again,” says Stephanie Malia Fullerton, PhD, professor of bioethics and humanities at the University of Washington School of Medicine. “The question is what to do about it.”

Data scientists and other savvy investigators can combine de-identified data in a way that makes cross-references and re-identification possible.

For example, the authors of a 2018 study that examined step count data from mobile technology found that data containing only the number of steps a de-identified person takes are enough to uniquely identify individuals.1 Another study estimated that 99.98% of Americans could be correctly re-identified in any data set using 15 demographic attributes.2
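
The idea behind these findings can be illustrated with a simple count of how many records are the only ones with a given combination of attributes. The sketch below is a toy illustration only, not the method used in either study (the second study used generative models); the records, field names, and the `unique_fraction` helper are all hypothetical.

```python
from collections import Counter

def unique_fraction(records, attributes):
    """Share of records whose combination of the given attributes appears exactly once."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Invented, de-identified toy records (no names attached).
records = [
    {"zip": "98101", "birth_year": 1980, "sex": "F"},
    {"zip": "98101", "birth_year": 1980, "sex": "M"},
    {"zip": "98101", "birth_year": 1975, "sex": "M"},
    {"zip": "98115", "birth_year": 1975, "sex": "M"},
]

frac_one = unique_fraction(records, ["zip"])                       # ZIP code alone
frac_all = unique_fraction(records, ["zip", "birth_year", "sex"])  # three attributes combined
print(frac_one, frac_all)
```

In this toy data, only one record in four is unique on ZIP code alone, but every record is unique once birth year and sex are added. Each added attribute can only split groups further, which is why a long list of demographic attributes tends to single people out.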

“This is where data security procedures become really important,” notes Megan Doerr, MS, LGC, principal scientist, governance at Sage Bionetworks in Seattle. “How are the data being safeguarded? Are there ways to prevent tampering with data? These sorts of questions are important for IRBs to understand, and they can impact the integrity of research being proposed.”

All an investigator has to do is purchase a commercial data set and cross-link it to health data such as information from wearables, explains Stephen Rosenfeld, MD, MBA, president of Freeport (ME) Research Systems. Rosenfeld is the chair of the Secretary’s Advisory Committee on Human Research Protections. Anyone can purchase commercial data sets, which means that everything is readily identifiable, he adds.

“A health record for a white female in Seattle is not inherently identifiable until you combine it with other information,” Fullerton says. “Combining data sets poses the privacy risk.”
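
A minimal sketch of the kind of cross-linking described here, assuming two invented data sets that share a few demographic fields. All records, names, and field names below are hypothetical, and real linkage work typically involves fuzzier matching than this exact-match join.

```python
# Fields present in both data sets that can act as a joining key.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

# De-identified health data: no names, but demographic fields remain.
health_records = [
    {"zip": "98101", "birth_year": 1980, "sex": "F", "diagnosis": "type 2 diabetes"},
    {"zip": "98101", "birth_year": 1975, "sex": "M", "diagnosis": "hypertension"},
    {"zip": "98115", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
]

# Purchased commercial data: names attached to the same demographic fields.
commercial_records = [
    {"name": "Person A", "zip": "98101", "birth_year": 1980, "sex": "F"},
    {"name": "Person B", "zip": "98101", "birth_year": 1975, "sex": "M"},
    {"name": "Person C", "zip": "98101", "birth_year": 1975, "sex": "M"},
]

def key(record):
    """Build the joining key from the shared demographic fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)

def link(health, commercial):
    """Return (name, diagnosis) pairs where a health record matches exactly one named record."""
    names_by_key = {}
    for record in commercial:
        names_by_key.setdefault(key(record), []).append(record["name"])
    matches = []
    for record in health:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # an unambiguous match re-identifies the health record
            matches.append((names[0], record["diagnosis"]))
    return matches

reidentified = link(health_records, commercial_records)
print(reidentified)
```

Neither data set identifies anyone's health status on its own. Here, the first health record matches a single named person, the second matches two people and stays ambiguous, and the third matches no one.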

Cross-referencing poses problems, but it also gives data great meaning and utility. “It gets very tricky, very quickly,” Fullerton says.

Address Privacy Risk

Investigators, IRBs, and research bioethicists might be hesitant to confront this privacy issue because of the potential for useful research to be shut down, Rosenfeld says.

“We have to expect that ethics reflects the expectations of society,” he adds. “Everyone knows about this problem, so it’d be nice if we talked about a framework for ethical expectations for research with big data and tried to understand what people found permissible and what their expectations are.”

Another example is a researcher who purchases data from a large grocery store chain, including records of diabetic testing kit purchases, says James Riddle, MCSE, CIP, CPIA, CRQM, vice president of institutional services with Advarra in Columbia, MD. This information could overlap with a medical database, and investigators could look for clusters of people within a 25-mile geographic area to see where diabetes cases are most prevalent, he says.

Combining the information could lead to re-identification. “By themselves, the individual data sets might not even constitute human subjects research because they are de-identified,” Riddle says. “But when you combine these data sets, then they could become identifiable, and there are risks that IRBs would have to evaluate and weigh.”

IRBs should consider these privacy issues with studies using big data:

• Decide on informed consent. “The two main issues are whether they are human subjects and can you waive consent,” Rosenfeld says. “Even if you don’t name them, in a way you are using them as research subjects — but without names. They are virtual subjects. Whether that is deserving of protection or not is a question.”

Research projects in which investigators agree not to cross-link the data with another database that would enable re-identification can be considered not to be human subjects research, Rosenfeld says.

“If it is human subjects research because data are identifiable, then it’s the nature of big data research that you really have to have a waiver of consent,” he adds.

The general understanding at the Office for Human Research Protections is that blanket consent to future research is not compliant, Rosenfeld adds.

“The regulations try to address that through adoption of broad consent. It was well-meaning, but there are practical issues with its application, and I don’t think anyone is using it,” he says. “There needs to be limited IRB review, and you could set expectations; for example, ‘I am not going to use this biospecimen in this bank for research regarding human reproduction,’” he explains. “Someone has to review individual studies to make sure they’re consistent with those limitations.”

Understand Data Storage

• Protect stored data. Data storage is somewhat different and a little more complex than it was even a decade ago. For instance, instead of worrying about laptops and computer hard drives, researchers need to consider the safety of cloud storage and wearable sensors.

If researchers have legitimate reasons to collect monitoring information using commercial wearables, Fullerton says IRBs might ask these questions:

- Where are data stored?

- Is the company that makes the device also holding data?

- Who can access the information?

- What safeguards are in place to ensure data are not misused?

- What happens at the end of the study?

- Where do the data go?

“These are bread-and-butter questions,” Fullerton says. IRBs should insist this information is part of the consent process, informing people the device manufacturer also is collecting data and may use it for commercial purposes, she adds.

“If you are providing an Apple Watch as part of an incentive for the study, you are exposing [participants] to a risk they might not have voluntarily chosen to be exposed to,” she adds. “This should be revealed as part of the informed consent process and carefully managed.”

• Maintain anonymity. One way to retain de-identified status is through technology.

“There are technologies like differential privacy,” Rosenfeld says. With this approach, researchers use technology to obfuscate data: They add noise and tweak values so the data cannot be matched against a big data set. This could include changing ZIP codes or other information that is irrelevant to the actual research outcomes, he explains.

“You could maintain the anonymity of the data that didn’t answer the big research question,” Rosenfeld says. “It’s possible to do that, and there’s small literature on that, but the hurdle is that it has to be done on a study-by-study basis. It’s expensive and adds another layer of ethical complexity.”
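
One widely used form of differential privacy is the Laplace mechanism, which adds calibrated random noise to the answer of a query rather than editing individual fields. The sketch below applies it to a simple count and is an illustration under stated assumptions, not the specific approach Rosenfeld describes; the records, the `epsilon` value, and the helper names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy the predicate.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon. Smaller epsilon means more noise.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Invented toy records for illustration.
records = [
    {"zip": "98101", "has_diabetes": True},
    {"zip": "98101", "has_diabetes": False},
    {"zip": "98115", "has_diabetes": True},
]

rng = random.Random(0)  # seeded only so the sketch is repeatable
noisy = private_count(records, lambda r: r["has_diabetes"], epsilon=1.0, rng=rng)
print(noisy)
```

Any single released count is off by a random amount, which makes it harder to tell whether a particular person is in the data, while averages over many people remain close to the truth. Choosing `epsilon` is the study-by-study judgment call: it trades privacy protection against the accuracy the research question needs.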


  1. Na L, Yang C, Lo C-C, et al. Feasibility of reidentifying individuals in large national physical activity data sets from which protected health information has been removed with use of machine learning. JAMA Netw Open 2018;1:e186040.
  2. Rocher L, Hendrickx JM, de Montjoye Y-A. Estimating the success of re-identifications in incomplete datasets using generative models. Nat Commun 2019;10:3069.