By Melinda Young, Author

Human subjects research increasingly involves the use of large data sets that allow analysts to drill down to the most specific of details from healthcare records or other databases. This creates challenges in ethical data analysis and information privacy.

Researcher Hye-Chung Kum, PhD, associate professor at Texas A&M University in College Station, was involved in a study about this topic, titled “Controlling privacy risk in database studies for human subjects protection via a privacy budgeting system.”1

The study proposed a privacy budget system for human subject protection, using anonymity set size to define allowable risk in database studies.1

Kum explains more about the study and how this privacy budgeting system works in the following Q&A:

IRB Advisor: How do you determine the anonymity set size (n)?

Kum: That should be based on the acceptable policies and practices of the organization and legal restrictions in the application.

For example, I am on a project where the DUA [data use agreement] specifies that aggregate data representing fewer than 10 people cannot be published. In such situations, the anonymity set size equals 10, and information that represents fewer than 10 people poses a risk that needs to be quantified. On the other hand, information that represents more than 10 people does not pose a risk of identification.

IRB Advisor: Why was this chosen as the way to define allowable risk in database studies?

Kum: I think there might be a misunderstanding. The anonymity set size is not the allowable risk, but rather the threshold below which risk needs to be quantified. Information that represents more people than the given anonymity set threshold is considered not to have any risk of privacy violation.

There is negligible risk of identification because the exposed information represents many people and, thus, the information is not sufficient to know who it is about. For example, if there are 10 people named “Tom” in the database, there is no way to know which of the 10 people that exposed information is about.
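As a rough illustration of the threshold idea described above (not the study's implementation), the anonymity set for a record can be thought of as the number of records that share the same disclosed values. The function names, the sample records, and the choice to treat a set of exactly 10 as safe are all assumptions made for this sketch.

```python
from collections import Counter

def anonymity_set_sizes(records, fields):
    """For each record, count how many records share its values on `fields`."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]

def needs_risk_quantified(records, fields, threshold=10):
    """Flag records whose anonymity set is smaller than the threshold.

    Assumption: a set of exactly `threshold` people is treated as safe.
    """
    return [size < threshold for size in anonymity_set_sizes(records, fields)]

# Hypothetical data: ten people named "Tom" and one person named "Hye-Chung".
records = [{"first_name": "Tom"} for _ in range(10)] + [{"first_name": "Hye-Chung"}]

sizes = anonymity_set_sizes(records, ["first_name"])
flags = needs_risk_quantified(records, ["first_name"], threshold=10)
```

In this toy data, disclosing the first name leaves each "Tom" hidden among 10 records, while the single "Hye-Chung" record falls below the threshold and would need its risk quantified.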

Although this level of detail is not discussed in the poster,1 the framework has a separate parameter, called allowable risk, that can be set as a hard threshold if desired. When the allowable risk parameter is set, no information is displayed unless it represents more than the allowable risk count. The default for this parameter is set to 1, meaning that any information may be displayed.

IRB Advisor: How does your privacy risk score work? What kind of score is it, and what does it mean when a person in a database has a rare name like “Hye-Chung”?

Kum: Let me answer the second question first. When a person in the database has a unique name, such as “Hye-Chung,” this means that the only information that has to be disclosed to know exactly who this person is would be just the first name.

On the other hand, if a person has a common name like “Tom,” the name itself is not much of a privacy risk, since it has very little power to identify a unique person. So “Hye-Chung” has a high risk score, and “Tom” has a low risk score.

The privacy risk of any disclosed information can be measured in terms of how unique that information is. So if you have a birthday, 2/29/2000, and you happen to be the only person in the database born on 2/29, just the month and day of your birth is sufficient to identify you exactly.

Now to answer the first question, the score is based on this principle: The risk score is between 0% and 100%. When all information is disclosed, the score is 100%; when no information is disclosed, it is 0%. It is a cumulative score for the whole database, so if there are “n” records in all the databases that need to be linked, each row is allotted 1/n of the total score (that is, 100/n%). Then, for any given information disclosed for the row, the row’s score ranges from 0% to that allotted maximum. It is close to 0% if the information disclosed does not uniquely identify the person represented in the row.

On the other hand, if the information disclosed uniquely identifies a person, then the row’s score would be close to the maximum (1/n of the total). The actual formula is pretty complicated and takes into account multiple things to meet certain properties (e.g., everything adds up to 100%).
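The per-row budgeting idea can be sketched with a deliberately simplified stand-in for the actual formula, which is not given here. In this toy version, each row receives an equal share of 100%, and that share is divided by the number of rows that look identical on the disclosed fields. The function name, the 1/(set size) weighting, and the sample records are assumptions for illustration only; unlike the real score, this toy reaches 100% only when the disclosed fields make every row unique.

```python
from collections import Counter

def toy_risk_score(records, disclosed_fields):
    """Return a database-wide score in [0, 100] using a toy 1/(set size) rule.

    Assumption: this is NOT the study's formula, only a simple stand-in.
    """
    n = len(records)
    if n == 0 or not disclosed_fields:
        return 0.0  # nothing disclosed means no risk in this sketch
    keys = [tuple(r[f] for f in disclosed_fields) for r in records]
    set_sizes = Counter(keys)
    row_budget = 100.0 / n  # each row's maximum share of the total
    # A row that is unique on the disclosed fields uses its full budget;
    # a row shared by many look-alikes uses only a small fraction of it.
    return sum(row_budget / set_sizes[k] for k in keys)

# Hypothetical records.
records = [
    {"id": "A1", "first_name": "Tom", "birth_year": 1980},
    {"id": "A2", "first_name": "Tom", "birth_year": 1980},
    {"id": "A3", "first_name": "Tom", "birth_year": 1990},
    {"id": "A4", "first_name": "Hye-Chung", "birth_year": 1990},
]

score_none = toy_risk_score(records, [])
score_name = toy_risk_score(records, ["first_name"])
score_name_year = toy_risk_score(records, ["first_name", "birth_year"])
score_id = toy_risk_score(records, ["id"])
```

With these four records, the score rises as the disclosed fields make more rows unique, and disclosing the unique ID alone uses the entire budget.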

But the most important thing is that most of the risk is accounted for when information that is unique to a person is disclosed. So even if a lot of information is disclosed, the risk score stays low if none of that information is actually unique to anyone in the database. For example, if year of birth is disclosed for everyone but is not unique to anyone, the risk score would be low.

IRB Advisor: In your poster,1 it says that the proposed method has not been implemented, and more time is required to finalize and evaluate the concept. Is anyone planning to evaluate the concept? If not, why not?

Kum: I have a PCORI [Patient-Centered Outcomes Research Institute] project that is working on implementing a prototype and evaluating it. 
(For more information on the project, visit:

We hope to have initial evaluations done by the end of the project period, and have plans to secure more funding to harden the code. You can see how this works in a user study implementation we did at:

You can see how “percentage of characters disclosed” compares with the risk score when different information is disclosed. The “percent disclosed” is simple to understand, but it does not represent actual risk. For certain information you barely see any movement in the risk score, but for other information (such as ID numbers), you see large scores. In the “percentage disclosed,” by contrast, the score is uniform across all information regardless of the actual risk.
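The contrast can be illustrated with a small sketch. It assumes one plausible definition of “percentage of characters disclosed” (characters shown divided by all characters stored) and uses the share of rows made unique as a crude proxy for risk. Both definitions, the function names, and the sample records are assumptions, not the measures used in the user study.

```python
from collections import Counter

def percent_characters_disclosed(records, disclosed_fields):
    """Characters shown as a percentage of all characters stored (assumed definition)."""
    total = sum(len(str(v)) for r in records for v in r.values())
    shown = sum(len(str(r[f])) for r in records for f in disclosed_fields)
    return 100.0 * shown / total if total else 0.0

def percent_rows_unique(records, disclosed_fields):
    """Percentage of rows that are one of a kind on the disclosed fields."""
    if not records or not disclosed_fields:
        return 0.0
    keys = [tuple(r[f] for f in disclosed_fields) for r in records]
    counts = Counter(keys)
    unique_rows = sum(1 for k in keys if counts[k] == 1)
    return 100.0 * unique_rows / len(records)

# Hypothetical records: a unique four-character ID and a shared four-character value.
records = [
    {"id": "1001", "sex": "male"},
    {"id": "1002", "sex": "male"},
    {"id": "1003", "sex": "male"},
    {"id": "1004", "sex": "male"},
]

chars_id = percent_characters_disclosed(records, ["id"])
chars_sex = percent_characters_disclosed(records, ["sex"])
unique_id = percent_rows_unique(records, ["id"])
unique_sex = percent_rows_unique(records, ["sex"])
```

Here, showing the ID column and showing the sex column disclose the same share of characters, yet only the ID column singles out every person.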

IRB Advisor: If this risk score is evaluated and implemented, how might it change the way privacy is viewed in database studies?

Kum: If the risk score is implemented and commonly used, people would be able to measure the risk of privacy loss in database studies more precisely, because the score would be directly linked to actual risk rather than to the very rough potential for risk that arises when certain identifiers are shared. It would make it more feasible to rely on an “expert determination” style of privacy risk management based on real risk, as opposed to a “safe harbor” style, which assumes a worst-case scenario and makes database studies seem much riskier than they really are.

IRB Advisor: Is there anything else you would like to say about this study and risk score methodology?

Kum: Using and leveraging personal data has inherent risks for information privacy, as it has been proven mathematically that information privacy is a budget-constrained problem. The key to properly managing the risk for legitimate uses of personal data is balancing the utility of data with the privacy risk. In particular, since informed consent in retrospective database studies is not possible, it is important that IRBs learn to balance the risk of harm and benefits to society.

Risk in retrospective database studies can be effectively managed to minimal levels through proper use of secure computing systems and training. Given how common use of personal data is these days, retrospective database studies that use personal data do not pose risks greater than those ordinarily encountered in daily life when such systems and oversight exist.

In addition, IRBs should recognize that authorized access to personal data for research is an important and legitimate use for social benefit. It is critical that we facilitate accountable use of data to improve healthcare and benefit society, just as it is used by the private sector for marketing, campaigning, etc.


1. Ferdinand A, Schmit C, Kum H-C. Controlling privacy risk in database studies for human subject protection (HSP) via a privacy budgeting system. Poster presented at PRIM&R’s 2016 Advancing Ethical Research Conference, Nov. 13-16, 2016, in Anaheim, CA. Poster: 17.