Data mining and data warehousing may be relatively new, but they're here to stay

Here's how clinical trials will improve with them

The use of data mining in the pharmaceutical clinical trial industry might still be in the Triassic era, but that is changing, as industry leaders learn more about how new methods of using research and medical data can improve clinical trial efficiency and patient care, according to an expert.

"People talk about individualized medicine, and data mining is the right tool for the delivery of individualized medicine," says Andrew Kusiak, PhD, a professor of industrial engineering at the University of Iowa in Iowa City, IA.

The pharmaceutical industry and clinical trial industry have only touched the surface of data mining, often mistaking data analyses for true data mining, Kusiak notes.

For historical reasons, true data mining has not yet caught on in the industry, he says.

"There are many statisticians who do data analysis, and as this new science of data mining emerges, it's up to them to either accept it and recognize it as something that's different or they could just basically stay with what they already know and try to reject it as a tool," Kusiak says.

Each time an established industry is presented with a new tool or advancement, it takes some time for the field's practitioners to accept it, Kusiak says.

Kusiak often speaks at clinical trial and pharmaceutical industry conferences, offering attendees a tutorial in how the new data tools work.

Here are some basic data mining concepts and their definitions:

  • Data mining: "It's an emerging science that deals with the discovery of patterns in data, and those patterns can be presented or interpreted in different ways," Kusiak says. "One way to interpret those patterns is to think of them as new knowledge."

The new knowledge, which is drawn from the historical data, can be used for various purposes, including decision making, Kusiak says.

"For example, one could extract knowledge, and that knowledge could be used to prescribe the right medication in the right quantity for minimal side effects for a patient," Kusiak says.

"The knowledge also could be used to select the most appropriate patients for a particular clinical study," Kusiak adds. "So, not only the effectiveness of the study and the outcomes could be maximized, but also the cost of the study could be minimized."
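The dosing idea Kusiak describes can be sketched in a few lines. The following is a minimal illustration, not his method: the records, field names (`dose_mg`, `effective`, `side_effect`), and numbers are all invented, and it stands in for real pattern-discovery algorithms with a simple rule — pick the dose that meets an efficacy threshold with the fewest observed side effects.

```python
# A minimal sketch of extracting decision knowledge from historical data.
# All records and fields below are hypothetical, not from any real trial.

records = [
    {"dose_mg": 10, "effective": True,  "side_effect": False},
    {"dose_mg": 10, "effective": False, "side_effect": False},
    {"dose_mg": 20, "effective": True,  "side_effect": False},
    {"dose_mg": 20, "effective": True,  "side_effect": True},
    {"dose_mg": 40, "effective": True,  "side_effect": True},
    {"dose_mg": 40, "effective": True,  "side_effect": True},
]

def dose_profile(records):
    """Summarize efficacy and side-effect rates per dose level."""
    stats = {}
    for r in records:
        s = stats.setdefault(r["dose_mg"], {"n": 0, "eff": 0, "adv": 0})
        s["n"] += 1
        s["eff"] += r["effective"]
        s["adv"] += r["side_effect"]
    return {d: (s["eff"] / s["n"], s["adv"] / s["n"]) for d, s in stats.items()}

def recommend_dose(records, min_efficacy=0.5):
    """Pick the dose meeting an efficacy threshold with the fewest side effects."""
    profile = dose_profile(records)
    eligible = [(adv, d) for d, (eff, adv) in profile.items() if eff >= min_efficacy]
    return min(eligible)[1] if eligible else None

print(recommend_dose(records))  # 10: the only arm with no side effects observed
```

The extracted "knowledge" here is the if-then rule the function embodies; real data mining tools derive far richer rules, but the workflow — historical data in, actionable rule out — is the same.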

While some people might view data mining as a subset of statistics, that's not a proper definition, Kusiak says.

"It's essentially its own discipline that has different shades and other theories," he explains.

While statistics is a population-based science whose goal is to discover truths about populations, data mining looks at individual information and analyzes individuals in the context of a subset of similar individuals, Kusiak says.

"Data mining delivers what's good for individuals, rather than what works for the population, and it's a very distinct science," he adds.
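The distinction Kusiak draws can be made concrete with a toy sketch. Assuming a made-up set of patient records (age, weight, response score — all numbers invented), the population-based answer is one average for everyone, while the individualized answer averages only over the patient's nearest neighbors:

```python
# Sketch contrasting a population-level estimate with an individualized one.
# Patient tuples are (age, weight_kg, response_score); all values are made up.

patients = [
    (25, 60, 0.90), (30, 65, 0.80), (70, 80, 0.30),
    (65, 85, 0.20), (28, 62, 0.85), (72, 90, 0.25),
]

def population_estimate(data):
    """The population-based answer: one average for everyone."""
    return sum(r for _, _, r in data) / len(data)

def individual_estimate(data, age, weight, k=3):
    """The individualized answer: average over the k most similar patients."""
    by_dist = sorted(data, key=lambda p: (p[0] - age) ** 2 + (p[1] - weight) ** 2)
    return sum(r for _, _, r in by_dist[:k]) / k

print(round(population_estimate(patients), 2))          # 0.55 for everyone
print(round(individual_estimate(patients, 27, 61), 2))  # 0.85 for this young patient
```

For a young, lighter patient the two answers diverge sharply, which is exactly the point: the population average hides what is true for the individual.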

So people who use data mining from the statistical perspective haven't taken the effort to examine the algorithms and theories that come with true data mining, which is applied from an individualized, patient-based perspective rather than a population-based one, Kusiak says.

Data mining was created around computer science, using some statistics and mathematical logic, and that's one reason why it hasn't caught on in the clinical trials business, Kusiak notes.

"There's a huge gap now between computer scientists and clinical application, and so that's where the problem comes in," he says.

The solution is to educate more people in the industry about data mining, but this strategy sometimes encounters the barrier posed by patient confidentiality, Kusiak says.

"In my classes, people ask me for case studies, so I always tell them to give me their data, and then I'll show them case studies," Kusiak says. "The data we use with algorithms requires us to sign many disclosure forms, so that's one of the barriers."

While it would be nice to have benchmarked data sets available, and people could present the value of data mining based on those data sets, it has not yet happened because of the confidentiality barrier, Kusiak adds.

  • Data flow modeling: Before data mining can begin, it's important to know where the data are located and how to manage them better, Kusiak says.

Data flow modeling identifies where the data are, helps to improve the data flow, and defines appropriate data, Kusiak says.

"In many applications, there are plenty of data that are very useless essentially," Kusiak explains. "There have been some processes in place or someone was overzealous, and people don't pay attention to it, so over the years you might have been collecting data that nobody is using."

This is why a first step is to establish processes for good data collection and assessing data quality, he says.

A data flow model maps how data are organized and move within an organization, improving and optimizing the flow in terms of quality and cycle time, Kusiak says.

"Some data may take too long to get from one place to another place," he offers as an example of a data flow problem.

The model could help improve efficiency and reduce the cost of data collection, Kusiak adds.
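A data flow model can be as simple as a graph whose nodes are the places data live and whose edges carry a transfer time. The organizations and timings below are purely illustrative, but they show how such a model surfaces the slow hop Kusiak mentions:

```python
# A toy data flow model: edges are (source, destination, transfer time in days).
# Organization names and timings are invented for illustration only.

flow = [
    ("trial site", "CRO",      2),
    ("CRO",        "sponsor",  14),  # the slow hop
    ("sponsor",    "biostats", 1),
]

def cycle_time(flow):
    """Total time for data to traverse the whole flow."""
    return sum(days for _, _, days in flow)

def bottleneck(flow):
    """The slowest hop -- the first place to optimize."""
    return max(flow, key=lambda edge: edge[2])

print(cycle_time(flow))  # 17 days end to end
print(bottleneck(flow))  # ('CRO', 'sponsor', 14)
```

Once the flow is written down this way, "streamlining data flow" becomes a measurable exercise: shorten the bottleneck edge and re-compute the cycle time.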

"This is a business framework we're bringing to data flow," Kusiak says. "It's a known method that has been used since the 1990s in the main industries, but is new to the pharmaceutical companies."

The clinical trial and pharmaceutical industries have given little thought to optimizing data flow, he notes.

"Looking at this from a wide perspective, you have data flow within an organization; data flow between organizations, like pharmaceutical companies and clinical research organizations [CROs] or pharmaceutical companies and the FDA, so this applies to everyone," Kusiak says. "We can use other terms like streamlining data flow or reducing bureaucracy within organizations, and those goals could be accomplished with data flow modeling."

  • Data warehousing: "Between data flow modeling and data mining is warehousing," Kusiak says.

"We typically mine transactional databases, which are databases and files stored at CROs or pharmaceutical companies, or even at the FDA," he explains.

"In recent years, the warehousing technology has been introduced," Kusiak says. "It's a collection of different databases that has been transformed, and the data quality has been improved and is designed for effective usage."

For instance, if there is a clinical question of any type, then the person posing the question could obtain the answer within a fraction of a second rather than having to dig through six different databases, Kusiak says.

Data mining is another application.

Suppose a warehouse held genetic data linked with patient data; the entire database could then be mined, Kusiak proposes.

Or, another potential use of warehousing would be to merge data from different clinical areas into one centralized repository and then mine them, he adds.
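The consolidation step is easy to sketch with an in-memory SQLite database. The table layouts and values below are invented, but the pattern — load several transactional sources into one cleaned, unified schema, then answer a clinical question with a single query — is the essence of what Kusiak describes:

```python
import sqlite3

# Sketch of consolidating two transactional sources into one warehouse table.
# Table layouts and values are invented for illustration.

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE trial_a (patient_id TEXT, outcome TEXT);
    CREATE TABLE trial_b (subject TEXT, result TEXT);
    INSERT INTO trial_a VALUES ('P1', 'responder'), ('P2', 'non-responder');
    INSERT INTO trial_b VALUES ('P3', 'responder');

    -- The warehouse: one unified schema over both transactional sources.
    CREATE TABLE warehouse (patient_id TEXT, outcome TEXT, source TEXT);
    INSERT INTO warehouse
        SELECT patient_id, outcome, 'trial_a' FROM trial_a
        UNION ALL
        SELECT subject, result, 'trial_b' FROM trial_b;
""")

# One question, one query -- instead of digging through each source database.
n = db.execute(
    "SELECT COUNT(*) FROM warehouse WHERE outcome = 'responder'"
).fetchone()[0]
print(n)  # 2 responders across both trials
```

In practice the transformation step also standardizes codes and units and repairs quality problems; that cleaning is what makes the warehouse "designed for effective usage."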

Data warehousing has a great deal of potential, but it also can be costly, so companies should investigate its potential use carefully, Kusiak says.

Computer software can handle all of the security issues arising from data warehousing, he notes.

"If you have someone's patient or genetic data, you cannot find out who this person is if you don't have the patient identifier or address because the data doesn't reveal this information," Kusiak explains. "So what needs to be protected is the personal identifier."

Also, the personal identifiers do not have to be stored in the data warehouse. These can remain with the doctor and facility providing medical care, Kusiak says.
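The separation Kusiak describes can be sketched as simple pseudonymization: the warehouse stores only an opaque code plus clinical data, while the linkage table that maps names to codes stays with the care provider. The names, secret, and fields below are all hypothetical:

```python
import hashlib

# Sketch of separating personal identifiers from warehoused records.
# The provider keeps the name-to-pseudonym mapping; the warehouse never sees names.

def pseudonym(name, secret="clinic-only-secret"):
    """Stable pseudonym; the secret stays with the care provider."""
    return hashlib.sha256((secret + name).encode()).hexdigest()[:12]

# At the provider: build the linkage table and keep it locally.
linkage = {name: pseudonym(name) for name in ["Alice Smith", "Bob Jones"]}

# In the warehouse: only pseudonyms plus clinical data are stored.
warehouse = [
    {"pid": linkage["Alice Smith"], "genotype": "AA", "response": "good"},
    {"pid": linkage["Bob Jones"],   "genotype": "AG", "response": "poor"},
]

assert all("name" not in row for row in warehouse)  # no identifiers leave the clinic
```

Production de-identification systems add safeguards this sketch omits (key management, re-identification risk analysis), but the architectural point is the same: protect the identifier, and the warehoused data alone cannot reveal who the patient is.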

Another advantage of a data warehouse is that it provides additional protection by permitting data to be stored off site, which could be helpful in the event of a natural disaster, for instance.

  • Knowledge management: "Knowledge is important, so companies are always concerned with managing knowledge assets," Kusiak says.

"Some knowledge stays with people, so we have to take good care of people so they won't take their knowledge and disappear," he adds.

Another part of knowledge management is to prevent knowledge degradation by taking better care of the data, Kusiak says.

"If I have good quality data, then I could have the best knowledge and won't be using knowledge that's out of date," he explains.

The goal is to store good quality data for long periods of time and use data mining when it's necessary to extract information that can be put to good purposes, Kusiak says.

This could include textual information, interviews, and discussions, he says.

"Within data mining, we could have text mining algorithms, so we can extract knowledge from both data and the text at the same time," he says. "So I could go through emails in a corporation and minutes of meetings and extract knowledge from these."
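At its simplest, the text-mining pass Kusiak describes surfaces the most frequent meaningful terms across a batch of documents. The meeting minutes and stopword list below are invented; real text mining goes well beyond term counting, but this shows the basic shape:

```python
import re
from collections import Counter

# A bare-bones text-mining pass over a batch of documents.
# The meeting minutes below are invented examples.

minutes = [
    "Enrollment is behind schedule at two sites; protocol amendment discussed.",
    "Protocol amendment approved; sites to resume enrollment next week.",
]

STOPWORDS = {"is", "at", "two", "to", "next", "week", "the", "a"}

def top_terms(docs, n=3):
    """Return the n most frequent non-stopword terms across the documents."""
    words = []
    for doc in docs:
        words += [w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOPWORDS]
    return [term for term, _ in Counter(words).most_common(n)]

print(top_terms(minutes))  # recurring themes such as 'enrollment'
```

Even this crude count reveals that enrollment, sites, and the protocol amendment are the recurring themes — the kind of signal that, at scale and with better algorithms, becomes extracted knowledge.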

In the past, knowledge was stored in repositories and was cumbersome to extract.

"Things are changing quite fast today, so by trying to protect a process that doesn't work today doesn't make sense," Kusiak says. "Knowledge that's five years old is not applicable now."

The goal is to extract new knowledge as needed, and this can force controversial changes within an organization, he notes.

For example, a company's department that manages knowledge will have to solve the problem of deciding what to store and what not to store, Kusiak adds.

"But for clinical trials, especially, and the pharmaceutical industry, the main thing is data which dictates everything else," Kusiak says. "Drug creation is about data, and so it's a data-driven business, and there's value in the data."