What Role Does Statistical Analysis Play in Clinical Research?

DISCLAIMER: This article has been updated from its original publication to support Data Privacy Day.

We know the important roles that data and statistics play in clinical research and healthcare. To give insight into the power of data to improve people’s health and to show how statistics works in healthcare, we spoke with Ewa Kleczyk, PhD, Vice President, Client Analytics—Symphony Health.

With Data Privacy Day coming up at the end of January, we should all consider how our data is being used, collected, and shared online. Data Privacy Day aims to help us envision an online world built on privacy and trust. Read Ewa’s thoughts on data and privacy in healthcare below.

Key Highlights

We spoke with Ewa J. Kleczyk, PhD, Vice President, Client Analytics, Symphony Health, about the importance of statistics and data when it comes to study design and research. #WorldDataPrivacyDay

Samantha Mineroff

Can you tell us a little about your experience and expertise with regard to data and statistics and how you came to work for a health data company? Can you also tell us a little about Symphony Health’s experience and expertise in this area?

My undergraduate degree was in Economics with a minor in Mathematics from the University of Maine. I also have two Master’s degrees, one in Resource Economics and Policy with Development Specialization from the University of Maine, and the other in Applied and Agricultural Economics from Virginia Tech. I also have a PhD in Economics from Virginia Tech.

I started my career as a statistical analyst in 2005, working for TargetRx, which was a pharmaceutical marketing research firm focused on primary market research. I worked my way up to the consulting organization, then came back through an acquisition of ImpactRx, where I co-led the analytical services team. With the merger of Symphony Health and ImpactRx, I led the commercial effectiveness analytics and, along with another partner, helped create the commercial effectiveness team. In 2017, I was asked to take on my current role, which is leading the entire client analytics and delivery team for Symphony, a PRA Health Sciences company. In my role, I lead groups focusing on a variety of analytics areas, including targeting and compensation analysis, brand and managed market analytics, advanced analytics, and media analytics.

Symphony specializes in analyzing data on the commercial side of the pharmaceutical industry. We apply a variety of analytics methods and approaches, from simple analytics to advanced analytics, statistical modeling, and machine learning.

Symphony has a large healthcare claims database of more than 300 million longitudinally tracked patients, which includes prescription claims, medical claims, hospital claims, diagnosis claims, procedure claims, and more. All of this data allows us to run analytics related to product launches, inform the commercial product deployment strategies as well as provide insights related to the utilization or performance of pharmaceutical brands, patient drug utilization, patient cost of care and access, and so forth. Much of our statistical analysis is related to guiding the clients on how their products are performing and how to create strategies to ensure optimal treatment access for patients.

Can you tell us about the importance of statistical analysis and data in the healthcare industry?

From my point of view, statistical analysis is the basis for ensuring that true and representative insights are shared for decision-making processes, as well as for developing and establishing public health policy.

Robust data and confirmed results can be disseminated within the healthcare industry to ensure improvement of patient outcomes, increase speed to patient diagnosis, and provide the right treatment therapies at the right time to the right patients.

Often, statistical analysis is used for inferential analytics or results, which helps us identify the causality between different variables, especially when we’re looking at product efficacy and studying what is driving those relationships. This way, we can ensure that not only will we bring efficacious and safe drugs to the market, but also understand the true outcomes that we're trying to study, confirm and validate.

Statistical analysis is key when it comes to working with patients, health care providers, and the data that is currently available. The analysis helps us understand what truly drives the decisions that physicians make when it comes to patient healthcare and access. Most importantly, it provides insights to optimize strategies impacting patient wellbeing and their treatment pathways within the healthcare environment.

If we don't apply statistical analysis, or if we apply it improperly, the overall outcomes can be catastrophic, not only for the safety and efficacy of the drugs, but in general as well. It can lead to policies that do not address the true problem we are trying to solve.

How do we ensure the data we use in our statistical models is the highest quality?

At PRA and Symphony, we primarily work with healthcare claims data and wearable data. These data types represent vast amounts of data. Furthermore, wearables capture and transmit data every minute of every hour of each day, increasing the wealth of information available for analytics. This large amount of data then needs to be analyzed to produce the insights to inform healthcare decisions and public health policy. Analytics teams are the ones responsible for ensuring that the data we work with is pure and unbiased, is reviewed from all angles, and that the insights we garner from it are based on the best and most representative data available.

Unfortunately, data is often not studied from all angles, which leads to flawed and biased insights. For example, you can ask: What are the modes of data collection? What are the biases of those modes? If you're doing a survey, are you really sampling a part of the population that is representative of the larger one? All of these items must be reviewed and confirmed. This is where statistics helps us tremendously and guides our research framework.

In order to properly design a statistical analysis or derive any insights, whether through machine learning or simple statistical summaries, you need to understand where your biases lie, how the information was collected, and how representative that information is of the larger patient population. Otherwise, you won’t be able to give logical and sound conclusions.
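
As a rough illustration of the representativeness check described above, the sketch below compares each group's share of a sample with its share of the larger population. The age groups, counts, and shares are invented for the example, and `representation_gaps` is a hypothetical helper, not a method described in the interview.

```python
# Hypothetical shares and counts: every number here is made up for illustration.
population_share = {"18-39": 0.35, "40-64": 0.40, "65+": 0.25}
sample_counts = {"18-39": 120, "40-64": 200, "65+": 180}


def representation_gaps(sample_counts, population_share):
    """Return (sample share - population share) for each group."""
    total = sum(sample_counts.values())
    return {
        group: sample_counts[group] / total - population_share[group]
        for group in population_share
    }


gaps = representation_gaps(sample_counts, population_share)
for group, gap in gaps.items():
    # A negative gap means the group is under-represented in the sample.
    print(f"{group}: {gap:+.2f}")
```

In this made-up sample the youngest group is under-represented and the oldest is over-represented, which is the kind of bias that would need to be understood before drawing conclusions about the larger patient population.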

From the commercial point of view, statistical analysis is really important in ensuring robust, validated analysis results and insights. For example, if you don't understand the biases, coverage, and benefits/shortcomings of the datasets, the insights you provide might be skewed. That will ultimately impact who is treated, how they're treated, with what frequency, and at which time. This might also impact a patient’s access to treatment, their survival and quality of life, and the actual efficacy of these drugs.

How important is data transparency and privacy to reliable statistical analysis?

Privacy is very important, especially to ensure consistency with Health Insurance Portability and Accountability Act (HIPAA) privacy rules that protect patients. It’s a subject that Symphony takes very seriously. It’s our responsibility to ensure that a patient’s privacy is protected in every way when doing analytics or creating datasets.

We need to ensure that we limit the risk of re-identifying an individual and that we provide the right insights, while also protecting patient privacy and data transparency. It’s important to understand how data is collected from vendors, suppliers, and public sources, and what procedures they follow to ensure the data is de-identified and protected.

Furthermore, as we do analysis, it’s our responsibility to understand the risks associated with data elements, combining disparate datasets and the proposed research questions and analytics research plan. We always check what variables are included in the dataset, understand the drivers of increasing the risk for re-identification, and work with the privacy experts to ensure that nothing can be re-identified.
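
One way to picture how the choice of variables drives re-identification risk is to count how many records share each combination of indirectly identifying values. The sketch below uses that idea (often called k-anonymity) purely as an illustration; it is an assumption of this example, not a description of Symphony's privacy process, and the records and field names are invented.

```python
from collections import Counter

# Hypothetical de-identified records; field names and values are invented.
records = [
    {"age_band": "40-49", "zip3": "191", "sex": "F"},
    {"age_band": "40-49", "zip3": "191", "sex": "F"},
    {"age_band": "40-49", "zip3": "191", "sex": "M"},
    {"age_band": "70-79", "zip3": "040", "sex": "F"},
    {"age_band": "70-79", "zip3": "040", "sex": "F"},
]


def smallest_group_size(records, quasi_identifiers):
    """Size of the rarest combination of the given fields (the k in k-anonymity)."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(combos.values())


# With one coarse variable, every record shares its value with at least one other.
k_coarse = smallest_group_size(records, ["age_band"])
# Adding more variables makes some combinations unique, which raises the risk.
k_fine = smallest_group_size(records, ["age_band", "zip3", "sex"])
print(k_coarse, k_fine)
```

The same records become easier to single out as more variables are combined, which is why the variable list and any merging of disparate datasets deserve review with privacy experts.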

How can statistical analysis and data be used to help guide us through the COVID-19 pandemic?

In collaboration with analytics teams at PRA—medical informatics and data science—we publish biweekly reports that leverage Symphony healthcare claims data to provide insights into trends in product utilization, diagnostics, telemedicine usage, office visits, and more. This information helps to better understand what the prevailing shifts are in the healthcare industry as a result of the COVID-19 pandemic.

Since Symphony garners the COVID-19 insights from our vast healthcare claims dataset, several of our clients have come back and asked us to look at their therapeutic areas to understand what has changed since the pandemic started, to aid their decision making and product strategy. Recently, we've been seeing so much change week to week, and understanding these changes is important to informed decision making. If you look at Europe, they're seeing the highest numbers of COVID-19 diagnoses since the pandemic started. The situation is so fluid that we need to be able to guide our clients and help them understand what's going to happen next week. Predictability has definitely declined as the pandemic has continued.

It’s extremely important to understand the data behind the pandemic, know what the causality of events is, and what is driving some of the changes. We need to look at COVID-19 spread trends, which treatments and therapies are available to treat COVID-19, and why these selected treatments are being chosen for different patient groups. Are these treatments more likely to work and be efficacious? What is the survival rate for these patients? What is driving the recovery for those patients who recovered?

Data is key here, and data removes the panic from what we hear on TV. The more you understand the data, the better you can influence policy, the better you can advise your clients, and the better you can ultimately prepare for or deal with what might happen in the coming weeks. I suggest we all learn to adjust according to what the data tells us. We need to focus on what is actually being seen, versus what everyone is talking about without having a true understanding of the numbers behind the message.

What role do statistics play in the development of machine learning models?

Today, everyone talks about the data science area, advanced analytics models, machine learning models, artificial intelligence, and so forth. From my point of view, we do see differences between statistics or statistical modeling versus machine learning, but these two analysis types are somewhat related. Statistical modeling and statistics play an important part in defining research details before the machine learning models can be run.

The major difference between machine learning and statistics is the purpose. Machine learning tends to create models that can be repeated and can present high accuracy predictions. In contrast, statistics looks at inferences about the relationships between the variables used in the analysis. This is where the importance of statistical modeling or statistics comes into play. It helps design the basis of the research, so the machine learning models can help us confirm the hypothesis that we're trying to test.

For example, when we run machine learning models, we must create patient cohorts. We have to identify the right patient cohorts for the test and for the control groups to ensure robust and unbiased results and insights. Furthermore, we have to understand what the characteristics of each group are and then ensure that the variables we’re including in the datasets make sense in what we’re trying to achieve.
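
A simple way to check whether test and control cohorts have similar characteristics is to compare a variable's means relative to its spread. The sketch below computes a standardized mean difference for one variable; the ages are invented, and the choice of this particular balance measure is an assumption of the example rather than something stated in the interview.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical patient ages in each cohort; values are invented.
test_ages = [54, 61, 47, 58, 65, 52]
control_ages = [55, 60, 49, 57, 66, 53]


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled sample standard deviation."""
    pooled_sd = sqrt((variance(a) + variance(b)) / 2)
    return (mean(a) - mean(b)) / pooled_sd


smd = standardized_mean_difference(test_ages, control_ages)
print(f"Standardized mean difference for age: {smd:.3f}")
```

A commonly cited rule of thumb treats an absolute value below about 0.1 as reasonably balanced, but the right threshold depends on the study, and the same check would be repeated for each characteristic that matters.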

Once machine learning models are run, the next step, and what I always advise everyone on my team to do, is to look at what the machine learning models identified as important variables. If the important variables don't make sense in the context of the studied topic, then the question is whether the model is relevant to the objective we are trying to achieve. The machine learning model tries to optimize patterns in the data, whether they make sense or not. Statistical analysis can provide the needed interpretation for the results to ensure validity and relevance. If you can't interpret the results, how do you know that this is the right outcome?

For example, this is why we hear one day that morning coffee is good for you, and then another day that coffee is bad for you. There’s really no causality, no relationship, and no inferential analysis behind these findings. That’s the key difference. Machine learning is really a powerful tool to mine through all of the data, create highly predictive models, and create models that are highly repeatable so we can use them over and over again. Statistical analysis helps us design a research study, as well as interpret the study results.

How do data and statistics contribute to better public health?

Along with statistics, understanding the data and using robust analytical methods is central to creating or providing the basis for developing evidence-based public health policies. It can really help us if we understand what’s driving the data that gives us the insights. This way, we can achieve better social health outcomes and ultimately reduce health inequalities.

Recently, the word epidemiology has been widely used. I think it became the most-heard word worldwide in 2020. Epidemiologists ultimately became the scientists who were needed to explain what's going to happen with the COVID-19 pandemic, its spread patterns, etc. We’ve seen many analytical models presenting the COVID-19 spread, and many of them actually came true as of August and September. Again, statistics and data are driving the public health insights and recommendations. Having experts like epidemiologists, statisticians, economists, and data scientists helps us build sound public health policies because these are the individuals who truly understand the data and their underlying assumptions. They can analyze data patterns, provide the appropriate context, and then help craft policies that drive health outcomes or healthcare access for patients.

Data and statistics are the key drivers. We must know how to interpret the data in order to create the right public policy. This ensures that we don’t develop policies that ultimately decrease access to patient care or increase healthcare costs. I think that’s especially important right now in the consumer-driven environment, where everyone can self-diagnose via a simple online search. However, when developing public policy, we need to help gauge and guide consumers through the appropriate channels to ensure we can optimize their healthcare access, speed to diagnoses and treatment, and ultimately improve the patient outcomes in this ever-changing environment.

Can you speak to the importance of “good research methods” and the ethics behind them in order to prevent bad data or research getting out to the public, ensuring that the data that the public consumes is reputable?

A few years ago, there was a lot of discussion related to scientific publications and ensuring that flawed analysis and insights are not published. Publishing analyses that are flawed and not repeatable can impact the research community negatively and drive future research based on unfounded outcomes. We see a similar phenomenon in the public media, where unfounded information spreads very quickly, causing disruption and creating distrust in the media channels. As a result, we have observed how scientific publications have moved towards a more rigorous process of peer review when publishing scientific papers. The actual datasets and statistical analysis plan are now often requested as a condition of publication.

You need to truly understand where the data is coming from and what the biases are, as well as remain transparent about the data biases and caveats. That informs the design of the study, how you interpret the study, and ultimately what insights and recommendations you derive.

Secondly, it’s often recommended to have someone else validate that data. That's another component that we have seen increase in the last few years. Can someone else or another party using the same data, using the same assumptions, the same software, obtain the same results? I think the validation of study results by another party has also become a very important factor in validating and confirming the scientific outcomes. More and more data scientists are doing this to ensure their studies gain approval of the larger scientific community.
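
The idea of a second party reproducing a result from the same data, assumptions, and software can be sketched in a few lines. In the example below, a fingerprint confirms that both parties hold identical data, and a fixed random seed makes a resampling analysis repeatable. The dataset, the seed value, and the helper names `data_fingerprint` and `bootstrap_mean` are all hypothetical choices made for this illustration.

```python
import hashlib
import json
import random

# Hypothetical dataset; the values are invented.
data = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 12.7, 9.5]


def data_fingerprint(values):
    """SHA-256 of the serialized data, so both parties can confirm identical inputs."""
    return hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()


def bootstrap_mean(values, n_resamples=1000, seed=2020):
    """Average of bootstrap resample means, using a fixed seed for repeatability."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(values) for _ in values]
        means.append(sum(resample) / len(resample))
    return sum(means) / len(means)


first_run = bootstrap_mean(data)
second_run = bootstrap_mean(list(data))  # a separate copy of the same data
print(first_run == second_run)
```

If the fingerprints match and the seeded analysis is run with the same software version, the second party should obtain the same number; a mismatch points to a difference in data, assumptions, or tooling that needs to be explained.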

As a result of the increased focus on validation, I think now we’re back to concentrating on understanding the data and assumptions, and on ensuring that the insights and recommendations are valid for the particular data and for how they are extrapolated to the larger population. In places where it's truly needed, not only do you need a robust peer review, but also validation by another party, using the same assumptions, the same data, and the same software to validate whether the outcome is correct.

PRA works in conjunction with Symphony Health to use data to create solutions for virtually any business that needs to answer questions about any step of the patient health journey. Together, we develop and execute custom research projects to uncover deep insight into patient and physician decision drivers, as well as health economics and outcomes research crucial to ensuring patient access and commercial success.

Learn more about our work with Symphony Health.
