Insights on Using Synthetic Data for Clinical Trial Leaders

Clinical trial research has both inherent limitations and constraints imposed by regulations.

Data is at the heart of those limitations. In some cases, the data needed to test a hypothesis might simply not exist, or be insufficient. In other cases, access to crucial data could compromise a patient’s privacy.


Synthetic data has emerged as a potential workaround for these challenges. It is a broad concept with applications in a variety of fields; this article examines its potential in clinical trials.

What Is Synthetic Data?

When clinical researchers speak of synthetic data, they are typically referring to data that has been generated from an original data set.

The original data reflects measurements of actual events, but it may contain identifying information about the people observed. In such cases, researchers can apply an algorithm to the original data to generate a synthetic data set that preserves the same patterns without revealing the identities of the research subjects. This is useful when researchers need to query health data while maintaining patient confidentiality.
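To make the idea concrete, here is a minimal sketch of one very simple way to generate synthetic records: fit a two-variable normal distribution to the original data, then draw new records from it. This is an illustration only, not the method used by any vendor or study mentioned in this article. The records, column meanings and function names are all hypothetical, and a parametric model this simple offers no formal privacy guarantee.

```python
import math
import random
import statistics


def fit_bivariate_normal(records):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in records) / (len(records) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n new (x, y) pairs from the fitted distribution."""
    mx, my, sx, sy, rho = params
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1 - rho ** 2))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)
        out.append((x, y))
    return out


# Hypothetical "original" records: (age in years, systolic blood pressure in mmHg).
original = [(34, 118), (45, 125), (52, 131), (61, 140),
            (67, 138), (73, 150), (58, 135), (40, 121)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
params = fit_bivariate_normal(original)
synthetic = sample_synthetic(params, 1000, rng)
synthetic_params = fit_bivariate_normal(synthetic)

print("original  mean age, correlation:", round(params[0], 1), round(params[4], 2))
print("synthetic mean age, correlation:", round(synthetic_params[0], 1), round(synthetic_params[4], 2))
```

None of the synthetic rows corresponds to a real person, yet the overall pattern (older patients tending to have higher blood pressure) carries over. Production tools typically use far richer models, but the principle of learning patterns and then sampling new records is the same.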

The team at U.K.-based data-provisioning startup Synthesized draws clear distinctions between synthetic data and two related concepts: anonymized data, where there is a one-to-one relationship between the original and new data sets, and artificial data, which is generated by modeling the probability of an arbitrary sample of the original data.

Synthetic data, by contrast, does not have the one-to-one relationship with the original data, but “general tendencies like global statistical patterns of synthetic data are nonetheless informed by it,” they write. “Deep learning techniques can be leveraged to improve the faithfulness of generated data.”

As the team at AI Multiple writes, synthetic data mimics real-life events so that researchers can test hypotheses and develop ideas against that baseline. This is useful, they note, when there is insufficient real (or accessible) data, or when researchers need to simulate novel conditions.


How Can Synthetic Data Be Used in Clinical Trials?

Data that can mimic real-life conditions, simulate novel events and anonymize patient information has some important potential applications in clinical trials.

Identifying Trends

For one thing, synthetic data could help researchers identify broad trends. Anat Reiner Benaim, Ph.D., from the Rambam Health Care Campus in Haifa, Israel, was the first author on a study published in February 2020 that tested this capability. In the study, the researchers produced various synthetic data sets from the original data of five ongoing studies. Those studies examined topics as varied as proton pump inhibitor prescriptions and blood urea levels among discharged patients who had been treated for acute decompensated heart failure.

In their comparisons, the researchers found that analyses of the synthetic data generally predicted the results obtained from the real data. “When the number of patients was large relative to the number of variables used, highly accurate and strongly consistent results were observed between synthetic and real data,” Benaim et al. write. “For studies based on smaller populations that accounted for confounders and modifiers by multivariate models, predictions were of moderate accuracy, yet clear trends were correctly observed.”

Creating Synthetic Control Groups

Because synthetic data can mimic trends in real-world data, it can be useful in creating control groups for trials related to rare or novel diseases for which there is limited existing data.

“Many clinical trials on rare diseases are conducted with very few patients, translating to insufficient statistical power, or are performed as single-arm trials that make it difficult to compare against other therapeutic options without synthetic control methods,” first author Kristian Thorlund and fellow researchers write in a 2020 paper.

The authors note, however, that there is much more research needed into what standards should be created and applied to synthetic controls, as this is an emerging practice.

It’s a practice with momentum on its side, though, as creating synthetic control groups could save researchers significant time and money.

“Imagine a trial that needs to [include] 500 participants in the treatment arm in order to demonstrate the effectiveness of a new therapy,” writes Jennifer Goldsack, co-founder and executive director of the Digital Medicine Society. “Instead of having to recruit 1,000 patients — 500 for the active arm, 500 for the control arm — only 500 participants need to be recruited when a synthetic control arm is employed.”
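The arithmetic in Goldsack's example can be sketched in a few lines. The function name and the control_ratio parameter are hypothetical, introduced here only for illustration; the sketch assumes a simple two-arm design and ignores dropouts, screening failures and any statistical adjustments a real trial would need.

```python
def participants_to_recruit(treatment_arm_size, synthetic_control=False, control_ratio=1.0):
    """Return how many participants must be recruited for a two-arm trial.

    control_ratio is the size of the control arm relative to the treatment arm
    (1.0 means equal arms). With a synthetic control arm, no control
    participants are recruited.
    """
    if treatment_arm_size < 0 or control_ratio < 0:
        raise ValueError("sizes and ratios must be non-negative")
    control_arm_size = 0 if synthetic_control else round(treatment_arm_size * control_ratio)
    return treatment_arm_size + control_arm_size


conventional = participants_to_recruit(500)
with_synthetic = participants_to_recruit(500, synthetic_control=True)
print(conventional, with_synthetic)  # 1000 500
```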

Making Data More Shareable

Another key benefit to clinical researchers is that because synthetic data sets don’t violate patient privacy laws, they can be shared more easily.

“Because synthetic data contain no protected health information, the datasets can be shared freely among investigators or those in industry, without raising patient privacy concerns,” first author Randi Foraker, Ph.D., M.A., from the Washington University School of Medicine and fellow researchers write in the Journal of the American College of Cardiology.

“In addition, research conducted using synthetic derivatives does not require institutional review board approval.”

Harvard researchers Gaurav Luthria and Qingbo Wang note that this is important because many clinical trials enroll patients who “are more susceptible to reidentification as well as discrimination because of their serious health conditions.”

Luthria and Wang are exploring the frontiers of synthetic data’s place in clinical research. They have built a novel system that lets researchers query original data sets to test hypotheses while keeping the patients in question anonymous. Synthetic data can then be generated from those queries to explore the research further and develop models.

The U.S. government is also interested in making synthetic data sets more available to researchers. In 2019, the Office of the National Coordinator for Health Information Technology co-launched an open-source project, Synthea, designed to produce synthetic health records specifically to help with opioid, pediatric and complex-care research.


What Are the Limitations of Using Synthetic Data in a Clinical Trial?

The potential benefits of synthetic data notwithstanding, challenges remain when trying to use this data in clinical research settings.

Synthetic data, it turns out, isn’t always reliable in modeling outcomes. In a 2019 article for BMC Medical Informatics and Decision Making, researchers Junqiao Chen, David Chun, Milesh Patel, Epson Chiang and Jesse James found that Synthea’s predictions didn’t always line up closely enough with real-world data.

The researchers used Synthea to generate a population of 1.2 million residents of Massachusetts. Those synthetic residents mirrored the demographics, social determinants and conditions one might expect from a sample of one quarter of Massachusetts citizens. Then, they tested that data against real-world rates for four health care quality measures:

  • Colorectal cancer screening
  • COPD 30-day mortality
  • Rate of complications after a hip or knee replacement
  • Controlling high blood pressure

The researchers found Synthea’s data greatly underestimated deaths from COPD exacerbation and complications after a hip or knee replacement. “Synthea is quite reliable in modeling demographics and probabilities of services being offered in an average healthcare setting,” they write. “However, its capabilities to model heterogeneous health outcomes post services are limited.”

Further, Foraker et al. caution that synthetic data could struggle to model the effects of novel therapies. “[W]hereas synthetic models derived from existing datasets may replicate certain general trends of the dataset, they may not necessarily be able to predict specific trends within a dataset (e.g., all-cause death vs. cardiovascular death),” they write.

“Although this limitation remains theoretical at present, it may be problematic with respect to using synthetic datasets to evaluate novel therapeutics.”

How Should Clinical Trials Integrate Synthetic Data?

Synthea co-creator Jason Walonoski tells Advisory Board’s Andrew Rebhan that synthetic data should not serve as a stand-in for real-world data, which is required for clinical discovery and the verification of results.

At this stage in its use, synthetic data is being deployed to connect research and accelerate innovation. An example of this in action comes from the Israel-based company MDClone, which has created a worldwide network of healthcare partners for data sharing. The company says this model of collaboration can reduce discovery cycles from months to a matter of hours, particularly because probing synthetic data requires no approval from review boards.

Granted, building these kinds of networks on top of the existing clinical trial research infrastructure raises new questions. As Dr. Ulrik Kristensen of Signify Research points out in an article for HIT Consultant, an entire ecosystem of data vendors is rising around clinical trials. It remains an open question how this ecosystem of providers will integrate, and what workflows and chains of command will emerge from the cooperation.

Still, synthetic data has a key role to play in establishing baselines, facilitating collaboration and testing researchers’ early hypotheses. Clinical research directors around the world must be ready to integrate this kind of data into their own work in the coming years.
