Artificial intelligence and machine learning are changing how early-stage clinical trials are run. These technologies can speed up trials, so that treatments reach late-stage testing and the general market faster, while also making the trials safer.
One way AI tools protect patients is with synthetic data, using either simulated data or obscured data to protect the anonymity of trial participants.
Why is synthetic data used and why is it so valuable?
Synthetic data gives researchers access to information they otherwise wouldn’t have. The information is private and complete, and it can be reviewed in great detail with no harm to patients. This is particularly useful in early-stage testing, when trial leaders are focused on dosage and safety.
“Researchers, innovators, entrepreneurs and policymakers all are creating synthetic patient records to answer a number of important healthcare questions,” says Robert Lieberthal, principal health economist at MITRE. “The types of interoperable, complete patient records that exist in synthetic data sources rarely exist in the real world...breaking the silos that exist between different provider groups.”
More researchers are willing to use synthetic data as its quality and value are proven within the industry. Kalyan Veeramachaneni, principal research scientist at MIT, led a study that compared analyses run on synthetic data against the same analyses run on real data. His team hired 39 freelance data scientists to conduct 15 tests across five datasets. There was no significant performance difference in 11 out of 15 tests, or roughly 73 percent of the time. Other studies have shown even closer results, suggesting that synthetic data can be relied upon for this kind of analysis.
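The share of tests with no significant difference can be checked with a few lines of arithmetic. This is only a minimal sketch: the variable names are illustrative, and the two counts are the ones quoted above.

```python
from fractions import Fraction

# Counts quoted in the study summary above.
tests_total = 15
tests_no_significant_difference = 11

# Exact share as a fraction, then as a rounded percentage.
share = Fraction(tests_no_significant_difference, tests_total)
percent = round(float(share) * 100, 1)

print(f"{tests_no_significant_difference}/{tests_total} = {percent}%")
```

Eleven of 15 works out to a little over 73 percent, so any rounded figure for this result is best read as approximate.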
As more clinical trial professionals use synthetic data, they see its value. It’s no longer up to early adopters and industry pioneers to show that it can be used.
One of the main benefits of synthetic data is the patient privacy that comes with it. Because the records are not tied to real, identifiable people, the risk of patients having their health information exposed or stolen drops sharply.
“All patient information must be redacted due to its sensitive nature,” writes Jasmine Lee at B2B software and services review platform G2. “Compromising patient security and confidentiality by failing to remove all identifying attributes goes directly against healthcare compliance guidelines such as the Health Insurance Portability and Accountability Act (HIPAA), Health Information Technology for Economic and Clinical Health Act (HITECH), and General Data Protection Regulation (GDPR).”
Some synthetic data sets already have the information redacted, while others are completely artificially manufactured. This allows pharmaceutical companies and clinical research organizations (CROs) to focus on the data itself, rather than spending resources on privacy management.
In most cases, researchers can sidestep privacy concerns because synthetic data is not connected to real individuals. Additionally, researchers can use these large data sets to research specific diseases instead of relying on a much smaller pool of patient information.
Furthermore, the use of synthetic data means researchers can push the boundaries of what they study and how they evaluate the information. They can try new forms of analysis or look at the data in different ways without worrying about privacy violations. The result is better early-stage clinical trials, which in turn can bring the best possible treatments to market.
Along with improving the clinical trial process, synthetic data also has the power to speed up early-stage testing.
“Long internal governance and sharing processes prevent data from flowing seamlessly,” the team at data anonymization solutions provider Statice writes. “When medical researchers request data, processing these queries can take weeks and not even return the desired data points. Unfortunately, in the context of crisis management, slow data access is even more detrimental.”
The COVID-19 pandemic highlights how quickly researchers need to move when conducting clinical trials. Researchers can’t spend months just trying to get their hands on a quality dataset. Such delays not only slow early-stage trials; they can even prevent researchers from receiving the resources they need when they advance to late-stage testing.
Additionally, synthetic data is flexible because clinical trial experts aren’t limited to one option. Fully synthetic data does not contain any original data. It is generated from patterns in real data sets, but no record corresponds to a real patient, so there is no one to identify. Partially synthetic data, by contrast, replaces only the sensitive values with synthetic ones. Because some true values remain in the dataset, some risk of disclosure remains as well.
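The difference between the two options can be illustrated with a toy example. The sketch below is not how any production tool works: the records, field names (`patient_id`, `age`, `dose_mg`, `alt_level`) and helper functions are all hypothetical, and the fully synthetic generator samples each column independently, which a real generator would not do because it discards relationships between fields.

```python
import random
import statistics

# Hypothetical toy records; every field name and value is invented for illustration.
real_records = [
    {"patient_id": "P001", "age": 54, "dose_mg": 10, "alt_level": 31.0},
    {"patient_id": "P002", "age": 61, "dose_mg": 20, "alt_level": 38.5},
    {"patient_id": "P003", "age": 47, "dose_mg": 10, "alt_level": 29.2},
    {"patient_id": "P004", "age": 68, "dose_mg": 40, "alt_level": 52.1},
    {"patient_id": "P005", "age": 59, "dose_mg": 20, "alt_level": 41.7},
]


def partially_synthetic(records, rng):
    """Replace only the sensitive fields; clinical measurements stay real."""
    ages = [r["age"] for r in records]
    result = []
    for i, record in enumerate(records):
        masked = dict(record)
        masked["patient_id"] = f"SYN{i:03d}"
        masked["age"] = rng.randint(min(ages), max(ages))
        result.append(masked)
    return result


def fully_synthetic(records, n, rng):
    """Draw every field from simple per-column summaries of the real data.

    Columns are sampled independently here, a deliberate simplification.
    """
    ages = [r["age"] for r in records]
    doses = sorted({r["dose_mg"] for r in records})
    alt = [r["alt_level"] for r in records]
    alt_mean, alt_sd = statistics.mean(alt), statistics.stdev(alt)
    return [
        {
            "patient_id": f"SYN{i:03d}",
            "age": rng.randint(min(ages), max(ages)),
            "dose_mg": rng.choice(doses),
            "alt_level": round(rng.gauss(alt_mean, alt_sd), 1),
        }
        for i in range(n)
    ]


rng = random.Random(7)  # fixed seed so the sketch is repeatable
partial = partially_synthetic(real_records, rng)
full = fully_synthetic(real_records, 100, rng)
print(partial[0])
print(full[0])
```

In the partially synthetic output, the dose and lab values are still the real measurements, which is why some disclosure risk remains. In the fully synthetic output, no row is copied from a real patient.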
Early-phase trials can benefit from both types of data sets, depending on what the clinicians are studying and how they plan to present their findings.
Some industry leaders prefer the use of synthetic control arms to reduce clinical trial costs. Instead of creating a control group that doesn’t receive the treatment, teams can use synthetic data for the control aspect to simulate real-world experiences.
“Fear of being assigned to placebo is one of the top reasons patients choose not to participate in clinical trials,” writes Jen Goldsack, executive director at Digital Medicine Society. “Using a synthetic control arm instead of a standard control arm ensures that all participants receive the active treatment, eliminating concerns about treatment assignment.”
The COVID-19 pandemic again offers an example. Thousands of people wanted to join clinical trials to get the vaccine, but some held back for fear of receiving the placebo. Eliminating the risk of getting a placebo can further speed up the clinical trial process because teams can recruit participants successfully at the outset.
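The synthetic control arm idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not a validated method: the treated outcomes, the historical mean and standard deviation, and the `simulate_control_arm` helper are all hypothetical, and real synthetic control arms are typically built from carefully matched patient-level data rather than a single normal distribution.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical outcomes (for example, change in a biomarker) for trial
# participants, all of whom received the active treatment.
treated_outcomes = [-8.1, -6.4, -9.0, -7.2, -5.9, -8.6, -7.7, -6.8]

# Hypothetical summary of how comparable untreated patients fared historically.
HIST_CONTROL_MEAN = -2.0
HIST_CONTROL_SD = 3.0


def simulate_control_arm(n, mean, sd, rng):
    """Simulate outcomes for a control group that was never enrolled."""
    return [rng.gauss(mean, sd) for _ in range(n)]


synthetic_control = simulate_control_arm(200, HIST_CONTROL_MEAN, HIST_CONTROL_SD, rng)

# Compare the real treated arm with the simulated control arm.
effect = statistics.mean(treated_outcomes) - statistics.mean(synthetic_control)
print(f"Estimated treatment effect: {effect:.2f}")
```

No participant is assigned to placebo in this setup; the comparison group exists only as simulated data.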
One of the best ways to understand the role of synthetic data in early-stage clinical trials is to learn about the innovators and the solutions they are trying to create. These AI and medical professionals work together to overcome some of the biggest ethical and logistical challenges in synthetic data.
For example, some data professionals are working to create central banks where researchers can pull data and simulate their early-phase tests. One bank, called Simulacrum, imitates data from England’s National Cancer Registration and Analysis Service. The data is synthetic and entirely artificial, but the system presents the data in the same way doctors would review real information. The creation of a synthetic data bank makes the information accessible while also streamlining the datasets that medical research teams have to work with.
Other teams are pushing the limits of what machine learning tools can do. Mihaela van der Schaar, professor of machine learning, AI, and medicine at the University of Cambridge, describes how she uses generative adversarial networks (GANs) to improve synthetic data while protecting patient data.
With the GAN framework, a generator creates synthetic data samples and a discriminator tries to tell them apart from the real samples. The two work against each other: the generator keeps adjusting its output to fool the discriminator, while the discriminator looks for patterns that give the synthetic samples away. Through this contest, the synthetic data becomes progressively more realistic.
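The generator-versus-discriminator loop can be shown with a very small toy. The sketch below is not the approach used by any of the researchers mentioned here; it assumes a one-dimensional "measurement", a linear generator, and a logistic discriminator, with hand-written gradients. All names and parameter values are hypothetical. A discriminator this simple can only pick up differences in the average value, and GAN training in general is not guaranteed to converge.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, real_mean=4.0, real_sd=1.0, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: x_fake = a * z + b, with z ~ N(0, 1)
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)

    for _ in range(steps):
        # Discriminator step: raise D on a real sample, lower it on a fake one.
        x_real = rng.gauss(real_mean, real_sd)
        x_fake = a * rng.gauss(0.0, 1.0) + b
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        # Gradients of -[log D(x_real) + log(1 - D(x_fake))] w.r.t. w and c.
        grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
        grad_c = (d_real - 1.0) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: change a and b so the fake sample looks more real.
        z = rng.gauss(0.0, 1.0)
        d_fake = sigmoid(w * (a * z + b) + c)
        # Gradients of the non-saturating loss -log D(x_fake) w.r.t. a and b.
        grad_a = (d_fake - 1.0) * w * z
        grad_b = (d_fake - 1.0) * w
        a -= lr * grad_a
        b -= lr * grad_b

    return {"a": a, "b": b, "w": w, "c": c}


params = train_toy_gan()
print(params)
```

Each pass through the loop mirrors the description above: the discriminator is updated to separate real from generated values, then the generator is updated using the discriminator's feedback.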
The Charité Lab for Artificial Intelligence in Medicine in Berlin also uses GAN frameworks to push the limits of AI and create accessible synthetic data for healthcare professionals. The lab is made up of machine learning engineers and medical doctors who work together to extend the uses of technology in healthcare.
With each new AI tool or synthetic data bank, the healthcare industry is better able to protect patients and conduct tests in a more controlled manner.
“We’re always working with local clinicians to understand the gaps in care within communities and the innovations that can fill those gaps,” says Reg Joseph, CEO of Edmonton-based healthcare economic development organization Health City. “The goal of these projects is not only to create new business opportunities but to generate data that can inform policymakers and bring about a new way of delivering health care.”
Synthetic data is already proving its value in early-stage clinical trials. As more companies adopt this style of testing, it could be a major step toward global data standardization in clinical trials.