MINING EHR DATA TO CREATE A MORE COMPLETE PICTURE OF THE PATIENT
- With the addition of new data streams, the EHR is becoming a deeper reservoir holding more detailed longitudinal data.
- The ability to mine EHRs for new details on phenotype, habits, environment and the like is upending the notion that small sample sizes are unreliable – a potential boost to rare disease R&D.
- So what? As the new data streams, largely funneled through the EHR, increasingly influence clinical practice as well as trials, data science is taking on a new operational and cultural role within pharma.
In many ways, the electronic health record (EHR) remains the central tool for capturing the plethora of new streams of data bubbling through health care. Phenotype measures, patient-reported outcomes (often acquired using digital technology), information on habits and environment, baseline genomic data, claims data and other information increasingly funneled into the EHR are adding to a more complete picture of the patient.
Pharma and high-tech companies are taking notice: witness recent events such as Apple’s announced intention to make health records portable via its Apple Watch and Roche's blockbuster $1.9 billion acquisition of Flatiron Health Inc., which extracts oncology data from the EHR and provides specialized oncology-focused EHR systems. Much of this innovation is aimed at making randomized, prospective clinical trials more efficient – or even replacing them altogether for supplemental indications and other filings for expanding access. The success of these endeavors ultimately will be measured by the ability of data science generally to favorably influence clinical care. This is happening already, at a rapidly accelerating pace and with growing enthusiasm.
Many of the new data streams are analyzed using computational methods, and they frequently substitute for data from prospective clinical trials. Because of the resource demands of conducting full-blown trials, it would be almost impossible to keep up with the pace of discovery without retrospectively focused data science, says Nicholas Tatonetti, PhD, of Columbia University School of Medicine. Tatonetti’s group of computational biologists has tapped into the FDA’s Adverse Event Reporting System (FAERS) and, combining that with their institution’s EHRs covering nearly 400,000 patients, last year identified a pair of drugs that, when taken together, can prolong the QT interval (see Exhibit 1). As early as 2011, his group was mining FAERS to show that use of the drugs pravastatin and paroxetine in combination was associated with hyperglycemia.
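Signal detection in spontaneous-report databases such as FAERS is commonly done with disproportionality statistics. As a minimal sketch – not the Tatonetti lab’s actual pipeline, and using entirely made-up report counts – a proportional reporting ratio (PRR) compares the event rate among reports mentioning a drug (or drug pair) with the rate in the rest of the database:

```python
# Sketch: proportional reporting ratio (PRR), a common disproportionality
# measure for surfacing adverse-event signals in spontaneous-report data
# such as FAERS. Illustrative toy with hypothetical counts only.

def prr(a, b, c, d):
    """PRR from a 2x2 contingency table of report counts:
    a: reports with the drug (pair) AND the event
    b: reports with the drug (pair) WITHOUT the event
    c: reports without the drug but WITH the event
    d: reports without the drug and without the event
    """
    rate_drug = a / (a + b)    # event rate among exposed reports
    rate_other = c / (c + d)   # event rate among all other reports
    return rate_drug / rate_other

# Hypothetical counts: QT-prolongation reports among patients on a drug
# pair versus the rest of the database.
signal = prr(a=30, b=970, c=500, d=98500)
print(round(signal, 2))  # PRR > 2 is a common screening threshold
```

Real signal-detection work layers corrections for confounding and multiple comparisons on top of such a ratio; this shows only the core contingency-table idea.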
“Mostly, we are trying to re-extract data that has been entered by different practitioners and providers,” he says. The group uses natural language processing and text processing, and experiments with autoencoders and the like, to mine these notes for the variables they want to capture.
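As a minimal illustration of that re-extraction idea – far simpler than the NLP and autoencoder approaches the lab actually experiments with, and using a hypothetical note and pattern – a regular expression can pull a structured variable such as a QTc measurement out of free text:

```python
import re

# Toy sketch: extracting a structured variable from free-text clinical
# notes with a regular expression. The note text and pattern here are
# hypothetical examples, not a production clinical NLP pipeline.

QTC_PATTERN = re.compile(r"QTc\s*(?:of|=|:)?\s*(\d{3})\s*ms", re.IGNORECASE)

def extract_qtc(note):
    """Return the QTc interval (in ms) mentioned in a note, or None."""
    match = QTC_PATTERN.search(note)
    return int(match.group(1)) if match else None

note = "ECG today shows sinus rhythm, QTc of 487 ms; continue monitoring."
print(extract_qtc(note))  # -> 487
```

Production systems must also handle negation, abbreviations, and the inconsistent phrasing of different practitioners, which is where the heavier NLP machinery comes in.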
Tatonetti says that with more information of various types available, data scientists can now analyze smaller samplings of patients and show a clinical effect. That “completely turns around” the usual view of data science, he says. “We have taught the public some easy tricks to evaluate scientific results, including being skeptical of small sample sizes and small amounts of data. That guideline is no longer true.” One benefit of a small sample size is that it forces focus on effect size rather than statistical significance.
The new data streams also enable a tighter focus on small contextual subgroups, which in turn allows for more precise predictions tied to things like family background or geographic location. (Just last month, the Tatonetti lab showed how researchers could use familial contact information from patient emergency contact forms housed in EHRs to compute disease heritability estimates for hundreds of disease phenotypes.) “Highly focused studies will be very effective,” Tatonetti says, for example to find a place for a failed Phase III drug in the small percentage of patients who respond.
In some ways, small data is “pointing us back to big data,” adds Eric Perakslis, PhD, CSO of Datavant and a former Chief Information Officer and Chief Scientist at the FDA as well as SVP of Bioinformatics at Takeda Pharmaceutical Co. Ltd. In studying rare diseases, for example, “we are learning that they are boundary conditions to larger disease populations,” he says, “like when you do a jigsaw puzzle and you start with the flat edges first.”
For currently undiagnosed diseases, the goal may be to find a second patient and correlate that patient’s phenotype with the first such patient identified. To do this, clinical researchers work from the first patient’s information, using what’s known about their phenotype and sociology. They put the whole phenotype up, Perakslis explains, hoping to find similarities – perhaps from low muscle tone or weight loss combined with several other factors – that will link the patients. “You are seeing an extensive amount of experimentation, especially at the higher end academic medical centers, across a spectrum of diseases,” he says.
Tapping into patient-reported outcomes (PROs) is another important strategy for making sense of small sample sizes. In some cases that can improve the odds of succeeding in a clinical trial. A patient may make it through an Alzheimer’s disease trial, for example, because a caregiver ensures they make their visits and take their medicine. “You can argue that the best indicator of success is the support network around that person,” Perakslis says. In essence, that’s a phenotype. “There are hundreds of thousands of data points about us that are not medical,” he says. Marketing companies do this all the time, he points out. “They profile us like crazy.”
Netflix and Amazon, for example, will use machine learning (the process of using data to ascertain relationships between objects) to predict whether someone who liked a certain set of movies will like another film, or whether buying something on Amazon suggests one would also want to purchase certain other things. But to make these tools applicable to health care, a different level of rigor is necessary, because the implications of generating a wrong result are much higher than when an advertiser attempts to target buyers of consumer products based on their profiles.
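A toy version of that machine-learning prediction – item-to-item similarity over a made-up ratings table, not any vendor’s actual recommender – might look like this:

```python
from math import sqrt

# Toy item-based collaborative filtering: "viewers who liked X also liked Y".
# The users, movies, and ratings below are entirely made up.

ratings = {  # user -> {movie: rating on a 1-5 scale}
    "ana":  {"A": 5, "B": 4, "C": 1},
    "ben":  {"A": 4, "B": 5, "C": 2},
    "cara": {"A": 1, "B": 2, "C": 5},
}

def similarity(m1, m2):
    """Cosine similarity between two movies' rating vectors across users."""
    v1 = [ratings[u][m1] for u in ratings]
    v2 = [ratings[u][m2] for u in ratings]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return dot / norm

# A viewer who liked movie "A" is predicted to prefer "B" over "C",
# because ratings for A and B move together across users.
print(similarity("A", "B") > similarity("A", "C"))  # -> True
```

The health care point in the text is that this same correlational machinery, applied to patients rather than shoppers, demands far stronger validation before it can inform care.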
Another use for large amounts of data is to draw insights based on an overlay of patients in many different regions: an employer wanting to understand how to best spend their health care dollars to keep their employees healthy, for example. Methodologies like machine learning can suggest other communities where they could roll out the same types of solutions – the same methodologies as Facebook and LinkedIn use to study their networks of members.
In many ways, health care providers have historically been less aware of their patients’ experiences than other verticals like retail or even financial services, and certainly hospitality, says Jonathan Slotkin, MD, of the division of applied research and clinical informatics at Geisinger Health System. “We think CRM (customer relationship management) tools will be a critical way to hardwire patient engagement,” he believes. “We and others are good at hardwiring quality, but patient engagement is where we are going to see the CRM become the digital transformative tool,” he says.
Following this line of reasoning, the EHR begins to look more like a legacy system, or table stakes, says Slotkin, especially as one goal of care is to move it out of the hospital setting where possible. “We think about data streams outside the hospital, all the bread crumbs that don’t mean much in the EHR but do mean so much: from the grocery store, clergy, social workers, insurance companies not affiliated with our system,” he says. “That sounds more like the work done by a CRM tool – a CRM re-imagined for health care where we really want to know our patients and have a 360 degree view of them.”
Consider the EHR to be a timeline – an episodic account of a patient. “The more you can add to that timeline, the more you can assess what’s going on from an outcomes perspective [by identifying] socio-economic trends that may get the patient into trouble, such as whether they are taking their meds,” says Greg Strevig, vice president, enterprise analytics at Geisinger. “If you add in those data points and look across chronic conditions, you can stack those data points on top of one another and start to see trends based on outcomes and bend the curve.” Thus, the EHR may be a cornerstone, but “what we are not [yet] seeing are the other events on the timeline that need to be added,” he says. “It will become the mechanism for the integrated delivery of care.” That will happen less through multiple databases or layers of data and more by bringing data together in one place, in a merged environment, to create a longitudinal record. “We find data outside the system, like claims data or PROs, feed it back in so the provider has it at time of service. Then use the big data platform for managing their care and reaching out to the patient proactively.”
Contrast that to the core piece of data in a clinical trial management system – the case report form. That is usually more useful for data science because it is connected with a protocol, and so creates an expectation and gauge for numbers of doses, numbers of visits, etc. As the EHR is built out more, it approaches having that kind of value to trialists.
Apple is betting that combining the deeper longitudinal information of the evolving EHR with the connectivity of its mobile health app will deliver a truly portable health record into patients’ hands. The EHR itself, however, is still largely impenetrable.
With Apple’s HealthKit enabling FHIR interfaces (Fast Healthcare Interoperability Resources, a standard developed by HL7 International for transferring clinical and administrative data) for the EHR, and growing numbers of people writing apps that allow novel measurements to go into the EHR, “we are starting to see the ability to take data from a wearable device like an Apple Watch, put it into your smart device and then, through the FHIR interfaces, upload it into the EHR,” says John Cady, Chief Data Officer and CTO at Geisinger. That data can then be correlated with other events like exercise rate and overall calorie expenditure, which could help direct the daily activities of a diabetic or obese individual.
Flatiron As Exemplar
The hefty purchase price Roche paid for Flatiron is recognition of the value of a deeply integrated dataset – extracted from the EHR – that captures the intricacies of the patient journey and approaches clinical research grade information. [See Deal] It parallels Roche’s valuing of Foundation Medicine Inc.’s tumor profiling capabilities: the pharma spent more than $1 billion for a majority stake in that company in 2015. [See Deal]
Both Flatiron and Foundation Medicine partner with multiple drug companies in addition to running commercial businesses. Having strong academic clinical relationships and access to a wider swath of data than from a single pharma is critical to their models. And unlike other genomics-based molecular diagnostics acquisitions, including Novartis AG's $440 million purchase of Genoptix Inc. in 2011, Foundation Medicine, and now presumably Flatiron, are expected to continue to thrive as independent companies and not merely captives for their parents’ drug development. [See Deal]
Much of Flatiron’s work centers on a manual extraction of unstructured data from EHRs. “There’s a lot of useful data in the EHR but you need domain expertise to extract it,” says David Shaywitz, MD, PhD, senior partner at Takeda Ventures. “It’s inelegant to have people do it, but that’s what they had to do.” Then it’s possible to apply tech expertise to enhance how you display the data, share it, dashboard it and use it. “Once you have a really good dataset, well characterized and well labeled, it can be the basis for AI going forward,” he says.
Flatiron uses EHR data to simulate contemporary control arms in clinical trials for pharma partners. “We have examples of Flatiron data and electronic health record data being able to replicate and mimic clinical trial control arms right now,” says SVP & CMO Amy Abernethy, MD, PhD. That allows drug developers to understand how patients who would be allocated to a control arm in a prospective trial would perform. “We can now perhaps imagine a world where you don’t need a control arm or only under certain scenarios or specifications,” she says.
In discussing the deal with investors during a quarterly sales update soon after the deal was announced, Roche CEO Daniel O’Day told how Flatiron data helped Roche accelerate worldwide access to its US-approved Alecensa (alectinib, for ALK-positive advanced non-small cell lung cancer) by more than a year-and-a-half in some countries by showing regulators in different countries how a control arm of a trial would perform, given the standard of care regimen in those countries. “You can’t possibly, in a Phase III clinical trial, have every different treatment regimen that might be appropriate for a reimbursement authority around the world,” he explained. (Also see "'Watch This Space' Roche Execs Say, Outlining RWE Rationale For Flatiron Buy" - Scrip, 26 Apr, 2018.)
Flatiron also takes advantage of the availability of new kinds of data to advise on how to improve study design planning, assess the impact of eligibility criteria, and get smarter about sample size by trying to think through what a final data set should look like and which data points will have very high confidence and which should be collected in different ways.
Another place where data technology is making an impact in drug development is in enabling pragmatic clinical trials, in which the study happens within routine care and data are collected in the EHR. Indeed, although it is still early days, a plethora of clinical studies are showing that it is possible to draw clinical conclusions from data housed in the EHR, which can help improve the correlation between treatments and outcomes and favorably affect patient care as well as enhance drug development. (Several of the studies listed in Exhibit 1 also highlight the methodologies in place for aligning and integrating data from different sources.)
Initiatives centered on using data science to make trials more efficient are also springing up. In the US, the NIH Health Care Systems Research Collaboratory is creating a new infrastructure for collaborative research with healthcare systems and is supporting the design and rapid execution of pragmatic clinical trial demonstration projects.
The OptumLabs Experience
By taking existing data and applying a variety of statistical techniques, clinical researchers can identify a control group and a matched treated group in such a way as to reproduce the appearance of a clinical trial, demonstrating effect sizes – and adverse event profiles – that largely overlap with those of the trial. It’s a different approach from using data as a control arm by identifying patients, randomizing them and following them with real-world evidence (RWE), as Flatiron does.
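A minimal sketch of that matching idea, using made-up patient records and matching on a single covariate (age) rather than the full propensity-score models real analyses rely on:

```python
# Toy 1:1 nearest-neighbor matching: build a comparable control group from
# an existing pool of untreated patients. Real observational studies match
# on propensity scores estimated over many covariates; here we greedily
# match on age alone, with entirely hypothetical records.

treated = [{"id": 1, "age": 62}, {"id": 2, "age": 48}]
candidates = [{"id": 10, "age": 71}, {"id": 11, "age": 63},
              {"id": 12, "age": 47}, {"id": 13, "age": 55}]

def match_controls(treated, pool):
    """Map each treated patient's id to the id of its closest-aged control."""
    pool = list(pool)  # copy so we can match without replacement
    matches = {}
    for patient in treated:
        best = min(pool, key=lambda c: abs(c["age"] - patient["age"]))
        pool.remove(best)  # each control is used at most once
        matches[patient["id"]] = best["id"]
    return matches

print(match_controls(treated, candidates))  # -> {1: 11, 2: 12}
```

After matching, researchers compare outcomes between the two groups and check covariate balance, which is what lets the comparison approximate the appearance of a randomized trial.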
These data can replicate clinical trials, showing FDA that substitutes for prospective studies are possible in some cases. For example, the OptumLabs subsidiary of UnitedHealth Group’s Optum Inc. is working with pharma partners to create a framework that can reproducibly identify the structure of RWE studies using information already captured in their databases. Along with the Multi-Regional Clinical Trials Center at Brigham and Women’s Hospital, OptumLabs has also initiated a program called OPERAND (observational patient evidence for regulatory science and understanding disease) to show how real-world data could substitute for clinical trials in a number of settings: label extensions, special populations, and other areas where a drug is being used off-label and the cost of a pivotal clinical trial in that off-label use is often not offset by the value. “If possible to put together observational research and RWE in a way acceptable to FDA, we can have another pathway to getting that label extension or other approval,” says OptumLabs CEO Paul Bleicher, MD, PhD.
Bleicher also emphasizes the importance of improving the administrative aspects of care using deep learning and data analytics. Take prior authorizations or approvals of transfers of hospital patients. “You have trained professionals spending a lot of time reviewing charts and coming up with a yes/no answer,” he says. OptumLabs and its parent UnitedHealth Group possess millions of data points from chart data as well as the decisions that were made, which they are using to create a labeled data set that can then be used to train a model for making administrative decisions. If 80-90% of those decisions were automated, the trained professionals could focus on exception management, complex cases and places where they are working at the top of their license, rather than on routine work, he says.
Data used to be an aspect of science; now data science is its own fledgling domain and needs its own structure, says Datavant’s Perakslis. Statistics, technology and linguistics are all part of that. (Alexa McCray, who built clinicaltrials.gov, is a linguist and not a biologist, he points out. “Interesting that the people at the National Library of Medicine decided clinicaltrials.gov was a linguistic problem,” he says.)
Within companies, the commercial teams have the budgets for buying large, expensive data sets. Medical teams have history and experience, along with the statisticians and analysts who are used to working with data sets, but historically they have had smaller budgets. So for the medical teams to have access to the fruits of data science, the commercial side has to see the value of it. But the commercial side doesn’t have the analysts and the clinical understanding to know how to truly derive value from the data sets. So they need access to people on the medical side – and the two groups traditionally have neither shared budgets nor talked much. It can be a Catch-22.
Moreover, there is a reason those groups don’t talk to each other: walls have been introduced between them because of requirements from the FDA. So the challenge is to figure out how to make these conversations happen while making sure that commercial targeting is not the goal of using these data sets.
That said, the understanding of the data’s value – including the value of the internal stakeholders who know how to work with it – is starting to change, Abernethy says. The data scientists who loved the data sets first championed this work, but they had neither access to funding nor power, she says. It took the involvement of more senior people to lead the commercial work.
“We found that in order for companies to start to think about how to use the data and technology Flatiron had to offer, they had to be organized differently and make decisions differently than they do now,” Abernethy says, because the lower-level champions don’t have that power to reorganize decision-making in a company. There is shifting recognition of the roles of these various folks within these companies and in some quarters, a call for redesign, from product development through to market access, to enable pharmas to more readily translate ongoing digital and analytics experimentation into bottom-line impact. (Also see "Doubling Pharma Value With Data Science" - In Vivo, 7 Mar, 2018.)
Flatiron did not get deal-making traction until it changed the sales process so that the people who were making the decisions about whether or not to buy Flatiron data and use Flatiron technology in a pharma company were at the more senior, heads-of-department organizational levels, Abernethy says. “We set up strategic alignment committees so there would be conversations across many sides of the organizations that historically don’t make decisions together.”
Mining EHR Data: Selected Publications
Source: Peer-reviewed publications
|Disease Heritability Inferred from Familial Relationships Reported in Medical Records
||Mining next-of-kin information collected via patient emergency contact forms housed in EHRs at three academic medical centers enabled the computation of heritability estimates for 500 disease phenotypes, validating the use of EHRs for genetics and disease research.
|A Large Electronic-Health-Record-Based Genome-Wide Study of Serum Lipids
||Nature Genetics (2018)
||A genome-wide association study of 94,674 ancestrally diverse Kaiser Permanente members using almost half a million longitudinal EHR-derived measurements identified a host of serum lipid level-linked SNPs, highlighting the value of longitudinal EHRs for identifying new genetic features of cholesterol and lipoprotein metabolism with implications for lipid treatment and risk of coronary heart disease.
|Use of Health Care Databases to Support Supplemental Indications of Approved Medications
||JAMA Internal Medicine (2018)
||Real-world data analyses of patients receiving routine care provided findings similar to those found in the randomized clinical trial that established a supplemental indication for telmisartan, an angiotensin receptor blocker.
|Estimation of Tumour Regression and Growth Rates During Treatment in Patients with Advanced Prostate Cancer: A Retrospective Analysis
||Lancet Oncology (2017)
||The application of mathematical models to existing clinical data allowed estimation of rates of growth and regression that provided new insights in metastatic castration-resistant prostate cancer.
|Distribution and Clinical Impact of Functional Variants in 50,726 Whole-Exome Sequences from the DiscovEHR Study
||Linking genomic data to EHR-derived clinical phenotypes identified clinical associations supporting therapeutic targets, including genes encoding drug targets for lipid lowering, and identified rare alleles not previously associated with lipid levels and other blood level traits.
|Coupling Data Mining and Laboratory Experiments to Discover Drug Interactions Causing QT Prolongation
||Tapping into the FDA’s Adverse Event Reporting System (FAERS) and combining it with their institution’s EMRs covering nearly 400,000 patients identified a pair of drugs that, when taken together, can prolong the QT interval.
|Effectiveness of Fluticasone Furoate–Vilanterol for COPD in Clinical Practice
||One of the first pragmatic clinical trials, in which the study happens within routine care and data are collected in the electronic health record. It showed that in patients with COPD and a history of exacerbations, a once-daily treatment regimen of combined fluticasone furoate and vilanterol was associated with a lower rate of exacerbations than usual care, without a greater risk of serious adverse events.
|Identification of Type 2 Diabetes Subgroups through Topological Analysis of Patient Similarity
||Science Translational Medicine (2015)
||Three distinct subgroups of type 2 diabetes were identified by applying a topology-based analysis technique to EMR-derived clinical data.
|Phenomapping for Novel Classification of Heart Failure with Preserved Ejection Fraction
||Phenotype mapping of heart failure patients using EMR data identified three distinct subgroups with different disease trajectories.
|Effectiveness of a Clinical Decision Support System for Reducing the Risk of QT Interval Prolongation in Hospitalized Patients
||Circulation: Cardiovascular Quality and Outcomes (2014)
||Using information from patients’ EHRs, a computerized clinical decision support system reduced the risk of QTc interval prolongation in hospitalized patients with torsades de pointes risk factors.