Douglas Fridsma, MD, PhD is chief medical informatics officer of Datavant of San Francisco, CA.
Tell me about yourself and the company.
I’m the chief medical informatics officer at Datavant. Before that, I was president and CEO of the American Medical Informatics Association. Before that, I was the chief science officer at the Office of the National Coordinator for Health IT during the Meaningful Use era, as we were trying to get electronic health record adoption.
A lot of the work I did at ONC was to set up the basic infrastructure for collecting data. The goal, for many of us who were working on these projects, was to make sure that once we collected the data, we would get rid of the lazy data. That is data that would get collected and then just sit there and not be used for population health, a learning healthcare system, or those sorts of things. That’s my history and where I come from — let’s figure out ways to make data useful for patient care and for healthcare delivery.
Describe how tokenization is performed and how the information that it enables is being used in healthcare.
A lot of data out there is fragmented. If you were to try to get your medical record, you’ve got bits of your information that might be in a claims record, some of it might be in a specialty pharmacy, and some of it might be with your primary care doctor or within a hospital in which you were seen in the emergency room. The problem is that when data is distributed like that, it’s hard to bring it all together into a longitudinal view of that particular patient’s experience in the healthcare system.
If you want to link a record from one hospital to another hospital, you have to have some kind of identifiable information. But if you are using the data for research purposes, HIPAA doesn’t allow us to release that kind of information without lots and lots of safeguards, IRB approvals, and things like that.
It is possible to strip out all of the identifiable information from the medical record — eliminating names, genders, changing birth dates from a month and date to just a year, removing addresses, maybe abstracting ZIP codes to a higher level. Datavant strips out that information and replaces it with an irreversible hash that we call a token. It’s like baking a cake — you cannot go back and get back to the original ingredients. This hash is derived from a lot of that personally identifiable information, but that hash has nothing that would point that back to the original person.
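The strip-and-hash step described above can be sketched in a few lines. This is a minimal illustration and not Datavant's actual algorithm: the keyed SHA-256 hash, the `SECRET_KEY` value, the field names, and the choice to keep a birth year and a three-digit ZIP prefix are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real system would manage keys securely.
SECRET_KEY = b"example-key-not-for-production"


def make_token(first_name: str, last_name: str, birth_date: str) -> str:
    """Derive a one-way token from normalized identifying fields."""
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, birth_date)
    )
    # A keyed hash (HMAC) cannot be reversed to recover the inputs.
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict) -> dict:
    """Drop direct identifiers, coarsen date and ZIP, and attach the token."""
    return {
        "token": make_token(
            record["first_name"], record["last_name"], record["birth_date"]
        ),
        "birth_year": record["birth_date"][:4],  # assumes ISO dates, YYYY-MM-DD
        "zip3": record["zip"][:3],  # coarser geography than a full ZIP code
        "diagnosis": record["diagnosis"],
    }


record = {
    "first_name": "Ada",
    "last_name": "Lovelace",
    "birth_date": "1975-12-10",
    "zip": "94107",
    "diagnosis": "E11.9",
}
print(deidentify(record))
```

Because the inputs are normalized before hashing, two organizations that hold the same person's details in slightly different formatting would still derive the same token, which is what makes later linkage possible.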
Datavant allows people to de-identify their data within each of their organizations. Then we have the ability to link that data back together without ever revealing a person’s name, Social Security number, or phone number. Using these tokens allows data to move in ways that protect patient privacy and that reduce the risk of re-identification.
How reliably can the process generate a token that correctly matches the same patient across multiple data sets?
We did a lot of work when I was at ONC on trying to make sure that we could optimize patient match. Patient match is determined by three things — the algorithm that you use; the kind of data that you use, whether you’re doing it based on a phone number or a name or something like that; and the quality of the data. Probably the biggest impact is making sure that you have high-quality data that can then go through this process to generate the tokens. We work with organizations to make sure that their addresses, for example, conform to the US Postal Service standards.
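To show why data quality matters so much for matching, here is a small sketch of cleaning fields before they are hashed. The suffix table is a tiny, hand-picked subset in the spirit of postal abbreviations, and the function names are hypothetical; a production pipeline would rely on a full address-standardization service.

```python
import re

# Illustrative subset of street-suffix abbreviations; not a complete standard.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "BOULEVARD": "BLVD"}


def normalize_address(raw: str) -> str:
    """Uppercase, drop punctuation, collapse whitespace, abbreviate suffixes."""
    cleaned = re.sub(r"[^\w\s]", "", raw).upper()
    return " ".join(SUFFIXES.get(word, word) for word in cleaned.split())


def normalize_phone(raw: str) -> str:
    """Keep digits only and retain the last ten (drops a leading country code)."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:]


# Two spellings of the same address now compare equal.
print(normalize_address("123 Main Street."))
print(normalize_address("  123  main st "))
print(normalize_phone("+1 (415) 555-0100"))
```

Without this step, "123 Main Street." and "123 main st" would hash to different values and the same patient would appear as two people.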
With high-quality data and the algorithms that we use to generate these tokens, our match rates can be very high. They can be almost comparable to what you would get if you had a Social Security number, the name, or all of the identifiable information. It’s quite comparable as long as you’ve gone through the process of cleaning up the data and making sure that it accurately reflects the patient’s record.
Does that raise the same challenges as in interoperability, where matching data from multiple systems then brings up the new issue of semantic interoperability, where systems represent the same data concepts differently?
You raise a really important point. Datavant can link two records together and do it in a reliable way while protecting a patient’s privacy. But suppose you have one record that has all of the diagnoses in an ICD-10 code and another one that has all the diagnoses in a SNOMED code. You’ve linked the records together and you know that it’s the same patient, but now you have semantic incompatibility between a record that was collected in ICD-10 code and another one that was collected in a SNOMED code.
That’s not part of the problem that Datavant solves. We do find, though, that in the work that the NIH has done with the N3C (the National COVID Cohort Collaborative), before they run data from everybody who is contributing data through the tokenization engine, they normalize the data to an information model that consistently represents diagnoses and consistently represents things like vaccination status. Often you can normalize the data and make it semantically consistent at each one of those sites, and then when you combine them, that data flows together much more easily.
There are ways to do it after the fact, after you’ve done the linkages, because now you might have two records that are inconsistent. The National Library of Medicine and others have ways that you can transform, say, one code into a different code to make that happen. The issue that you raise around semantic interoperability is a critical one, but it isn’t one that is solved by the process of tokenization.
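The after-the-fact code transformation mentioned above can be pictured as a crosswalk lookup. The sketch below is an assumption-laden illustration: the two code pairs in `ICD10_TO_SNOMED` are examples only and should be verified against an official map before any real use, and the record layout is hypothetical.

```python
# Illustrative crosswalk entries only; verify against an official map before use.
ICD10_TO_SNOMED = {
    "E11.9": "44054006",  # type 2 diabetes mellitus
    "I10": "59621000",    # essential hypertension
}


def to_snomed(diagnosis: dict) -> dict:
    """Return the diagnosis expressed in SNOMED, flagging codes with no mapping."""
    if diagnosis["system"] == "SNOMED":
        return dict(diagnosis)
    mapped = ICD10_TO_SNOMED.get(diagnosis["code"])
    if mapped is None:
        # Keep the original code so nothing is silently lost.
        return {**diagnosis, "unmapped": True}
    return {"system": "SNOMED", "code": mapped}


linked_patient = [
    {"system": "ICD-10", "code": "E11.9"},
    {"system": "SNOMED", "code": "44054006"},
    {"system": "ICD-10", "code": "Z99.9"},
]
print([to_snomed(d) for d in linked_patient])
```

Flagging unmapped codes rather than dropping them keeps the gap visible, since real crosswalks are rarely one-to-one.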
Life sciences, public health and particularly COVID research, and real-world evidence would seem to be good use cases. What opportunities and users do you see for tokenization?
Let me break that down into a couple of use cases that you mentioned and give you some examples of that.
One example that you mentioned was around COVID. We as a country were trying to understand COVID and who got vaccinated, and if they were vaccinated, what their outcome was compared to people who were not vaccinated. The challenge that we had is that people had their vaccinations done at the public health agencies, their primary care provider, or CVS and Walgreens. Their hospitalization or their care might be in an outpatient clinic, the emergency room, or in a hospital setting. The problem was this fragmentation issue. The only way to understand who got vaccinated, who got infected, and who got long COVID was to link together all these different data sources. It’s a tremendously complicated thing to do, particularly because you have to have identifiable information to be able to link, say, your pharmacy record with your emergency room record.
We worked with the NIH to create tokens across this ecosystem from pharmacy, public health, and most of the major medical research institutions in the country that were part of a research program at NIH. That allowed us to pull together all the data and then create data sets that basically said, here are the folks who got vaccinated. Here are the folks who got hospitalized. Here are the people who had long-term complications related to that. That has provided a lot of rich research for the folks at the NIH who are doing that.
We see other use cases in life sciences. When pharmaceutical companies want to do a clinical trial, they get consent to collect information as part of participation in a clinical study. They have identifiable information that they use for that study. But it’s important for drug safety to be able to monitor patients after they have left a clinical study to see if they have long-term follow-up or other things that may happen as part of their participation. That can be tremendously expensive. Those are called Phase 4 clinical studies.
We have found that a lot of life sciences companies are getting permission to tokenize the information of those patients and their record. Then they can find those patients at a population level, not at an individual level, to identify cohorts of patients that might, say, have an increase in their cancer risk. Or they may find that their five-year follow-up was fine, but their 10-year follow-up might be more challenging. That has been tremendously valuable within real-world evidence and using that for clinical studies in the life sciences. By creating those tokens as part of that process, they are able to do a lot more of the Phase 4 studies, which are expensive and take a long time, and to do them efficiently by using this real-world data and collecting it directly.
As this becomes increasingly relevant, we are finding that a lot of hospitals and providers are starting to see de-identified data as not just a nice-to-have, but part of a strategic approach to how they use data. For example, within a large-scale academic medical center, there are hospitals that will de-identify and tokenize these very large data sets, and they’ll have them within their institution. They provide the ability to link that data together and reduce the risk of breaches, reduce the risk of other problems, because the data has already been de-identified and can then be used for research purposes.
Other hospitals are taking a look and using de-identification to enhance the data that they already have. They might create tokens within their hospital, but use that as a way of drawing in other data, matching it into their population, and being able to do a richer analysis at a population health level because they have augmented the data with mortality data or with social determinants of health data that allows them to get a better picture of their population. Again, not to the individual patient level, but at that population level.
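As a rough sketch of that augmentation, a hospital's de-identified rows can be joined to an outside data set on the token and then summarized at the population level. The row layout, the token values, and the function names here are all hypothetical.

```python
from collections import Counter

# Hypothetical de-identified rows; tokens stand in for patient identity.
hospital_rows = [
    {"token": "tok-1", "diagnosis": "diabetes"},
    {"token": "tok-2", "diagnosis": "heart_failure"},
    {"token": "tok-3", "diagnosis": "diabetes"},
]
mortality_rows = [
    {"token": "tok-2", "death_year": 2022},
    {"token": "tok-9", "death_year": 2021},  # not one of this hospital's patients
]


def augment(rows: list, external_rows: list, field: str) -> list:
    """Left-join an external field onto the hospital rows by token."""
    lookup = {row["token"]: row[field] for row in external_rows}
    return [{**row, field: lookup.get(row["token"])} for row in rows]


def deaths_by_diagnosis(rows: list) -> Counter:
    """Population-level summary: count of deaths per diagnosis."""
    return Counter(r["diagnosis"] for r in rows if r["death_year"] is not None)


augmented = augment(hospital_rows, mortality_rows, "death_year")
print(deaths_by_diagnosis(augmented))
```

The output is an aggregate count per diagnosis, which mirrors the point that the analysis is about the population and not about any individual patient.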
Many of the providers are using this data to participate in some of these clinical studies, to be able to take their data, de-identify it, and then make it accessible to life science companies and to people who are doing research in a way that is respectful of the patient’s privacy and that prevents that lazy data. They are able to have the data that has been collected as part of their provision of care and make it be useful for other purposes that advance our understanding of how to deliver better health and healthcare.
Could tokenization be used by an EHR or other system to de-link a patient’s identity from their detailed information so that if a hacker exfiltrated their entire database, they still couldn’t connect a patient’s identity to their data?
This whole notion of being able to take two data sets potentially that have been tokenized and not be able to link them together is a fundamental part of the Datavant technology. We have probably 100 billion records and 300 million covered lives that have been tokenized using the Datavant technology. Should someone inadvertently get a copy of, say, one hospital’s tokenized data and the records from another hospital’s tokenized data, our system creates different tokens for each of those sites so that it’s impossible, even if someone were to get that information, to be able to link it together and potentially re-identify a particular patient.
If you had a list of everybody’s name, and you tokenize that and then use that to link to other data sources, as soon as you got a link, you’d say, “I know the name of this person.” We don’t allow those kinds of linkages to occur except under strict review. We also do other reviews to make sure that, even after you’ve linked the data, it is no longer re-identifiable. That’s a fundamental piece of the puzzle.
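The idea of site-specific tokens can be illustrated with a per-site keyed hash. This is only a sketch of the general technique, not a description of how Datavant implements it: the `SITE_KEYS` values, the master token, and the `site_token` helper are assumptions, and the controlled, reviewed linking step is deliberately not implemented here.

```python
import hashlib
import hmac

# Hypothetical per-site keys; each site would only ever hold its own.
SITE_KEYS = {
    "hospital_a": b"key-a-example",
    "hospital_b": b"key-b-example",
}


def site_token(master_token: str, site: str) -> str:
    """Re-key a patient's token so that each site sees a different value."""
    return hmac.new(
        SITE_KEYS[site], master_token.encode("utf-8"), hashlib.sha256
    ).hexdigest()


master = "example-master-token-for-one-patient"
token_a = site_token(master, "hospital_a")
token_b = site_token(master, "hospital_b")

# Someone holding copies of both sites' data has no shared value to join on.
stolen_a = {token_a}
stolen_b = {token_b}
print(len(stolen_a & stolen_b))
```

Under these assumptions, the same patient appears under unrelated values at each site, so possessing both data sets does not by itself allow them to be linked.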
To your second point, how does an organization reduce their liability or risk if somebody were to breach their system and get access to this data? Obviously, if you have lots and lots of research data sets that are lying around that have identifiable information, the more identifiable information you have, the greater the risk. If, however, you have those data sets that have been de-identified, but it’s still possible to link them together even within your own institution, there are organizations that use that as a way of helping mitigate the risk around research data and still make it useful to people, because it’s not as if you’ve de-identified it and now it can only be used for one purpose. You can de-identify it, but by making sure you’ve got those tokens, you can still then reassemble different kinds of data sets for different purposes as long as you’re being very careful that the risk of re-identification remains low.
If FDA receives tokenized data that requires urgent follow-up with individual patients, would it be possible for them to go back to the contributing source?
If it’s your data, if you’re a provider and you have data within your electronic health record, you can maintain a look-up table that will have the patient’s identity, your medical record number perhaps, and the token assigned to that as well. But that would be something that an individual hospital would maintain and it would never become public knowledge. So the short answer to your question is, absolutely, if the FDA said, “There’s a safety concern, and we’ve identified within this population that there are specific patients that we need to reach out to,” you can go back to the contributing hospitals and you can ask them that question: “We have some folks, here are their tokens, can you help us identify who they are?” If that organization has maintained that look-up table, then yes, we can get back to those things for those safety needs that the FDA or others might have. That look-up is not something that Datavant does. That would be something that would be within the purview of the owners of the data.
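A look-up table of that kind could be as simple as the sketch below, kept entirely inside the contributing hospital. The `TokenLookup` class, its method names, and the sample values are hypothetical.

```python
class TokenLookup:
    """Privately held map from token to medical record number (MRN)."""

    def __init__(self) -> None:
        self._by_token = {}

    def register(self, token: str, mrn: str) -> None:
        """Record the token assigned to a patient when the data is de-identified."""
        self._by_token[token] = mrn

    def resolve(self, tokens: list) -> dict:
        """Return MRNs for the requested tokens that belong to this hospital."""
        return {t: self._by_token[t] for t in tokens if t in self._by_token}


hospital_lookup = TokenLookup()
hospital_lookup.register("tok-1", "MRN-0001")
hospital_lookup.register("tok-2", "MRN-0002")

# A safety request arrives with tokens; only known ones resolve.
print(hospital_lookup.resolve(["tok-2", "tok-7"]))
```

Tokens the hospital never issued simply do not resolve, so a request reveals nothing about patients held elsewhere.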
Is there a consistent de-identification method that is being used by all these companies, EHR vendors, and even providers themselves who are selling de-identified patient data?
We take maintaining the de-identification of the data pretty seriously. We provide the ability to remove the PHI and to add in the tokens. But you can imagine, you might have one dataset that is perfectly de-identified and another dataset that is perfectly de-identified, but when you combine them, you increase the risk of re-identification.
Suppose the first dataset has specific diagnostic information and the second dataset has specific geographic information. You combine those two and you might say, we have a geographic area in which there’s only a single diagnosis of this particular disease. That becomes highly re-identifiable if somebody connects some of the dots. De-identification, in and of itself, doesn’t necessarily mean that it can’t be re-identified when combined.
For folks who have complex data or complex linkages, we always recommend expert determination, which is a statistical approach to analyzing the risk of re-identification. You can run a series of algorithms across the dataset that can tell you that you have too much geographic specificity or diagnostic specificity. Given the kind of study that you’re trying to do, maybe we need to aggregate this at a less granular geographic area so that you can still ask the questions that you want about the details of a particular diagnosis. That expert determination is a way of assuring, even if the data has been de-identified or linked to other data sources, that you remain compliant and that the risk of re-identification remains low with those datasets.
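One simple check in this spirit is to count how many records share each combination of potentially identifying fields and flag the combinations that are too rare. The sketch below is a basic group-size check (often called k-anonymity), which is far simpler than a full expert determination; the threshold `k`, the field names, and the sample rows are assumptions.

```python
from collections import Counter

rows = [
    {"zip": "94107", "diagnosis": "rare_disease"},
    {"zip": "94110", "diagnosis": "rare_disease"},
    {"zip": "94107", "diagnosis": "diabetes"},
    {"zip": "94107", "diagnosis": "diabetes"},
    {"zip": "94110", "diagnosis": "diabetes"},
    {"zip": "94110", "diagnosis": "diabetes"},
]


def small_groups(records: list, fields: list, k: int) -> dict:
    """Return field combinations shared by fewer than k records."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


def generalize_zip(records: list, digits: int) -> list:
    """Coarsen geography by keeping only the leading ZIP digits."""
    return [{**r, "zip": r["zip"][:digits]} for r in records]


# Full ZIP codes leave single-patient groups for the rare diagnosis.
print(small_groups(rows, ["zip", "diagnosis"], k=2))
# Aggregating to a three-digit area removes them while keeping the diagnosis detail.
print(small_groups(generalize_zip(rows, 3), ["zip", "diagnosis"], k=2))
```

This mirrors the trade-off described above: geographic detail is given up so that the diagnostic detail the study needs can be kept.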
What kind of expert performs the expert determination?
There aren’t a lot of rules out there around this. A provision within HIPAA says that expert determination is the statistical approach that has a low-to-no risk of re-identification. Typically, you have academicians who are doing expert determination. It’s really about controlling the release of information in a way that has statistical controls around it. There are companies that do this.
Within Datavant, we have a firewalled relationship with a company, Mirador Analytics, that does this expert determination. They work essentially independently when it comes to the expert determination effort. But it’s offered as a service so that people who are doing this tokenization and then linking have the ability to then, in an efficient manner, determine whether there is a risk of re-identification. There’s a whole host of folks out there, from academicians who have hung out a shingle and do a good job of this, to an organization like Datavant that provides that as a service to folks who use our tokens.
You’ve seen healthcare grow data-rich going back to your days working on Meaningful Use. What issues remain on the table for using the wealth of data that is suddenly available?
The Institute of Medicine had a series of articles going back 10 or 15 years (I think it predates some of the Meaningful Use work I did at ONC and has continued since then) describing this notion of the learning healthcare system. To me, that is a societal goal that I would love to see, where every interaction that a patient has with our healthcare system becomes an opportunity to learn how to take care of the next patient, and the next patient after that, in a better way.
There’s a whole host of problems that we have to overcome to get there. One of them that Datavant is addressing is that when your data gets fragmented and you want to get that longitudinal record, is there a way you can do that that preserves a patient’s privacy?
We have got lots and lots of regulatory frameworks in which your data is used. If you are a student and download your student healthcare record, combine it with your electronic health record information, download it to your Apple Watch, and then use that information on your Apple Watch to support a clinical trial, you will have traversed five different regulatory frameworks. People tend to think that if it’s health data, it must be covered by HIPAA, and that’s not the case. For the data that is in an app or that is part of a commercial venture, it’s that 80 pages of stuff that you just scroll through and you click OK because you want to be able to use the app that defines what they can do with your data. One of the things that we’re going to have to address is getting a consistent way in which we address privacy.
The last thing I’ll say about that is that there are some concerns that data outside of the healthcare environment may need additional protections that the FTC or the Common Rule doesn’t necessarily cover, so we are seeing a lot of states that are starting to come up with their own privacy rules about how health data gets managed. We run the risk of having inconsistent definitions of what de-identification and expert determination are. That’s going to create a tremendous burden on the industry, and it’s going to create potential holes through which patients’ privacy could be compromised.
As we begin to solve these technical problems, other kinds of problems come up. Keeping consistency across all of the different states, as well as integrating the different frameworks that we have, even at the federal level, becomes important, because if we’re going to use data in this learning healthcare system, we need to have consistent, reliable, and effective means of making sure that patients’ privacy is protected.