Readers Write: Harnessing the Full Potential of AI in Healthcare Requires Carefully Prepared and Clean Data
By Brian Laberge
Brian Laberge is a solutions engineer at Wolters Kluwer Health.
Artificial intelligence (AI) implementation in healthcare is gaining traction. However, messy data makes it harder to train these platforms and to uncover bias, both of which are needed for them to deliver the most impact. With 80% of healthcare data existing in unstructured formats, an extra step is often required to map that information to more structured standards so that AI algorithms or large language models (LLMs) can parse it and distill takeaways clearly and comprehensively.
As the saying goes, garbage in means garbage out with these platforms. To fully embrace large language models in healthcare and capitalize on the opportunities for AI, it’s important to acknowledge the data quality challenges that must be overcome and to follow practices for maintaining clean data so that advanced technologies can be used to best effect.
When considering the use of AI in healthcare, there are two phases to consider: the training of the technology, and the implementation and insights that will ultimately be delivered. In training, one of the biggest challenges with healthcare data in particular is consistent data quality and accuracy. With multiple standards across healthcare, and valuable information stored in unstructured fields, it can be difficult to map insights from one care setting to another and ensure that data doesn’t lose meaning in translation.
Additionally, lab or medical data often comes back with portions that are incomplete, inaccurate, or invalid, which keeps AI models from seeing the full picture. Adding further complexity, physicians often use different clinical terms to mean the same thing. All of these data quality issues can result in a hallucination, where the model perceives a pattern that doesn’t exist and produces made-up, incorrect, or misleading results. Knowing what those synonymous phrases are and being able to address them when training new models or tuning an existing large language model can help increase accuracy.
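As a rough illustration of handling synonymous phrases, the sketch below maps variant wording to a single canonical term before the data reaches a model. The `SYNONYMS` table and the function names are hypothetical and exist only for this example; a real project would draw its mappings from a curated clinical terminology rather than a hand-written dictionary.

```python
import re

# Hypothetical synonym table: each variant phrase maps to one canonical term.
# In practice this would come from a curated terminology, not a hard-coded dict.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
    "high blood pressure": "hypertension",
    "htn": "hypertension",
    "hypertension": "hypertension",
}


def normalize_term(raw):
    """Return the canonical term for a phrase, or None if it is not in the table."""
    # Lowercase, trim, and collapse repeated whitespace before the lookup.
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return SYNONYMS.get(key)


def normalize_all(phrases):
    """Split phrases into (mapped, unmapped) so unknown wording can be reviewed."""
    mapped, unmapped = [], []
    for phrase in phrases:
        term = normalize_term(phrase)
        if term is None:
            unmapped.append(phrase)
        else:
            mapped.append(term)
    return mapped, unmapped
```

Calling `normalize_all(["Heart  attack", "HTN", "chest pain"])` would map the first two phrases to their canonical terms and return "chest pain" as unmapped. Keeping the unmapped list, rather than guessing, leaves unfamiliar wording for a person to review.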
Another challenge comes from deciphering clinical notes. When you get a mix of data, these notes need to be extracted and properly codified to an industry standard. If this process cannot be completed, it’s often recommended to exclude the notes, as uncodified data will introduce noise and bias into the AI models. Excluding them, however, could represent a huge loss of insights that could be incredibly impactful for patient care and outcomes reporting.
In general, data shaped by human error, or simply by the wide variety of terminology used in healthcare, doesn’t always translate easily into a uniform standard for training AI. To address this, healthcare organizations should make sure they have tools or processes in place to assess the quality of their data, clean it, and standardize it before implementing LLMs.
Though it can be challenging to fully prepare data before training an AI model, doing so is imperative to ensure that future AI use and insights are purposeful and accurate. It can be dangerous to train an AI with messy data for a number of reasons. Missing, incomplete, or incorrect information can reduce accuracy and introduce bias, which could lead the model to make incorrect inferences that are then built into its core.
Additionally, low-quality or overly simplified data for minority populations could cause a bias to be built into the model. Race and ethnicity are often jumbled together in the data. Sometimes, because of biases within the healthcare system itself, there is less data for some groups than for others. While addressing those care gaps is a much larger discussion, remaining unaware that the data gaps exist is also dangerous.
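One way to make such gaps visible before training is to measure how much of the data each group accounts for and how often the field is simply empty. The sketch below is a minimal example under stated assumptions: records are plain dictionaries, the field name is passed in by the caller, and the group labels and the `min_share` threshold are placeholders, not recommendations.

```python
from collections import Counter


def group_shares(records, field):
    """Share of records per value of `field`; blank or absent values count as 'missing'."""
    counts = Counter()
    for record in records:
        value = (record.get(field) or "").strip().lower()
        counts[value or "missing"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def underrepresented(shares, min_share):
    """Recorded groups whose share of the data falls below `min_share`."""
    return sorted(
        value
        for value, share in shares.items()
        if value != "missing" and share < min_share
    )
```

Running `group_shares(records, "race")` and then `underrepresented(shares, 0.2)` on a training extract would list the groups with thin coverage, and the "missing" share shows how often the field was never filled in. Neither number fixes the gap, but both make it known to the team training the model.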
For example, if you are building a model to predict the most effective drug for a patient based on historical administration of various drugs, and the training data has quality issues with race, then the model is more likely to miss a situation where a drug is more effective for a particular race, which would result in a bad recommendation.
Maintaining the data, including knowing where the gaps are, and evaluating training data to address these gaps is a challenge. However, it’s essential to address this from the outset, because bias or inaccuracy in the model will make the system harder to use and, ultimately, will become intrinsic to the AI platform and its future insights.
Integrating data, particularly high-quality data, is proven to save hospitals money and reduce risks to compliance and industry standards. There are six core elements to maintaining data quality that organizations should consider when preparing to implement AI tools:
- Accuracy is important in reflecting the true outcomes of healthcare.
- Validity assesses the appropriateness of the data to support conclusions.
- Data integrity ensures the reliability of the data.
- Having complete data helps to identify any possible gaps within the data set.
- Consistency is important to maintain uniformity across the set.
- Timely data helps to harness the full potential of the data for meaningful actions.
All of these qualities will strengthen the data and create an easier AI implementation with less room for error.
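Some of these elements can be scored directly from the records themselves. The sketch below covers three of the six (completeness, timeliness, and consistency); accuracy, validity, and integrity generally need reference data or source-system checks and are left out. The record layout, the field names, the one-year default window, and the choice of "most common unit per code" as the consistency rule are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical schema for a lab record stored as a dict.
REQUIRED_FIELDS = ("patient_id", "code", "value", "unit", "collected_on")


def is_complete(record):
    """True when every required field is present and non-empty."""
    return all(record.get(name) not in (None, "") for name in REQUIRED_FIELDS)


def is_timely(record, as_of, max_age_days):
    """True when the record was collected within max_age_days before as_of."""
    collected = record.get("collected_on")
    if not isinstance(collected, date):
        return False
    return 0 <= (as_of - collected).days <= max_age_days


def dominant_units(records):
    """Most common unit recorded for each code."""
    units = defaultdict(Counter)
    for record in records:
        if record.get("code") and record.get("unit"):
            units[record["code"]][record["unit"]] += 1
    return {code: counter.most_common(1)[0][0] for code, counter in units.items()}


def quality_report(records, as_of, max_age_days=365):
    """Fraction of records passing each check; empty dict for no records."""
    total = len(records)
    if total == 0:
        return {}
    expected = dominant_units(records)
    consistent = sum(
        1
        for record in records
        if record.get("unit") and record.get("unit") == expected.get(record.get("code"))
    )
    return {
        "completeness": sum(is_complete(r) for r in records) / total,
        "timeliness": sum(is_timely(r, as_of, max_age_days) for r in records) / total,
        "consistency": consistent / total,
    }
```

A report like this is only a starting point: a low score shows where to look, not what the correct value should have been.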
While maintaining clean data for use by advanced analytic platforms can be challenging, there are steps that organizations should take to keep data ready for use in AI models. First, it’s important to have a strong data governance process to ensure accurate data and to distinguish good data from bad before feeding it to an AI model. It’s also important to verify lab results against the appropriate codes so that errors and incorrect codes are not built into the model. In one data set, we found that accuracy was as low as 30% because the data contained invalid codes and incorrect codes for the labs.
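To show what verifying lab results against codes might look like, the sketch below separates the two failure types mentioned above: an invalid code (one that does not exist in the reference set) and an incorrect code (one that exists but identifies a different test). The `REFERENCE` table, its placeholder identifiers, and the record fields are hypothetical; a real check would load the organization's chosen industry-standard terminology.

```python
from collections import Counter

# Hypothetical reference table: standard code -> the test it identifies.
# The identifiers are placeholders, not real terminology codes.
REFERENCE = {
    "STD-001": "glucose",
    "STD-002": "hemoglobin",
    "STD-003": "creatinine",
}


def classify(record):
    """Label a lab record as 'ok', 'invalid_code', or 'incorrect_code'."""
    code = (record.get("code") or "").strip().upper()
    test = (record.get("test") or "").strip().lower()
    if code not in REFERENCE:
        return "invalid_code"  # the code does not exist in the reference set
    if REFERENCE[code] != test:
        return "incorrect_code"  # the code exists but names a different test
    return "ok"


def accuracy(records):
    """Return (share of records labelled 'ok', tally of every label)."""
    tallies = Counter(classify(record) for record in records)
    total = sum(tallies.values())
    share_ok = tallies["ok"] / total if total else 0.0
    return share_ok, dict(tallies)
```

Tallying the two failure types separately is a deliberate choice here: invalid codes usually point to data entry or interface problems, while incorrect codes point to mapping problems, and the two tend to need different fixes.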
Ensuring alignment of data, and validating codes to an industry standard, will help to streamline the process. The richer the data used to train the AI, the better the outcome will be. Normalizing and mapping the data can help to reconcile data from multiple sources and authors. Mapping the information supports accuracy in the data and helps resolve discrepancies between sources.
Lastly, the team responsible for training the model should continually assess the data and make sure it understands that data, which will help to identify gaps or potential biases within the data itself. It’s important for that team to work with their data governance colleagues so that they are aware of any missing data, such as gaps in lab results and member data, and can remedy these gaps for more complete quality measure reporting.
By implementing these best practices, data can be used to its full potential to inform decision-making, increase quality, and enhance patient care.
Healthcare data can be messy, but creating a process where the data is properly assessed and cleaned can be beneficial in many ways beyond AI. It’s encouraging to see an industry that has historically moved slowly be so eager to adopt new technologies. While the opportunity for AI use in healthcare is great, we can’t forget the basics of data quality that are essential in determining the future success of these platforms. With this process, organizations can make better use of AI and maximize the accuracy of their models to help better serve patients.