Curbside Consult with Dr. Jayne 8/5/24
Throughout my career in healthcare IT, I’ve seen the unintended consequences that can accompany the implementation of new technologies. As an example, consider EHRs and how they made it easier for physicians to capture the details of the care they were providing and to bill accordingly. As a result, previously bell-shaped distributions of billing codes started skewing toward more complex (and therefore higher revenue) codes, leading to increased audits and insurer crackdowns. The additional documentation generated by EHRs was treated with more scrutiny, and some physicians became reluctant to use the solutions that were supposed to make things better, manually lowering calculated billing codes to avoid the hassle of audits.
As clinicians begin to incorporate technologies such as generative AI into daily practice, it’s important for researchers to diligently assess these solutions to ensure that they enable safe care and to monitor for unintended consequences. Every time I see a real-world study addressing this issue, it reminds me how rewarding it can be to practice clinical informatics. A study published last week in JAMA Network Open looked at the use of large language models to generate responses to the messages that patients send to their care teams through EHR patient portals.
The first thing that I noticed about the article was the list of author affiliations. Although the authors were all from New York University, they represented not only the NYU Grossman School of Medicine, but also the NYU Stern School of Business and the NYU Tandon School of Engineering. The specific question they were investigating was this: can generative artificial intelligence (GenAI) chatbots aid patient-health care professional (HCP) communication by creating high-quality draft responses to patient requests?
The study was conducted at NYU Langone Health, specifically using responses that were created in three internal medicine practices that were piloting a generative AI solution. Sixteen primary care physicians were then asked to evaluate messages but were blinded to whether the messages were drafted by GenAI or by human healthcare professionals.
The primary care physicians who evaluated the messages were recruited from the organization’s internal medicine listserv, with only 16 of 1,189 physicians volunteering. That’s barely more than 1%, which may seem surprising at face value but really isn’t, given the stresses that many primary care physicians face on a daily basis. The sample was 50% female, with practice locations split between NYU Langone Health, Bellevue Hospital, and the Manhattan Veterans Affairs hospital. They rated the messages on content quality, communication quality, and whether the reviewer felt a draft was usable or would prefer to start over with their own response.
During an initial survey, reviewers received five to eight pairs of responses without any follow-up questions. A subsequent survey contained 15 to 20 pairs of responses with additional follow-up questions to assess characteristics such as empathy, personalization, and professionalism. The response pairs for the first survey were drawn from 200 random in-basket messages that were extracted from the organization’s EHR in September 2023. Messages that required outside context, such as laboratory results or medications, were excluded. Those for the second survey were pulled a couple of weeks later, from an initial sample of 500 messages, with the same exclusions applied.
The study corroborated one finding that we’ve seen before: GenAI responses may demonstrate greater empathy than human-crafted messages. However, I was surprised by some of the other findings. AI-generated responses tended to be “longer, more linguistically complex, and less readable” than those created by human respondents. The authors concluded that these characteristics could be problematic for patients with lower health literacy or for those whose primary language is not English.
The authors also found that certain types of messages, including those involving laboratory results, may need enhanced prompt engineering to be useful. They noted some limitations to the study, including the fact that it was conducted at a single facility and that the sample size was small. It would be interesting to see how physicians at community hospitals or community health clinics would rate the responses in comparison to colleagues who are practicing at larger medical centers or hospital-affiliated clinics. They also noted that they didn’t assess whether templates were used for those extracted messages that were drafted by healthcare providers and recommended that templated responses should be treated as a separate comparison group in future studies.
It will be interesting to see how similar responses might be graded over time, as people become more used to seeing AI-generated responses. Similarly, technologies may evolve to include more human or colloquial speech patterns in AI-generated drafts. Those of us who have moved from one region of the country to another, or who have transitioned from academic medical center environments to community health centers, may have seen our own speech and writing patterns change accordingly. This may also vary generationally, depending on when physicians completed their residency training, and by specialty.
For example, some specialty training programs, including primary care, give more attention to health literacy and communication topics than do others, such as the procedural subspecialties. As a primary care physician, when I’m graded on how well I can use words to convince my patients to receive a vaccine or to go for a colonoscopy, I think much more carefully about what I’m saying and how I say it than others who are not scored in such a manner. As large language models evolve and appropriate feedback is applied, we should see responses that grow closer to what we need to provide the best care for our patients.
I’ll be on the lookout for additional studies on these topics, but I know my limits as far as being able to see everything that turns up in the literature. Here’s hoping that my colleagues clue me in when they see an article addressing one of these topics, and I always appreciate it when our readers give us a heads up that something interesting is available for our perusal.
What do you think about using AI-generated drafts to help clinicians respond to patient messages? Are you using it in your organization and how is it going? Leave a comment or email me.
Email Dr. Jayne.