
EPtalk by Dr. Jayne 4/23/26


It’s all AI this week.

An article in the European Heart Journal examines whether AI analysis of mammography can go beyond identifying breast cancer to also predict the risk of heart disease. The study involved 123,000 women who underwent imaging at the Mayo Clinic and Emory University and who were followed for a median of seven years for events such as heart attack, heart failure, and stroke. The authors found a correlation between calcifications in breast arteries and major cardiovascular events. Additional research is needed to reproduce the results across different scanner manufacturers and patient populations, but this could be a promising way to screen women for cardiac risk without additional radiation exposure or time away from work or home.

Several new articles looked at the accuracy of AI tools. One, published in BMJ Open, examined how general purpose AI tools handle the kinds of questions that patients might ask about health topics that are prone to misinformation, such as cancer, stem cells, vaccines, nutrition, and athletic performance.

The authors used 10 prompts in each category to test Gemini, DeepSeek, Meta AI, ChatGPT, and Grok. They used a “red teaming” strategy to try to provoke misinformation and had two experts assess each response, scoring how problematic it was against predefined criteria. They also scored the responses for completeness, accuracy, and readability.
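For readers who like to see the mechanics, here is a rough Python sketch of how that kind of evaluation pipeline might be wired together. The model names, rubric fields, and scores below are my own illustration, not the authors' actual instrument.

```python
# Illustrative sketch of a red-teaming evaluation harness; the rubric,
# models, and scores are invented for the example, not taken from the study.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Rating:
    model: str
    topic: str
    problem_score: int  # 0 = acceptable, 1 = problematic, 2 = highly problematic
    accuracy: int       # rubric points assigned by a human expert
    completeness: int

def problematic_share(ratings: list[Rating]) -> dict[str, float]:
    """Fraction of each model's responses flagged as problematic."""
    by_model: dict[str, list[int]] = {}
    for r in ratings:
        by_model.setdefault(r.model, []).append(1 if r.problem_score > 0 else 0)
    return {model: mean(flags) for model, flags in by_model.items()}

# Two experts would score each response independently and reconcile
# disagreements before summarizing (not shown here).
ratings = [
    Rating("Gemini", "vaccines", 0, 4, 5),
    Rating("Grok", "stem cells", 2, 1, 3),
]
print(problematic_share(ratings))  # {'Gemini': 0.0, 'Grok': 1.0}
```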

About half of responses were identified as problematic, with nearly 20% being highly problematic. Grok generated more highly problematic scores than expected under a random distribution. Gemini had the fewest problematic responses.

The paper notes that responses were “consistently expressed with confidence and certainty,” and that out of 250 questions, only two resulted in failure to answer. Both of those were from Meta AI.

Readability was also challenging, with responses scoring at a college reading level.
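Computing a reading grade level is easy to automate. The snippet below uses the third-party textstat package to calculate a Flesch-Kincaid grade; the paper doesn't specify which formula the authors used, so treat the choice of metric as an assumption.

```python
# A minimal readability check using the "textstat" package (pip install textstat).
# Flesch-Kincaid is one common grade-level formula; the study's exact metric
# is not stated, so this is an assumption.
import textstat

response = (
    "Messenger RNA vaccines instruct cells to produce a harmless protein "
    "that trains the immune system; they do not integrate into the genome."
)

grade = textstat.flesch_kincaid_grade(response)
print(f"Flesch-Kincaid grade level: {grade:.1f}")
# Scores of 13 or higher correspond roughly to college-level text, well above
# the sixth-to-eighth-grade level usually recommended for patient materials.
```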

One limitation of the study was that scoring was not blinded. The experts knew which model had produced each answer that they were evaluating.

The authors note that AI tools are rapidly evolving and that their findings might not be generalizable to other tools or to paid versions of the free tools that they studied.

The questions included whether 5G technology or antiperspirants cause cancer, whether mRNA vaccines alter the human genome, how much raw milk is needed for health benefits, and whether anabolic steroids are safe. The article highlights hallucinations and fabrications in the requested references:

Models appear to recognize their referencing limitations but are unable to address them. In one of our interactions with DeepSeek, the model acknowledged that its references were generated from patterns in training data “and may not correspond to actual, verifiable sources.” It also noted that “Synthesized references (e.g., author names, publication dates, or titles) might be approximate, outdated or occasionally fictional.” When we asked ChatGPT-3.5 to explain its poor citation reliability, it responded: “ChatGPT may fabricate information to maintain the appearance of completeness — even if that means sacrificing accuracy.” Our data support the contention that the current generation of chatbots may not be suitable for tasks requiring a high degree of factual accuracy or verifiable sources.
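One practical takeaway is that chatbot-supplied citations can be spot-checked programmatically. The sketch below queries Crossref's public works API to see whether a cited title resolves to a real DOI; the example citation is invented for illustration, and even a match still needs a human to confirm the paper says what the chatbot claims.

```python
# Hedged sketch: spot-check a chatbot citation against Crossref's public API.
# The example title below is invented for illustration.
import requests

def crossref_lookup(title: str) -> str | None:
    """Return the DOI of Crossref's best match for a title, if any."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0]["DOI"] if items else None

doi = crossref_lookup("Raw milk consumption and immune outcomes: a review")
print(doi or "No match found; the citation may be fabricated")
# A returned DOI is only a candidate match. A human still has to verify that
# the matched record actually supports the chatbot's claim.
```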

This study was designed to approximate a patient-driven conversation. I wonder how many patients are aware of the degree to which various models simply make things up? Early in my career when patients began embracing the internet as a source of medical information, we had conversations about which sources were reputable and which might not be the best. At one point, I had a patient-facing handout that listed my favorite reputable sources. I wonder how many organizations might be creating these kinds of educational materials for patients to help navigate the world of AI-generated medical advice.

A clinician-focused article in JAMA Network Open evaluates whether off-the-shelf large language models (LLMs) can perform reliably in clinical situations. The study’s objective was “to evaluate the longitudinal clinical reasoning ability of state-of-the-art LLMs and to introduce a multidimensional, clinically meaningful benchmark for clinical-grade artificial intelligence (AI).” They identified 21 LLMs to evaluate, including GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash and Pro, and Grok 4.

They created 29 clinical vignettes and had medical students score the performance of the models across five clinical reasoning domains: differential diagnosis, diagnostic testing, final diagnosis, management, and miscellaneous clinical reasoning questions. The clinical vignettes included basic patient demographics such as age and sex, as well as a history of present illness, review of systems, physical examination findings, and laboratory results.
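For those wondering what such a benchmark might look like in code, here is a purely illustrative Python sketch of a vignette and its stepwise questions. The field names, case details, and prompts are my assumptions, not the authors' actual schema.

```python
# Purely illustrative vignette structure for a stepwise clinical benchmark;
# field names, case content, and prompts are assumptions, not the study's.
from dataclasses import dataclass, field

TASKS = [
    "differential diagnosis",
    "diagnostic testing",
    "final diagnosis",
    "management",
    "miscellaneous clinical reasoning",
]

@dataclass
class Vignette:
    demographics: str
    hpi: str   # history of present illness
    ros: str   # review of systems
    exam: str
    labs: str
    questions: dict[str, str] = field(default_factory=dict)  # task -> prompt

case = Vignette(
    demographics="54-year-old woman",
    hpi="Two days of pleuritic chest pain after a long flight",
    ros="No fever; mild shortness of breath",
    exam="Tachycardic; lungs clear to auscultation",
    labs="D-dimer elevated",
    questions={t: f"Given the case so far, address: {t}" for t in TASKS},
)

# Each question is posed to the model in sequence, and human raters score the
# response for that task before the next step of the case is revealed.
for task in case.questions:
    print(task)
```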

The authors found that the models performed well at answering diagnosis and management questions, but worst when asked to formulate a differential diagnosis. They concluded that current models are limited in their ability to conduct diagnostic reasoning and should not be used for unsupervised patient-facing clinical decision-making.

One limitation of the work is that the researchers presented the vignettes in a stepwise question-and-answer format, which might not be how clinicians use these tools in the real world. The vignettes were also publicly available, so there is no guarantee that the models hadn't seen them before the experiment. The assessment also didn't include additional features such as calculators, agentic tools, access to guidelines, or retrieval-augmented generation. The authors noted that this means “the results reflect baseline longitudinal clinical reasoning rather than maximal achievable performance.”

They go on to say that “the findings of this study caution against vendor claims that general purpose, off-the-shelf LLMs are ready for patient-facing clinical use.” Based on my use of several of the models as I complete my continuing medical education coursework, I would agree. I’ve seen enough questionable answers to remain wary, and often wonder how well clinicians with less experience can identify the answers as problematic or whether they just take them for fact. The authors suggest that the “most responsible role” for off-the-shelf LLMs at this time is “targeted, clinician-supervised use in low-uncertainty tasks.”

Is your institution using commercial LLMs for patient care? Have they restricted access to such tools via network controls, forcing clinicians to furtively access them on their phones? Leave a comment or email me.

Email Dr. Jayne.


