Readers Write: Lessons from the ChatGPT Health Debate
By Robert Stewart
Robert Stewart is CTO of Arbital Health.
A recent column by Geoffrey Fowler in The Washington Post describing his disappointing experience with ChatGPT Health sparked discussion in the health IT community. While many remain optimistic about the long-term potential of platforms such as ChatGPT Health and Claude for Healthcare, Fowler’s piece highlights issues that healthcare leaders, clinicians, and technologists should examine carefully.
Variability and inaccuracy are not unique to large language model (LLM)-based systems. Many clinical diagnostics have known false-positive rates, and repeat testing is routine when results are unexpected. Clinicians themselves may reach different conclusions when presented with the same clinical information months later. Medicine has always operated within a probabilistic framework.
What is different with LLM-driven systems is their non-deterministic behavior when given the same input repeatedly. Identical prompts can generate materially different responses. Fowler demonstrated this when ChatGPT assigned his cardiac health scores ranging from a B to an F using the same underlying data. That level of variability can cause confusion or anxiety when applied to personal health interpretation.
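This run-to-run variability is a natural consequence of temperature-based sampling at generation time. A minimal sketch of the mechanism (the tokens, logits, and "health grade" framing below are invented for illustration, not OpenAI's actual internals):

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature, rng):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical letter-grade tokens with fixed model scores for the SAME prompt.
tokens = ["A", "B", "C", "D", "F"]
logits = [2.0, 2.2, 1.5, 0.8, 0.7]  # "B" is only slightly favored over "A"

rng = random.Random()
grades = [sample(tokens, logits, temperature=1.0, rng=rng) for _ in range(10)]
print(grades)  # identical input, yet the sampled grade varies run to run
```

Because the model's probability mass is spread across several plausible answers rather than concentrated on one, identical inputs can legitimately yield different outputs unless the temperature is forced toward zero.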
Many consumer health AI tools are built on retrieval-augmented generation (RAG) architectures, in which the model is grounded using user-specific information such as medical records or wearable device data. Even when anchored to structured inputs, however, the LLM’s narrative interpretation can still vary, reinforcing the need for clinician oversight and appropriate guardrails when deploying these tools in consumer health settings.
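The retrieval step of a RAG pipeline can be sketched in a few lines. The records, query, and keyword-overlap scorer below are invented stand-ins; a production system would use embeddings and a vector store, but the shape of the grounding step is the same:

```python
# Minimal sketch of RAG retrieval over user-specific records (all data hypothetical).

def score(query, document):
    """Count shared words between the query and a record snippet."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d)

def retrieve(query, records, k=2):
    """Return the k records most relevant to the query."""
    return sorted(records, key=lambda r: score(query, r), reverse=True)[:k]

records = [
    "2024-03-01 cholesterol panel: LDL 128 mg/dL, HDL 52 mg/dL",
    "2024-05-12 resting heart rate trend from wearable: 58-64 bpm",
    "2023-11-20 flu vaccination administered",
]

query = "How is my heart rate and cholesterol trending?"
context = retrieve(query, records)

# The retrieved snippets anchor the model to the user's own data, but the
# narrative the LLM generates from this context can still vary run to run.
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
print(prompt)
```

Grounding constrains what facts the model can draw on; it does not make the generated interpretation of those facts deterministic, which is why oversight layers still matter.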
It’s also important to recognize the potential psychological impact of these tools. Researchers such as Eric Topol caution against indiscriminate screening of asymptomatic individuals because it often produces “incidentalomas” (findings that lead to unnecessary follow-up testing or treatment without improving outcomes). Consumer AI health scoring systems risk amplifying this phenomenon by continuously surfacing probabilistic interpretations in the absence of appropriate clinical context.
Wearable Data Challenges
Wearable device data introduces another layer of complexity. Anyone who works with longitudinal wearable datasets understands that the signal-to-noise ratio is inconsistent. Devices are removed for charging, replaced every few years, or switched across vendors that have different calibration baselines. Environmental and behavioral factors such as travel, altitude changes, illness, stress, or sleep disruption can produce statistically significant physiological changes that an AI system may misinterpret without broader context.
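Some of those artifacts, charging gaps and a calibration shift after a vendor switch, can be spotted with simple checks before any interpretation is attempted. A sketch over synthetic daily resting heart-rate data (all values invented):

```python
# Synthetic daily resting heart rate; None marks days the device was off charging,
# and the jump midway mimics a calibration shift after switching vendors.
daily_hr = [62, 61, 63, None, None, 62, 64,
            70, 71, 69, 70, 72]

def find_gaps(series):
    """Indices with no recorded value (device removed, battery dead, etc.)."""
    return [i for i, v in enumerate(series) if v is None]

def find_level_shift(series, window=3, threshold=5):
    """Return the approximate onset index of a jump in the rolling mean."""
    vals = [(i, v) for i, v in enumerate(series) if v is not None]
    for j in range(window, len(vals) - window + 1):
        before = sum(v for _, v in vals[j - window:j]) / window
        after = sum(v for _, v in vals[j:j + window]) / window
        if abs(after - before) > threshold:
            return vals[j][0]
    return None

print(find_gaps(daily_hr))        # days with missing data
print(find_level_shift(daily_hr)) # approximate onset of the baseline shift
```

A naive system that skips checks like these could read the vendor-switch jump as a genuine physiological change, which is exactly the misinterpretation risk described above.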
Jessilyn Dunn, PhD, and her lab at Duke University have conducted extensive research that uses machine learning and statistics to extract valuable insights from consumer wearables, but the work remains challenging. Even highly targeted machine learning applications, such as arrhythmia detection platforms developed by companies like AliveCor, still operate with non-trivial false-positive rates. Wrapping a general-purpose LLM around wearable data without similarly rigorous modeling layers is unlikely to deliver clinically reliable outputs.
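Bayes’ rule makes the false-positive point concrete: at low prevalence, even a sensitive and fairly specific screen yields mostly false alarms. The sensitivity, specificity, and prevalence figures below are assumed for illustration only, not any vendor’s published numbers:

```python
# Worked example (hypothetical figures): positive predictive value of a screen
# applied broadly to a low-prevalence population.

def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(condition | positive test) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(sensitivity=0.95, specificity=0.95, prevalence=0.02)
print(f"PPV: {ppv:.1%}")  # roughly 28%: most positives are false alarms
```

This is the arithmetic behind the incidentaloma concern: screening everyone, rather than a clinically indicated subgroup, shifts the prevalence term down and drags the predictive value of a positive result with it.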
Security and Privacy Considerations
As consumer AI health tools evolve, security becomes increasingly important. Anyone who uses ChatGPT, particularly those who are sharing sensitive health information, should enable multi-factor authentication (MFA), which is one of the most effective controls for reducing account compromise risk.
Users should also recognize an important regulatory distinction. Information that is entered into consumer AI services is generally not protected under HIPAA. OpenAI’s enterprise offering, ChatGPT for Healthcare, is designed for HIPAA-covered environments and supports Business Associate Agreements (BAAs), but consumer versions operate under different legal frameworks.
The Takeaway for Health IT Leaders
The lesson from Fowler’s experience is not that consumer health AI lacks value, but that context, governance, and clinical integration matter. Non-deterministic systems that interpret noisy consumer data can easily generate variable outputs that users may misunderstand as clinical conclusions rather than probabilistic insights.
For health systems, payers, and digital health innovators, the near-term opportunity lies in combining LLM interfaces with validated predictive models, strong clinical workflow integration, and transparent communication about uncertainty. Without those guardrails, even well-intentioned consumer health AI tools risk creating confusion rather than clarity.