Readers Write: “The Illusion of Thinking”: Implications for Healthcare
By Vikas Chowdhry
Vikas Chowdhry, MS, MBA is founder and CEO of TraumaCare.ai.
If you are even moderately interested in AI, you have by now surely seen the flood of social media comments and responses to Apple’s paper titled “The Illusion of Thinking.” But in case you have been living under the AI rock, here’s a brief summary.
In this paper, the authors show that today’s large reasoning models (LRMs such as OpenAI o3-mini, DeepSeek-R1, Claude 3.7 Sonnet-Thinking) — systems that explicitly generate long chains-of-thought — really do think more, but not necessarily better. On carefully designed puzzle tasks, they beat ordinary LLMs only in a narrow middle band of difficulty and then collapse outright as problems grow harder.
As expected, the comments run the gamut, from “the sky is falling” to “not a big deal, they will figure out a way to overcome or fix this.” While I am not in the “sky is falling” camp, I do think that this paper raises some important questions with special implications for healthcare. Any healthcare organization (or vendor) that is using or developing a product based on LLMs/LRMs will need to think deeply about these issues, have a strategy for running its own similar evaluations, and hopefully share the results publicly.
Here are four key findings from the paper and my take on the implications of each for healthcare.
#1. Impact of complexity on reasoning performance
The authors identify three performance regimes as problem complexity rises:
- Low complexity: standard LLMs are more accurate and efficient than LRMs.
- Medium complexity: LRMs pull ahead.
- High complexity: both collapse to zero.
[Figure from the Apple paper: performance of LRMs (solid lines) and LLMs (dotted lines) across low-, medium-, and high-complexity puzzles.]
Questions worth asking about any product built on these models:
- How will you define complexity thresholds in your workflow?
- Does your system dynamically choose between an LLM and an LRM based on a case’s difficulty?
- Can it detect when a case crosses a threshold and alert the clinician instead of forging ahead with low-quality output? (A minimal routing sketch follows below.)
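To make this concrete, here is a minimal, hypothetical routing sketch in Python. The complexity features, thresholds, and model tiers are placeholders for illustration only; a real deployment would need a validated complexity measure and cutoffs tuned on its own evaluation data.

```python
# Hypothetical sketch: route a case to a standard LLM, a reasoning model,
# or a human based on an estimated complexity score. The scoring function,
# thresholds, and tiers are illustrative placeholders, not a validated
# triage method.
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    STANDARD_LLM = "standard_llm"    # cheap and accurate on simple cases
    REASONING_LRM = "reasoning_lrm"  # better in the medium-complexity band
    HUMAN_REVIEW = "human_review"    # past the collapse point, trust neither


@dataclass
class CaseFeatures:
    num_active_problems: int
    num_medications: int
    conflicting_findings: int


def estimate_complexity(case: CaseFeatures) -> float:
    """Toy complexity score; a real one would be validated against outcomes."""
    return (case.num_active_problems
            + 0.5 * case.num_medications
            + 2.0 * case.conflicting_findings)


def route_case(case: CaseFeatures,
               low_threshold: float = 5.0,
               high_threshold: float = 15.0) -> Route:
    score = estimate_complexity(case)
    if score < low_threshold:
        return Route.STANDARD_LLM
    if score < high_threshold:
        return Route.REASONING_LRM
    # Beyond the band where both model families collapse in the paper,
    # escalate to a clinician instead of returning low-quality output.
    return Route.HUMAN_REVIEW


if __name__ == "__main__":
    case = CaseFeatures(num_active_problems=8,
                        num_medications=12,
                        conflicting_findings=3)
    print(route_case(case))  # -> Route.HUMAN_REVIEW for this example
```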
#2. Collapse of reasoning effort at a critical complexity
LRMs spend more tokens as tasks get more complex until a critical point, after which they give up and begin to reduce their reasoning effort despite increasing problem difficulty. This behavior suggests a fundamental scaling limitation in the thinking capabilities of current reasoning models relative to problem complexity.
Let’s say your product helps detect malignant tumors or transcribes ambient conversations using LLMs/LRMs.
- In operational mode, does it have mechanisms to detect that a case has crossed a complexity threshold and that the model is giving up, so that humans can stop relying on it for that case? (A monitoring sketch follows below.)
- If the product was sold as a tool that lets your APPs take on more primary care responsibilities and it has now given up, what is your recommendation for the NP who was relying on it?
- What if your product doesn’t even have the awareness that it has given up and the NP continues to rely on its output? Who owns the risk for a misdiagnosis?
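One way to operationalize the first question is to monitor the model’s reasoning-token usage against estimated case complexity and alert a human when effort falls while complexity rises, which the paper identifies as the signature of the model giving up. The sketch below assumes hypothetical field names and a made-up window size.

```python
# Hypothetical monitoring sketch: flag when reasoning effort (thinking
# tokens) drops even though estimated complexity keeps rising. The
# 'complexity' and 'thinking_tokens' fields and the window size are
# assumptions for illustration.
from statistics import mean
from typing import Callable, Dict, List


def effort_is_collapsing(history: List[Dict], window: int = 5) -> bool:
    """history: recent requests, each with 'complexity' and 'thinking_tokens'."""
    if len(history) < 2 * window:
        return False  # not enough data to compare two windows
    earlier = history[-2 * window:-window]
    recent = history[-window:]
    complexity_rising = (mean(r["complexity"] for r in recent)
                         > mean(e["complexity"] for e in earlier))
    effort_falling = (mean(r["thinking_tokens"] for r in recent)
                      < mean(e["thinking_tokens"] for e in earlier))
    return complexity_rising and effort_falling


def handle_request(history: List[Dict],
                   notify_clinician: Callable[[str], None]) -> None:
    if effort_is_collapsing(history):
        # Don't silently keep serving answers; tell the human to stop
        # relying on the tool for cases in this complexity range.
        notify_clinician("Model reasoning effort is collapsing on complex "
                         "cases; treat its output here as unreliable.")
```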
#3. Over-thinking & self-correction limits
For simpler problems, reasoning models often find the correct solution early in thinking, but then continue exploring incorrect solutions (overthinking). As problems become moderately more complex, this trend reverses: correct answers appear only late. For hard tasks they never appear (“collapse” as discussed earlier).
- Over-thinking wastes compute and drives up cost.
- Yet aggressively pruning the chain of thought might remove the only path to a correct answer on tougher cases.
- Your system therefore needs complexity-aware throttling, not a one-size-fits-all token limit (see the sketch below).
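Here is a minimal sketch of what complexity-aware throttling could look like. The tiers and budgets are made up for illustration and would need to be tuned on your own evaluation data; the client call at the end is an assumed wrapper, not a specific vendor API.

```python
# Hypothetical sketch of complexity-aware throttling: choose a
# thinking-token budget per case instead of one global cap.
def thinking_budget(complexity_score: float) -> int:
    if complexity_score < 5.0:
        # Simple cases: the paper finds correct answers appear early,
        # so a tight budget avoids paying for overthinking.
        return 1_000
    if complexity_score < 15.0:
        # Moderate cases: correct answers tend to appear late in the
        # chain of thought, so pruning aggressively here is risky.
        return 8_000
    # Very hard cases: extra tokens don't help once the model collapses;
    # spend the budget on escalation to a human instead.
    return 0


# Example usage with a hypothetical client wrapper:
# response = client.generate(prompt,
#                            max_thinking_tokens=thinking_budget(score))
```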
#4. No benefit from explicit algorithms
Prompting with a known algorithm for solving the problem does not improve performance. This indicates weaknesses in faithfully executing step-by-step logic, not just in discovering it.
A healthcare organization may have explicit clinical guidelines for certain use cases and will want the AI product to follow them whenever they apply. However, the results of this paper show that an LLM/LRM-based AI product may not be able to execute an algorithm based on those guidelines even when it is explicitly programmed into the system.
- Embedding clinical guidelines verbatim is not enough.
- You must verify that the model can faithfully execute those step-by-step protocols under real-world complexity (a simple evaluation harness is sketched below).
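One way to do that verification is to encode the guideline deterministically in ordinary code and measure how often the model’s step-by-step answer matches it. The sketch below uses a toy escalation rule and an assumed `ask_model` wrapper; neither is a real clinical guideline or a specific vendor API.

```python
# Hypothetical evaluation harness: run the model on cases where a guideline
# has a deterministic answer, compute that answer in code, and measure how
# often the model's execution matches it.
from typing import Callable, Dict, List


def reference_guideline(case: Dict) -> str:
    """Deterministic 'ground truth' execution of a toy escalation rule."""
    score = case["fever"] + case["tachycardia"] + case["hypotension"]
    return "escalate" if score >= 2 else "routine"


def evaluate_guideline_fidelity(cases: List[Dict],
                                ask_model: Callable[[Dict], str]) -> float:
    """Fraction of cases where the model's answer matches the reference."""
    matches = sum(ask_model(case).strip().lower() == reference_guideline(case)
                  for case in cases)
    return matches / len(cases)


# Example usage with an assumed model wrapper:
# fidelity = evaluate_guideline_fidelity(test_cases, ask_model=my_llm_wrapper)
# A fidelity well below 1.0 on cases your clinicians consider routine is a
# signal the model cannot be trusted to execute the protocol verbatim.
```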
AI progress is breathtaking, yet deploying it in high-risk domains like healthcare demands transparent, domain-specific safety testing. This paper is a timely reminder that such work takes time, expertise, and openness. Sharing evaluation results will accelerate safe adoption for the entire industry.