
Readers Write: Two Curves, One Hospital Server Room: Why On-Prem AI in Healthcare Is Inevitable

April 22, 2026

By Thomas Kantecki

Thomas Kantecki is a health informatics student at the University of Central Florida.


Model efficiency and GPU power are compounding quickly. Where they intersect, sending protected health information to somebody else’s data center stops being the smart tradeoff that it was in 2020.

I think health IT is headed for a reversal that most people who are running clinical infrastructure have not fully priced in yet. My prediction is that within the next five to 10 years, a meaningful share of hospitals and health systems will run their clinical AI workloads on hardware that they own, inside their own networks, on data that never touches a WAN link. 

The driver is not a regulatory panic and not a breach cycle. It is math. Two curves are bending at the same time, and where they meet, sending protected health information to external data centers stops being the smart tradeoff that it was in 2020.

The First Curve: Models Are Becoming More Efficient

Five years ago, the idea that a hospital could run a capable large language model on a single workstation would have been laughable. Today it is routine.

The reason is not that hospitals bought bigger computers. It is that the models got smaller for the same quality.

A few things drove this. Mixture-of-experts architectures mean a 400B-parameter model can run with only 20B or 30B parameters active for any given token, collapsing the compute and memory-bandwidth cost of inference even though the full weights stay resident. Quantization techniques like Q4 and FP8 cut the memory footprint of a model by 2x to 4x with minimal quality loss. Distillation has made smaller models remarkably competitive with their larger teachers on the specific tasks hospitals actually use them for: summarization, structured extraction, ambient documentation, code assistance.
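
To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python. The model sizes, the active/total MoE split, and the bytes-per-parameter figures are illustrative assumptions, not the specs of any shipping model.

```python
# Back-of-the-envelope VRAM math for model weights at different precisions.
# The model sizes and the MoE active/total split below are illustrative
# assumptions, not specifications of any shipping model.

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "Q4": 0.5}

def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB, ignoring small runtime overheads."""
    return params_billion * BYTES_PER_PARAM[precision]

for size in (70, 400):
    for precision in ("FP16", "FP8", "Q4"):
        print(f"{size}B @ {precision}: ~{weight_gb(size, precision):.0f} GB of weights")

# MoE caveat: a hypothetical 400B model with 25B active parameters per token
# still needs all 400B resident in memory, but each token reads only the
# active experts, so per-token compute scales with 25B, not 400B.
ACTIVE_B, TOTAL_B = 25, 400
print(f"MoE per-token compute fraction: ~{ACTIVE_B / TOTAL_B:.0%}")
```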

The RTX PRO 6000 Blackwell, a single workstation GPU, can run a 70B-parameter model at FP8 with room left over for the KV cache. Five years ago, you needed a small cluster to do that. The curve shows no signs of flattening. The same techniques that put 70B on a desktop are now putting 400B on a server, with better clinical performance than the 175B-class models that defined the frontier just a few years ago.
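
As a rough sanity check on that claim, here is a minimal sketch that assumes a typical 70B geometry (80 transformer layers, 8 grouped-query KV heads, head dimension 128) and the card's 96 GB of VRAM. The geometry is an assumption, not a quoted spec of any particular model.

```python
# Rough fit check: 70B weights at FP8 plus KV cache on a single 96 GB GPU.
# The geometry below (80 layers, 8 grouped-query KV heads, head dim 128)
# is an assumed, typical 70B layout, not a quoted spec.

GPU_GB = 96              # RTX PRO 6000 Blackwell VRAM
WEIGHTS_GB = 70 * 1.0    # 70B parameters at FP8, one byte each

LAYERS, KV_HEADS, HEAD_DIM, KV_BYTES = 80, 8, 128, 1  # FP8 KV cache

def kv_cache_gb(tokens: int) -> float:
    """K and V tensors for every layer, per token, summed over the context."""
    per_token_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES  # ~160 KB
    return tokens * per_token_bytes / 1e9

for ctx in (8_000, 32_000, 128_000):
    total = WEIGHTS_GB + kv_cache_gb(ctx)
    verdict = "fits" if total < GPU_GB else "does not fit"
    print(f"{ctx:>7}-token context: ~{total:.0f} GB total ({verdict} in {GPU_GB} GB)")
```

Even a 128,000-token context lands around 91 GB under these assumptions, which is why a single card with headroom for the KV cache is no longer exotic.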

The Second Curve: GPU Power Is Compounding Fast

Hardware is advancing just as quickly. Blackwell delivered roughly 2.5x better throughput per dollar than the prior Hopper generation, with inference costs dropping by up to 10x for workloads that can exploit the architecture. Nvidia is not slowing down, AMD is closing in on memory-bandwidth parity for inference, and purpose-built inference silicon from Groq and Cerebras has shown that the token-per-second ceiling has not been reached.

For hospital procurement, this means a capital purchase made in 2026 to run inference on site will be outperformed by one made in 2028, which will in turn be outperformed by one made in 2030. The depreciation curve is real. But it is the same curve that makes each successive refresh cycle dramatically more capable for the same rack space and power budget.
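
To put rough numbers on that refresh cycle, here is a small sketch that simply extrapolates the roughly 2.5x per-generation throughput-per-dollar figure cited above on a two-year cadence. The constant rate is an assumption for illustration, not a forecast.

```python
# Refresh-cycle math: extrapolate the ~2.5x per-generation throughput-per-
# dollar gain cited above on a two-year cadence. The constant rate is an
# illustrative assumption, not a forecast.

GAIN_PER_GENERATION = 2.5
BASELINE_YEAR = 2026

for year in (2026, 2028, 2030):
    generations = (year - BASELINE_YEAR) // 2
    relative = GAIN_PER_GENERATION ** generations
    print(f"{year}: ~{relative:.1f}x the 2026 throughput per dollar")
# 2026: 1.0x, 2028: 2.5x, 2030: ~6.2x for the same capital budget
```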

Where the Curves Cross

The interesting moment is not today. The interesting moment is what happens when these two curves meet somewhere around 2028 or 2029.

At that point, a midsized hospital will be able to run what is effectively a frontier-class model, fine-tuned on its own de-identified corpus, on a single rack of GPUs sitting in a room the compliance team already has physical control over.
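
For a sense of scale, here is a sketch of that single-rack claim under assumed numbers: a 400B-parameter model quantized to 4 bits, served from one 8-GPU node with 96 GB per card. None of these figures come from a specific product roadmap.

```python
# Sizing the "single rack" claim: a 400B-parameter model quantized to
# 4 bits, served from one 8-GPU node with 96 GB per card. All figures
# are illustrative assumptions.

PARAMS_B = 400
BYTES_PER_PARAM = 0.5          # 4-bit quantization
GPUS, GB_PER_GPU = 8, 96

weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~200 GB of weights
pool_gb = GPUS * GB_PER_GPU               # 768 GB of pooled VRAM

print(f"Weights: ~{weights_gb:.0f} GB; node VRAM: {pool_gb} GB; "
      f"~{pool_gb - weights_gb:.0f} GB left for KV cache and batching")
```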

The performance difference between that local model and a cloud-hosted frontier model will be small enough that it stops mattering for the clinical tasks that actually touch PHI, such as drafting discharge summaries, ambient scribing, chart synthesis, and structured data extraction from notes.

At that point, the calculus changes. Today, CIOs ask whether hospitals can afford to run this on-prem. By 2029, they will be asking why the hospital would send it off-prem at all.

Why Privacy Starts to Win the Argument

HIPAA Business Associate Agreements are legal instruments. They allocate liability and create enforceability. They do not change the physical reality that PHI is leaving the facility, traversing the public internet, and being processed on hardware that the hospital does not control, shared with tenants that the hospital has never met.

The HIPAA Security Rule update expected in 2026 removes the "addressable" designation for encryption and tightens requirements around asset inventories, network maps, and breach notification timelines. Those requirements are easier to satisfy when the GPU doing the inference sits in a rack the hospital owns.

The privacy argument is not new. What changes is that the performance penalty disappears. Today, running only a local model means accepting a real capability gap versus the frontier. In three to five years, that gap will be small enough that the privacy side of the ledger dominates.

This Is an Opinion, Not a Forecast

I want to be honest about what this is. This is a prediction based on two trend lines and a belief that, given the choice, compliance officers and patients both prefer the version where the data never leaves the building. I could be wrong. Cloud vendors will keep improving their HIPAA posture. Edge-inference startups might hit supply chain walls. Regulatory bodies might carve out cloud-friendly exceptions.

The underlying trajectory is hard to argue with, though. Models keep getting more capable per parameter. GPUs keep getting more capable per watt. Patient data keeps being the single most sensitive dataset in the entire economy. Put those three facts together and the eventual destination is obvious. Clinical AI inference belongs inside the hospital. It just does not belong there yet.

The server room is coming back. The only open question is when.




There are 2 comments on this article:

  1. Great points about both hardware and software. The one thing I think is missing from your calculations is the wetware: on-prem systems require employed or contracted human beings to keep them up and running (and backed up, failed over, upgraded, etc.). Especially for mission-critical systems, as these local models would be, qualified admins would be able to more or less set their own price.

  2. Thomas told me when he submitted his article that it was OK if I didn’t run it because people generally aren’t interested in what a student has to say. I disagreed, saying that I found his piece authoritative, well written, and a refreshing break from PR-written vendor pieces. Give him some encouragement if you enjoyed his work.






