The last 43 hours have been some of the most agonizing time I’ve spent in the IT trenches in recent memory. I’ve been working with a client on a small CMIO augmentation project, mostly helping them get organized from a governance and change control standpoint. It’s a mid-sized medical group, roughly 80 physicians, but none of them want to take time away from patient care to handle the clinical informatics duties. I suspect that this is because they’re mostly subspecialists and there’s no way the group would be willing to compensate them for the time they would miss from their procedural pursuits.
Until I arrived on the scene, the IT resources would just build whatever the physicians wanted, regardless of whether it made sense for everyone. This in turn led to a whole host of issues that are impacting their ability to take the upgrades they need to continue participating in various federal and payer programs.
I’ve been spending eight hours a week or so with them, mostly on conference calls as they work through a change control process. Much of my work has been in soothing various ruffled feathers and in trying to achieve consensus on issues that have to happen regardless, but I hope to get them in a good place where they can be well positioned for the challenges of shifting to value-based care. Nothing at their site has been on fire from an operational standpoint, and other than telling the IT team to stop building whatever people ask for, I haven’t had much interaction with them.
I stayed up late Saturday night working on a craft project (curse you, Pinterest), so I was awake when they called me in the wee hours of Sunday morning. It was the IT director. I could immediately tell he was in a panic. It took several minutes to calm him down. I was able to figure out that something had gone very, very wrong with their ICD code update.
Hospitals and providers have to update their codes every October 1 to make sure they have valid codes that can actually be sent out to billers. Most cloud-based vendors do the updates themselves and push them out to their clients, while non-cloud vendors that I have worked with provide a utility that allows the client to update their systems. Usually it’s no big deal, except for the vendors who are habitually late sending out their update packages and whose clients are cringing on September 30.
This particular client is on a non-cloud format and had planned to run the utility on their own. Although they had a solid plan with a lead resource and a backup resource, they never really anticipated having to use the backup resource. On the evening of the 30th, the lead resource became seriously ill and wasn’t able to do his duties. They decided to wait it out a day since they weren’t open on the weekend and see if he could handle it later in the weekend. When he was admitted to the hospital with appendicitis, it was clear that they would have to engage Plan B.
Although the backup resource had gone through the documentation, he had never run the utility or even seen it run. Apparently there was some confusion with a downtime playbook. Users were supposed to be dropped from the system before the backup cycle started and then were to be allowed back on the system after the code update was complete.
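For the technically inclined, the ordering the playbook called for (users off, then backup, then code update, then users back on) can be sketched as a chain of guarded steps, where each step refuses to run unless the previous one really finished. This is only a minimal Python illustration, not the vendor's utility or the client's actual playbook: `SystemState` and every function name below are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class SystemState:
    """Hypothetical stand-in for the real EHR environment."""
    active_users: set = field(default_factory=set)
    logins_enabled: bool = True
    backup_done: bool = False
    codes_updated: bool = False


def drop_users(state):
    # Block new logins first, then clear out anyone still connected.
    state.logins_enabled = False
    state.active_users.clear()


def run_backup(state):
    # Guard: never start a backup while users can touch the system.
    if state.active_users or state.logins_enabled:
        raise RuntimeError("refusing to back up: users are still on the system")
    state.backup_done = True


def run_code_update(state):
    # Guard: the code update only runs on top of a completed backup.
    if not state.backup_done:
        raise RuntimeError("refusing to update codes: no completed backup")
    state.codes_updated = True


def restore_access(state):
    # Guard: users come back only after the update has finished.
    if not state.codes_updated:
        raise RuntimeError("refusing to reopen: code update has not completed")
    state.logins_enabled = True


def run_playbook(state):
    # Run the steps in the order the playbook describes.
    for step in (drop_users, run_backup, run_code_update, restore_access):
        step(state)
    return state
```

The point of the guards is that a step started out of order fails loudly instead of proceeding, which matters most when the person running it has never seen the process before.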
Somehow the users weren’t forced to exit and ended up being on the system while the backups started. Once the analyst realized users were still on the system, he attempted to halt the backups, but instead, the ICD update was started. I’m not sure what happened next, but the bottom line is that the database became unresponsive and no one was sure what was going on. To make matters worse, the fail-over process failed and they couldn’t connect to the secondary/backup database either.
A couple of analysts had tried to work on it for a while and couldn’t get things moving, so they tried to reach the IT director, who didn’t answer. I can’t blame him since it was now somewhere near 1:00 a.m. After working their way through the department phone list, somehow I got the call. I’m not a DBA or an infrastructure expert, but I’ve been through enough disaster recovery situations to know how to keep a cool head and to work through the steps to figure out what happened. Since crossing to the IT dark side, I’ve had more late night phone calls for database disasters than I’ve had for patient care issues, but the steps are surprisingly similar.
Things were a bit worse than I expected since they couldn’t tell if the transaction logs had been going to the secondary database since we couldn’t connect to it. Even worse, I looked at the log of users who were on the system when it crashed and the senior medical director had been in, potentially documenting patient visits for the day. It took me at least 20 minutes to talk people down and get them calm before we could make a plan. The next several hours were spent working through various steps trying to get access to the secondary database to preserve patient safety. It was starting to look like a network switch might also have given up the ghost.
What surprised me the most was that they really didn’t have a disaster recovery plan. There were bits and pieces that had clearly been thought through, but other parts of the process were a blank canvas. Although there are plenty of clinical informatics professionals who are highly technical, it’s never a good sign when the physician consultant is calling the shots on your disaster recovery.
We engaged multiple vendors throughout the early morning as we continued troubleshooting issues. The IT director finally responded to our messages around 8:00 a.m. I realize it was Sunday morning, but he was supposed to be on call for issues due to the ICD code update and he frankly didn’t respond.
By 4:00 p.m. things were under control, with both the primary and recovery systems up and appearing healthy. My client created a fresh backup and decided to go ahead with the ICD code update. We weren’t sure how much of it had actually run given the aborted process from the night before. It appeared to be running OK initially, but after a while, it appeared that the process was hung. By this point, the team was stressed out and at the end of their proverbial ropes and there wasn’t any additional bench to draw from.
I finally persuaded them to contact the EHR vendor, thinking they would have had resources available since this was the prime weekend for ICD code updates even though my client was now more than a day late. It took several hours to get a resource to contact us back and then we had to work through the various tiers of support. Eventually midnight rolled around again and things still weren’t ready, increasing the anxiety as the team knew they’d have billing office users trying to access the system starting at 5:00 a.m.
Once we arrived at the correct vendor support tier (aka, someone who knew something), the team was run through checklist after checklist trying to figure out what was going on and whether we should continue to let it run or whether we should try to stop it.
The IT director finally made the decision at 6:00 a.m. that the practices should start the day on downtime procedures, and thank goodness they had a solid plan for that part of the disaster recovery game. The practices were given access to the secondary database in a read-only capacity for patient safety purposes and each site was said to have a “lockbox” with downtime forms. The group subscribes to a downtime solution that creates patient schedules, so they were quickly printed in the patient care locations along with key data for the patients who were already on the books for the day. Anyone who presented as a walk-in could be accessed through the secondary database.
At least on downtime procedures, users weren’t assigning any ICD codes to the patient charts since the utility hadn’t completed yet. It was restarted a couple of times and finally got its act together, completing around 4:00 p.m. Monday. After an hour or so of testing, we were able to let users back in the primary system to start catching up on critical data entry and billing.
Most of the day, though, was extremely stressful, not only for the IT team, but for everyone in the patient care trenches. It was also stressful for the patients since the group has a high level of patient portal adoption and there is no backup patient portal. Anyone who sent messages or refill requests or tried to pay their bills today was simply out of luck.
When an event like this hits your organization, all you want to do is just get through it. That’s not the hard part, though – the challenge is just beginning with the post-event review and attempts to determine the root cause of various breakdowns. It usually takes at least a couple of days to untangle everything and the work is not yet over. I’m happy to report that the analyst with the appendicitis did well in surgery and was discharged home before the EHR system was back online. I’m not sure having the primary analyst would have made a difference in this situation. I hope he continues to make a speedy recovery.
You never know when something like this is going to happen in your organization, and if you haven’t prepared for it or practiced your plan, you need to do so soon if not today. Similar to the practice of medicine, sometimes the most routine events can have significant complications.
Are you ready for a downtime? Is your disaster recovery plan solid? Email me.