How the CrowdStrike outage affected the University of Houston System


The largest IT outage in history last July crashed computers, canceled flights and disrupted hospitals all around the globe, but how did it affect the University of Houston System?

A single automated software update from CrowdStrike, a global IT security company, brought global economies to a sudden halt, revealing the world’s dependence on cybersecurity software.

Many Fortune 500 companies and universities use CrowdStrike’s cybersecurity software to detect and block hacking threats. But when CrowdStrike issued the July update to its signature cybersecurity software, known as Falcon, millions of computers around the world running Microsoft Windows crashed because of the way the update interacted with Windows.

Many people reported seeing the ‘Blue Screen of Death’ on their machines, and often only a manual intervention to delete the bad file could fix it, a slow, painstaking process when you consider how many UH devices needed to be reset this way.

Eric Mims, Director of Enterprise IT Security & ISO, was asked to describe how the UH System handled the outage.

When did you first discover it affected our system? 
We discovered the event as it was happening worldwide. Shane Vaz was still awake and started getting notifications of problems early on. He was online and researching the issue by 1:00am Friday morning. Shortly afterwards, Will Moon joined him, and they began evaluating how widespread the problem was at UH and the UH System. An emergency conference bridge was started by ITAC sometime after that, and when it was clear to ES and TSS that the problem was CrowdStrike, I was pulled in around 3:45am. At that time, we didn’t have an exact number of affected systems but knew that some areas might have more than half of their systems affected.

How did our UH security team go into action? 
First, we confirmed the problem was with CrowdStrike. That was easy since CrowdStrike themselves were open and upfront about where the problem was and how to fix it. We then compared what CrowdStrike was telling us with what others were seeing and doing. The team probably spent equal time between CrowdStrike’s site and third-party sites confirming that the published fixes were working and learning what issues other people were running into. As the morning went on, Shane focused on identifying potentially affected systems while Will made sure the appropriate college and division ISOs had current information. Our hope was to provide the information needed for critical systems, for the large number of areas affected, and, obviously, for anything that might be needed during Friday classes.

The nature of the problem meant we had a possible fix early on by simply rebooting the box. Reality, of course, meant that in some cases you had to reboot many times, and even then it didn’t always work. We learned later that a reboot had a greater chance of success if you used a wired connection instead of wireless. Fortunately, we had a second fix, a manual process, which worked well. All of this happened so fast that when I joined the conversations at 3:45 in the morning, most of this was known, if not fully understood, and teams across the system were working on getting our Windows boxes updated.
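For context, the manual fix CrowdStrike published involved booting an affected machine into Safe Mode and deleting the faulty channel file before restarting normally. The sketch below is only an illustration of that widely publicized workaround, not UH’s exact procedure; the directory path and the C-00000291*.sys file pattern come from CrowdStrike’s public guidance, and the script assumes it is run with administrator rights from Safe Mode.

```python
# Minimal sketch of the publicly documented CrowdStrike workaround:
# from Safe Mode, remove the faulty channel file and then reboot.
# Illustrative only; adapt and verify against current vendor guidance.
from pathlib import Path

CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
BAD_FILE_PATTERN = "C-00000291*.sys"  # the defective channel file

def remove_bad_channel_files(dry_run: bool = True) -> list[Path]:
    """Find and (optionally) delete the faulty channel files."""
    matches = list(CROWDSTRIKE_DIR.glob(BAD_FILE_PATTERN))
    for path in matches:
        print(f"{'Would delete' if dry_run else 'Deleting'}: {path}")
        if not dry_run:
            path.unlink()
    return matches

if __name__ == "__main__":
    # Run with dry_run=True first to confirm what would be removed.
    remove_bad_channel_files(dry_run=True)
```

Running it first with dry_run=True simply lists the matching files, which mirrors the careful, verify-before-you-delete approach the team describes.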

How long did it take to correct the Microsoft problem? 
First, I have to say that we were fortunate this occurred between semesters and on a Friday. Fewer people using the environment helped our response time.  As to how long? At UH I think we had all critical systems, if not individual servers, up and running by 5am. Some colleges and divisions took longer but I believe we were, with minor exceptions, functional at 8am and fully operational by the end of the day Friday. 

Huge compliments to UIT’s ES for their awesome response. They had the primary servers up and running before people made it into their offices for what ended up being, for many, a normal Friday workday. For others, because of the complexity of the different environments, the updates took longer. But for the most part, we were up and running from an infrastructure point of view by the time people started showing up for work, and mostly fixed with the workstations by the end of the day.

Was the UH system affected financially? 
Absolutely. I’m not able to put an actual dollar amount on the cost. We must consider that many planned projects and much regular work were put on hold for part, or even all, of that Friday. In addition to the monetary toll, there was a mental and physical toll on employees who were on this from the start and didn’t stop helping people and reporting progress until we sent them off to bed after working 16+ hours with no breaks.

I do want to add that based on what I’m seeing in the news, my team’s ability to quickly disseminate information and the skill and flexibility of the various IT departments across the system turned what could have been a nightmare event into a single day of scrambling. 

Some companies are still making corrections – is UH?  
Making corrections? No. For us, this is over. There may be some “missed” computers out there, especially if we’ve got faculty who are traveling or out of the office until next fall, or a missed lab computer somewhere. But we have made reports available of any box that might have been affected, and we’ve shared those around. From a system point of view, this is a closed case.

The reason we could call it closed so quickly is, as I mentioned earlier, our IT departments across the system. They took the information we provided or that CrowdStrike provided, reviewed it carefully, and then got to work. Those areas that followed the instructions were back up and running quickly. ES had something like 50 servers affected at around 4 in the morning. By 8am, they weren’t an issue anymore because ES’s talented people jumped on this.

As far as other companies go, I don’t mean to sound derogatory, but some of what I’ve heard does make me wonder. Understand that in no way does this absolve CrowdStrike from a serious blunder for which they are fully responsible. But when some companies are bouncing back in a few hours and others are still struggling, I think this is more a commentary on the individual IT departments than it is on CrowdStrike. I know I have an awesome team, and Friday proved the talent of IT across the System.