The Great Wifi Outage of 2017

Nicholas Reichert '20, Staff Writer

On Sunday, September 17, 2017, four of our servers failed overnight, causing all Salesianum students and faculty to lose wifi access for several days. On October 5th, I sat down with members of the IT Department to discuss the recent wifi issues. They explained the cause of the problem, how it was fixed, and what students can do to get the most out of their wifi. The interview, edited for length and clarity, appears below.

When did the wifi issues begin?

Mr. Reichert: They began late Sunday night [September 17th]. We started receiving emails, and Mr. Matarese received alerts that one server was having an issue. He came in at around 9:30 or so that evening and stayed until 1, working and opening tickets with the company that manufactures that product and through which we have licensing. Coming back in at 7, we realized that it was affecting two other [servers].

Mr. Matarese: Yeah. To understand the problem, you need to understand our infrastructure. We have one physical unit that comprises our server infrastructure. That is then divided up into four blades. Each blade has two processors, RAM, everything that you would consider part of a regular computer. All this architecture is pushed into one unit, which allows for balance. Ultimately, two of those blades failed, and virtually all of our online activities failed along with the blades.

That night, did the failures affect all wifi in the school, or was it just a small issue?

Mr. Matarese: That night, the only problem people were having was authenticating email. … But later the next day, the thing that tipped us off to this being a big issue was our VPN services, which allow facilities to access resources off-site. They were not able to get through, which indicated that there was a bigger problem going on.

Was the problem limited to our internal servers, or was it a widespread issue?

Mr. Matarese: It happened on our side. There wasn’t another outside service that failed. But Office 365, for example, which is an outside service, failed. That was due to our system failing, not theirs.

Mr. Reichert: And that’s also true of students and teachers connecting to our wireless network. It’s all about how we ran access to that. So when you sign in with your password, it has to say “this is a valid password”, and that was what failed. Our access points were fine, and the same is true with email. There was no content lost, but anyone on the Salesianum domain was unable to send or receive emails.

How and when were the problems resolved?

Mr. Reichert: There is a short-term fix and a long-term fix. The first service restored was email, on Tuesday night, when Mr. Matarese unfederated our domain. This changed the way we receive emails, but it allowed everyone’s email to work correctly again. Some saw the fix instantaneously, and for others, it took a few hours. Then, the following morning, around 11:25 AM, everybody regained access to the wireless network. I went around and told all the teachers, and word spread pretty quickly. Mr. Matarese can explain how.

Mr. Matarese: So the reason we got wifi back that day was that there was a rule in our firewall that was trying to redirect traffic to a server that was affected by the outage. So in order to restore wifi, we directed the rule at a separate server that performed the same operations. It’s just that traffic was usually not routed that way. So by applying that change in how we are routing traffic, we were able to get it to the point where everybody was able to get from the “inside” to the “outside”.

In the future, could we encounter more situations like this one?

Mr. Matarese: So this goes along with what Mr. Reichert said about a “short-term fix”. We are up and running, but we had to bring in more equipment to get here. So what we’ve done is moved to a different solution. As I mentioned before, we have the one box with four servers in it. We are now in the process of replacing it with three separate servers, so that in the event of another server failure, the two extra will pick up the load.

One of the other things that we have is a “Next Day Support” contract, which means that even if I am in here at 1 AM and I put in a ticket, by 10 AM the next morning, we will have a response from the company that makes the hardware, whether that means they ship a part to us or send a technician. To say that we will have a 100% on-time response is impossible, but we will not have extended downtimes, which is what we’re moving away from. Things may happen where there’s an outage here, an outage there, but the idea is to minimize that. It should be something we can fix in a matter of hours or minutes, rather than days.

Is there anything students can do to help minimize further issues?

Mr. Reichert: The biggest myth we’ve been fighting is that the wifi is bad. I made up the posters over the summer, and I would like to see students follow those five steps before coming to us. The second thing is that students need to understand these basic things. You need to type in the right website and be connected to the right wifi; these basic things improve everyone’s connectivity. Students also need to understand the variety of variables, that wifi issues can be caused by anything. The biggest thing in our building is when a student goes from, say, math to social studies. That student would usually remain connected to the access point in their math classroom, so when they get to their social studies class, they need to turn off their wifi, turn it back on, and renew the lease. Those are the types of 5-10 second fixes that eliminate the notion that the wifi is bad. The beauty is that we have over 100 access points throughout the building, and the company we use is one of the premier companies in the country, so there should be no reason that students cannot connect if they follow these steps.