A bit over a week ago, British Airways had to cancel hundreds of flights and strand tens of thousands of passengers after an IT meltdown… or at least that’s how most characterized it. Many criticized British Airways for their communication following the incident, though after the fact I’d say they were quite generous to elite members, as they extended elite status by two years for those impacted by the situation.
British Airways claimed from the beginning that this incident occurred due to a “power issue,” which sounds similar to Delta’s meltdown last August, which also impacted tens of thousands of customers.
What’s interesting is how the CEO of IAG (the parent company of British Airways), Willie Walsh, is referring to the event. He claims that people aren’t fairly characterizing what happened. According to him, British Airways didn’t have a computer or IT meltdown, and it’s not fair to characterize it as such. Instead it was just a power issue. Per Air Transport World:
During a press conference at the IATA AGM in Cancun, Walsh, who is chairman of the IATA board of governors, was asked about the “computer meltdown” at BA and immediately responded, “It wasn’t a computer meltdown.” He suggested the media had mischaracterized the incident.
Power to a BA data center was improperly disconnected, he explained. The problem could have been solved “within a couple of hours,” Walsh said, but the power was restarted in an “uncontrolled” manner that caused serious damage to BA’s servers, resulting in all of BA’s systems shutting down and aircraft having to be grounded across its network.
“It was not a failure of IT,” Walsh said. “It was a failure of power.” He called the episode an “extremely rare event … There are incidents from time to time that are damaging to our reputation, but we recover from these.”
Now, I’ll be the first to admit that technology isn’t my strong suit. However, if a single power outage can shut down an entire airline, isn’t that an IT failure?
It's an IT failure. I say that as someone currently working in IT... and I'm sure my entire IT budget is less than what BA probably pays a single DBA.
At this office, IT is responsible for everything which might potentially affect IT, so power, generators, air conditioning, even roofing falls in our lap. We've met with management, thrashed out what is acceptable vs. unacceptable downtime relative to expense, and came to an agreement on what both sides feel is reasonable. Even when the power grid blew up last year, setting an adjacent building on fire and damaging switchgear, our racks kept on running. We've lost a few servers, but we're usually back up to full capacity within 10 minutes, which is the design. Then again, we are running 20-year-old servers (no joke) in some cases.
In a previous life, I did broadcast engineering. It was pounded into me at my first broadcast job that "everything MUST run 24/7, even during the middle of a Category 5 hurricane." That was our edict, and we were given the appropriate funds to make it happen. I say appropriate, as in just enough; we really didn't go hog-wild with this. About the most extravagant thing we ordered were stainless-steel flood barriers for the doors & windows on the first floor. We never got to test them against a Cat 5, but the facilities withstood multiple Cat 3 / Cat 4 hurricanes without missing a beat, and the flood barriers proved to be a wise expenditure.
I say this to prove that it *can* be done and done on the cheap. BUT you have to design it this way from the ground-up.
I also want to say that no outsourced firm will take care of you as well as your own in-house staff will. The outsourced firms may indeed have more experienced people and more resources at-hand, but in an emergency, you're just another customer. The outsourced helpdesk operator has no skin in the game. It's not THEIR business and THEIR data on the line.
C-Level idiots! If you have ever worked on servers like Dell's, you know there is a configuration setting that will automatically restart the server when electrical power becomes available (there's a sketch of that setting after this comment). This is not the default setting; the default is that the "on/off" button on the server has to be pushed. So someone in British Airways' IT department deliberately configured their servers to restart automatically when power was restored. They did this configuration as a policy, over years. I am not blaming BA IT for that one; it is standard practice in the "computer biz".
The servers were probably not damaged. Some databases likely took a hit. What probably did not come back correctly when power was restored was the network routing. The switching / routing certainly became very confused about what was and was not online as power was being restored.
Now, here is the fundamental cause. To test system failover, you have to invest a lot of money in duplicate systems and personnel redundancy. After 30 years running IT systems, I have yet to see C-Level people spend that kind of money in any major company. Couple that with IT's tendency to test for "success" instead of "failure", and you end up with major system failures and C-Level idiots running their mouths off about why it was not their fault.
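For readers curious what that auto-restart setting actually looks like, here's a minimal sketch, assuming a server whose management controller speaks the DMTF Redfish API and exposes the standard PowerRestorePolicy property; the BMC address, credentials, and system path below are placeholders, not anything from BA's environment.

```python
import requests

# Hypothetical BMC address, credentials, and system path -- placeholders only.
BMC_URL = "https://bmc.example.internal"
AUTH = ("admin", "changeme")
SYSTEM_URL = f"{BMC_URL}/redfish/v1/Systems/1"  # system ID varies by vendor

# Read the current behaviour for when mains power returns.
system = requests.get(SYSTEM_URL, auth=AUTH, verify=False).json()
print("Current PowerRestorePolicy:", system.get("PowerRestorePolicy"))

# "AlwaysOn" powers the box back up as soon as electricity returns -- the
# behaviour described above. "AlwaysOff" leaves it waiting for a deliberate,
# sequenced power-on instead.
requests.patch(
    SYSTEM_URL,
    json={"PowerRestorePolicy": "AlwaysOff"},
    auth=AUTH,
    verify=False,
)
```

In a large fleet that setting would typically be pushed out by configuration-management tooling rather than touched by hand, which is consistent with the commenter's point that it's applied as a policy over years.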
How BA characterizes the event may have more to do with indemnity claims they've asserted against the power company or their business interruption insurers than claims asserted by passengers against BA. If that's the case, you'd certainly want the CEO to be consistent with his public statements.
It's worth noting that an airline's systems are slightly more complex than whacking a $50 UPS on your desktop at home.
From what I've read on various tech sites, it would appear that the cause was a data centre employee trying to override an automated UPS/generator cutover. The result was that double the standard voltage was likely sent down the line to the customer (BA) equipment, causing physical hardware damage.
The real issue is the lack of redundancy. BA supposedly have two additional data centres, and you would expect at least one of them to be a live DR site ready to spin up (roughly the kind of arrangement sketched below).
Another interesting point was the complexity of BA's systems. Rather than a somewhat simplified, unified system, BA reportedly had upwards of 250 different systems integrating with each other to perform different functions.
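On that "live DR site" point, here is a minimal sketch of what an automated health check and failover decision could look like; the site URLs, thresholds, and health endpoint are invented for illustration and say nothing about how BA's data centres are actually wired together.

```python
import time
import urllib.request

# Hypothetical sites -- placeholders, not BA's real data centres.
PRIMARY_HEALTH = "https://primary.example.internal/health"
STANDBY_HEALTH = "https://standby.example.internal/health"


def healthy(url: str, timeout: float = 2.0) -> bool:
    """Treat an HTTP 200 within the timeout as 'site is up'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def monitor(poll_seconds: int = 10, failures_needed: int = 3) -> None:
    """Promote the standby only after several consecutive failed probes,
    so a single flaky check doesn't trigger an unnecessary failover."""
    consecutive_failures = 0
    while True:
        if healthy(PRIMARY_HEALTH):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_needed and healthy(STANDBY_HEALTH):
                print("Primary unreachable -- pointing traffic at the standby site")
                break  # a real system would update DNS or load balancers here
        time.sleep(poll_seconds)


if __name__ == "__main__":
    monitor()
```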
I find it difficult to believe that it was a simple power failure. They have backup power in place; just about every aspect of their operation does. Accidentally pulling the plug from its socket is pretty much the most basic disaster recovery scenario, one that is nearly impossible to overlook. So no, I don't buy the CEO's response. I'm pretty sure, however, that with his track record of saving money and outsourcing he cannot continue to advocate outsourcing as a brilliant business move if these sorts of catastrophes occur. So naturally, do some alt-facts and say no, it wasn't outsourcing, someone just pulled the plug and none of our contingency systems kicked in... yeah, right.
It's a management and IT failure.
A properly designed data center first has battery backup for instant coverage of a power loss, enough to power the entire data center for minutes, and then one or more generators (typically diesel) to carry the entire data center for a period of time (typically 6-8 hours, with refuelable tanks, though that depends on how paranoid the entity is). If he's saying that they had none of that in place (which is what his words suggest), then that's a failure of management, probably not listening to IT.
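As a rough back-of-the-envelope illustration of that battery-then-generator arrangement (all of the load, battery, and fuel numbers below are invented for the example, not BA's), the sums look something like this:

```python
# Invented example figures -- not BA's actual load or equipment.
it_load_kw = 500            # total IT load the room draws
ups_capacity_kwh = 100      # usable energy stored in the UPS batteries
generator_fuel_l = 900      # diesel in the on-site tank
burn_rate_l_per_hour = 120  # generator consumption at this load

# The UPS only has to bridge the gap until the generator starts and stabilises.
ups_runtime_minutes = ups_capacity_kwh / it_load_kw * 60
generator_runtime_hours = generator_fuel_l / burn_rate_l_per_hour

print(f"UPS bridge time:   {ups_runtime_minutes:.0f} minutes")
print(f"Generator runtime: {generator_runtime_hours:.1f} hours, then refuel")
```

With these made-up figures the batteries cover about 12 minutes and the generator roughly 7.5 hours, which is the scale of protection the comment describes.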
You are totally correct; power should be more decentralized and backup generators should be in place in order to prevent such an issue from happening. I have a small IT background and even in my own home I have systems in place to prevent critical documents from being lost in the event of a power outage. Ever hear of an uninterruptible power supply?
He's right - it's not an IT failure. It's a management failure.
Failing over to a disaster recovery datacenter would have prevented the whole outage. Evidently BA haven't invested in backup facilities.
In plain English, simply, Alex Cruz cut corners and risked lives. He's a failed CEO and should be replaced with someone who at least has the desire to run a world-class airline. Oust Walsh, too. Splitting hairs after this sort of failure is unbecoming.
Well said @troy. We exercise/simulate disasters in our data centers. We actually pull the plug, yes, pull the plug on servers, just as if someone cut the power line to the building and backup power were not available. Or we yank, yes, yank disk drives out by their handles from disk arrays. We monitor the automatic fail-over to other server locations when we pull the plug on a server, and we monitor whether the disk array continues functioning, even with less internal redundancy due to the missing drive (a toy version of this kind of drill is sketched below).
Fail-over technology, whether for servers or for disk arrays, is fascinating but pretty old stuff. It is the IT department's job to protect customer transactions and interactions from disasters "external" to IT, be it human stupidity, flood, fire, or earthquake. End of story.
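To make the "pull the plug" drill concrete, here's a minimal, self-contained sketch of an automated failure-injection test; the Site class and its methods are toy stand-ins for whatever real orchestration and monitoring tooling a data centre would actually use.

```python
import time

# Toy stand-ins for real infrastructure tooling -- purely illustrative.
class Site:
    def __init__(self, name: str) -> None:
        self.name = name
        self.powered = True

    def cut_power(self) -> None:
        """The 'pull the plug' step of the drill."""
        self.powered = False

    def serving_traffic(self) -> bool:
        return self.powered


def failover(primary: Site, standby: Site) -> Site:
    """Route traffic to whichever site is still powered."""
    return primary if primary.serving_traffic() else standby


def test_power_cut_fails_over_within_budget() -> None:
    primary, standby = Site("primary"), Site("standby")
    start = time.monotonic()

    primary.cut_power()                      # inject the failure on purpose
    active = failover(primary, standby)
    elapsed = time.monotonic() - start

    assert active is standby, "traffic did not move to the standby site"
    assert elapsed < 1.0, "failover took longer than the drill's time budget"


if __name__ == "__main__":
    test_power_cut_fails_over_within_budget()
    print("failure drill passed")
```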
Lol. A typical response to spin the (already) circulated assumption. Probably to shift the issue from a "beyond reasonable doubt" position to a "doubtful cause" one. Maybe he is trying to haggle over some clause, or maybe it's for insurance...
My initial reaction was: you have to be joking, of course it is.
Thinking it through though, maybe not so much an IT failure as a business continuity failure.
Power is somewhat of an external factor to IT; we can take steps to prevent it being an issue, but ultimately shit happens!
There need to be steps in place to work through them.
The closest analogy would be something like a bird strike shutting down an engine. Was it an engine problem or a bird problem?
So, it doesn't sound like the "uncontrollable event" they claimed allowed them to escape EU compensation rules. BA should be paying the full 600 euros in compensation owed to passengers, and this proves it.
A data center of that size would be seriously hindered in the event of a hard shutdown. Servers need to be shut down gracefully and sequentially. Likewise, when power comes back up, the servers need to be brought back online gracefully and sequentially. If this sequence is disrupted, a data center of that size will be out of whack for days.
Normally, they'll have one (or many) uninterruptible power supply (UPS) units, so that in the event of a "power failure" they'll have the time and power to get the center shut down gracefully. Then, once normal power resumes, they can come back online gracefully. It's this second part that I'd consider an IT failure.
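Here's a minimal sketch of what "gracefully and sequentially" might mean in practice; the tier and service names are invented for illustration, and a real operation would drive this from its own orchestration tooling rather than a script like this.

```python
# Invented service tiers -- shut down user-facing layers first, bring the
# foundations back first. None of these names reflect BA's real systems.
SHUTDOWN_ORDER = [
    ["web-frontends", "check-in-kiosks"],    # user-facing layer first
    ["booking-app", "baggage-app"],          # application tier
    ["message-queues"],                      # middleware
    ["databases", "storage-arrays"],         # state goes down last
]


def shut_down(service: str) -> None:
    print(f"stopping {service} cleanly (flush writes, close connections)")


def start_up(service: str) -> None:
    print(f"starting {service} and waiting for its health check to pass")


def graceful_shutdown() -> None:
    for tier in SHUTDOWN_ORDER:
        for service in tier:
            shut_down(service)


def graceful_startup() -> None:
    # Reverse the order: storage and middleware must be healthy before apps.
    for tier in reversed(SHUTDOWN_ORDER):
        for service in tier:
            start_up(service)


if __name__ == "__main__":
    graceful_shutdown()   # on UPS power, before the batteries run out
    graceful_startup()    # once stable mains or generator power returns
```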
I wonder if the European passenger rights legislation has been challenged over a (third party) power failure and airlines not paying compensation.
Semantics.
If your Information Technology (IT) organization can't support mission-critical services during a temporary power failure, and worse, can't properly return to normal function when the power is restored without causing damage to infrastructure, then yeah, I would call that an IT failure. A massive one.
My sense is he is trying to split hairs, saying it wasn't a code issue with an update gone bad and that they weren't hacked. That may be true, but they certainly seem to be trying to obfuscate the actual cause for some reason.