The IT failure that caused BA to grind to a halt over the May 2017 bank holiday weekend in the UK was, in some ways, a perfect storm.
A power cut apparently caused the airline's IT systems to shut down at the start of one of the busiest travel weekends of the year in the UK. Not only did BA's disaster recovery plan seemingly fail to kick in, but the issues were compounded by poor crisis management, customer handling and PR on the part of the airline.
But what's the best way to deal with such a high-profile IT failure? The answer to that question depends on who you ask and what their priorities are.
In my experience, any organisation in that situation is likely to be faced with an immediate set of competing and conflicting interests. The board, shareholders and PR team will presumably be demanding that the problem be fixed as quickly as possible, almost regardless of the cost and difficulty of doing so. The CIO's team will likely bear the immediate strain of problem identification and rectification – but it may be that they have neither the skills to rectify issues with potentially very old technology nor the bandwidth to do so while at the same time running the rest of the IT systems that are unaffected.
One key aspect is that the CIO's team needs support from the entire senior management team and not to be scapegoated or left to face the wrath of customers alone.
Fixing any IT failure is unlikely to be as simple as merely flipping a switch and engaging a parallel hot standby system. Depending upon the type of error and the organisation's IT architecture, the CIO's team may need to deal with multiple external organisations, some of which – as apparently in BA's case – are outsourced providers in offshore locations. Those external organisations will all be thinking less about the affected end-users or customers than about ensuring that the ultimate blame (and legal compensation claims) don't end up at their doors.
Clearly, the affected organisation won't immediately rush to check the terms of its IT supply chain contracts (although it usually doesn't take long for that to happen), but it needs to incentivise its external IT suppliers to respond promptly, with the goal of restoring the affected IT system as quickly as possible.
Externally, the pressure will be on any customer-facing organisation with an IT failure to explain as clearly as possible what has caused the problem. But that may not be at all clear: the complexity of most IT systems makes it very difficult to pin down an exact cause.
So in the immediate aftermath of a failure, the pressure will be on both to rectify the situation and to investigate the cause. One might assume that the obvious way to do this is simply to use the existing internal IT team. However, that team may not have the requisite skills and may itself be conflicted, wanting to deflect blame away from past deficiencies in the IT organisation. It's human nature not to want to accept blame for a failure. So, in my experience, many organisations will seek to strike a balance between using the existing team and bringing in additional, impartial external resources.
The in-house lawyers will probably want to make certain that external resources are engaged by the legal team rather than directly by the IT team – because any findings that are unearthed or reports produced would then benefit from legal privilege. I can think of a number of cases where organisations have engaged external experts to help respond to an IT failure and produce a report for the board, but the report has ended up being unusable at best (and, at worst, damaging). Although it identifies external failures which could have grounded a compensation claim against a third-party vendor, the report also contains lots of damaging admissions about the client's own IT or administrative failures.
Striking a balance between rapid response, rectification and investigation is important. Boards also need to focus on who is best placed to strike it. Clearly the response needs to be led by the CIO and IT team but, more than ever, the CIO needs support from senior levels of the board. The investigation and response themselves need to be treated as a standalone business risk project.
Any organisation suffering an IT failure needs to find a way to restore both the IT systems and its customers' confidence – and, crucially, that needs to happen quickly. A fundamental issue with many IT failure response projects is that a fix is found quickly and applied but then the aftershocks rumble on for years. One only has to think of the NHS Connecting for Health IT project, which was one of the most poorly executed IT projects in the UK – and in respect of which the legal consequences continue to be dealt with many years after the NHS wisely pulled the plug on the project itself.