By Yogi Chandiramani, VP of Solutions Engineering, EMEA at ThousandEyes.
Internet banking has transformed the way consumers engage with their financial providers and spawned a new breed of financial services (FS) companies that are reimagining the banking and payments industry. Internet outages are disruptive to any business, but for financial services, where trust is pivotal, the financial losses and reputational damage can be particularly severe. In fact, the average cost of downtime for FS organisations is up to £4,400 per minute.
With a business model reliant on the Internet, the task of monitoring and managing outage events moves beyond the confines of the FS firm’s network, where visibility is easily achievable, to external cloud and Internet networks. For many FS organisations, however, the Internet is a “black box”, and when disruptive events occur, IT and digital operations teams are often unable to identify the source or respond effectively. By knowing how and where outages can occur, FS businesses can better arm themselves, significantly reduce resolution times and communicate with customers more effectively. Here are some of the most common causes of large-scale outages to look out for.
When Internet Service Providers Fail
The Internet is made up of thousands of autonomous networks that are interdependent on one another. A frequent cause of outages is infrastructure failure among Internet Service Providers (ISPs). ISPs provide transport of Internet traffic on behalf of individuals, companies and other ISPs. An infrastructure outage caused by a faulty router or fibre cut can impact an FS business’s ability to connect to services – even if the business has no direct relationship with the affected ISP. To prepare for ISP outages, FS firms need to know which networks their traffic traverses, which requires modern monitoring options beyond the confines of their own four digital walls.
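The idea of knowing which networks traffic traverses can be sketched as follows. This is a minimal illustration, not a monitoring product: the hop-to-ASN lookup table and IP addresses are hypothetical stand-ins for what would, in practice, come from traceroute probes combined with an IP-to-ASN database.

```python
# Sketch: summarise which provider networks (ASNs) a monitored path traverses.
# HOP_TO_ASN is an illustrative static lookup; real tooling would derive it
# from traceroute results plus an IP-to-ASN data source.

HOP_TO_ASN = {  # hypothetical mapping of hop IPs to autonomous systems
    "192.0.2.1":    ("AS64500", "Example ISP A"),
    "198.51.100.7": ("AS64501", "Example Transit B"),
    "203.0.113.9":  ("AS64502", "Example Cloud C"),
}

def asn_path(hops):
    """Collapse a list of hop IPs into the ordered, de-duplicated ASN path."""
    path = []
    for ip in hops:
        asn = HOP_TO_ASN.get(ip, ("unknown", "unmapped hop"))
        if not path or path[-1] != asn:
            path.append(asn)
    return path

# Simulated traceroute: consecutive hops inside the same network collapse.
trace = ["192.0.2.1", "198.51.100.7", "198.51.100.7", "203.0.113.9"]
for asn, name in asn_path(trace):
    print(asn, name)
```

A summary like this makes it immediately clear which third-party network an outage sits in, even when the FS firm has no direct relationship with that provider.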
The Power and Complexity of the Cloud
As both traditional banks and new fintechs continue to harness the power and agility of cloud services, migration also introduces new vulnerabilities. Moving critical applications and services to the cloud means IT teams no longer have to worry about building and maintaining infrastructure. At the same time, however, cloud computing introduces heightened unpredictability due to the complexity of Internet and cloud connectivity.
That said, most cloud vendors have redundancy measures in place to mitigate the impact of outages on customers. For example, in these past few months during the coronavirus pandemic, cloud providers have fared well in response to increased Internet usage, recording fewer outages compared to ISPs. However, downtime caused by a cloud outage can still have a significant, large-scale impact. Last year, for example, Google Cloud suffered a 4-hour outage that impacted shopping services like Shopify. For FS businesses, mitigating this risk involves ensuring that cloud architecture has sufficient resilience measures – whether on a multi-regional or multi-cloud basis.
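One common multi-regional resilience pattern is failover based on health checks. The sketch below is purely illustrative, with hard-coded region names and statuses standing in for real endpoint probes:

```python
# Sketch: multi-region failover selection. Region names and health statuses
# are hypothetical; real checks would probe each region's endpoints.

REGION_PRIORITY = ["eu-west", "eu-central", "us-east"]  # preferred order

def pick_region(health):
    """Return the highest-priority region reported healthy, else None."""
    for region in REGION_PRIORITY:
        if health.get(region) == "healthy":
            return region
    return None

# Simulated outage in the primary region: traffic falls back to the next one.
status = {"eu-west": "down", "eu-central": "healthy", "us-east": "healthy"}
print(pick_region(status))
```

The design choice here is that failover is driven by externally observed health, not by provider status pages, which often lag behind real outages.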
The Risk of Malicious Intent
For reasons ranging from hacktivism to commercial competition, a common type of reachability outage is caused by a Distributed Denial-of-Service (DDoS) attack. This is when hackers deliberately take a service offline, or deny legitimate users access to it, by overwhelming it with a large number of simultaneous requests. According to recent research, FS organisations have experienced a significant increase in DDoS attacks over the past three years, with DDoS being the second biggest threat behind credential stuffing. While DDoS events are an unfortunate reality of operating on the Internet, it’s important to have visibility into the scope, impact and behaviour of these events and be able to validate that DDoS mitigation steps are effective.
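The "overwhelming number of requests" pattern can be made concrete with a toy rate check. This is a deliberately naive sketch, with a hypothetical threshold and a simulated request log; real DDoS detection and mitigation use far richer signals and typically upstream scrubbing services.

```python
# Sketch: flag sources whose request volume in a time window exceeds a
# threshold - a crude proxy for one signal in volumetric DDoS detection.
# THRESHOLD and the request log are illustrative, not operational values.

from collections import Counter

THRESHOLD = 100  # hypothetical max requests per source per window

def suspicious_sources(request_log):
    """Return source IPs whose request count in the window exceeds THRESHOLD."""
    counts = Counter(src for src, _path in request_log)
    return sorted(ip for ip, n in counts.items() if n > THRESHOLD)

# Simulated window: one source floods /login, another behaves normally.
log = [("203.0.113.50", "/login")] * 150 + [("198.51.100.2", "/")] * 5
print(suspicious_sources(log))
```

Even this toy version shows why visibility matters: validating that mitigation works requires being able to measure the attack traffic in the first place.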
Two other deliberate reachability outage types are Domain Name System (DNS) hijacking and cache poisoning. DNS works as the phonebook of the Internet, translating human-readable domain names, such as a banking web page’s address, into the IP addresses that route web traffic to the intended destination. DNS hijacking and cache poisoning share a similar end goal – to draw traffic away from legitimate servers and redirect users to a fake one under a cybercriminal’s control. Both types of attack can easily lead to users unintentionally installing malware or entering their personal and financial details into a hacker’s website, thinking it’s a legitimate one. To keep abreast of any malicious attempts, FS businesses can employ DNS server and trace tests to continuously monitor the state and availability of their DNS records.
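The core of such a DNS record check can be sketched as a comparison of resolved addresses against a known-good allow-list. The resolver below is a stand-in dictionary and all names and addresses are hypothetical; a real test would query both authoritative and recursive servers (for example with a library such as dnspython).

```python
# Sketch: detect a possible DNS hijack by comparing resolved A records
# against an expected allow-list. Domain and IPs are illustrative only.

EXPECTED_A = {"bank.example.com": {"192.0.2.10", "192.0.2.11"}}  # known-good IPs

def check_records(domain, resolved_ips):
    """Return the set of unexpected IPs (empty means the records look clean)."""
    return set(resolved_ips) - EXPECTED_A.get(domain, set())

clean   = check_records("bank.example.com", ["192.0.2.10"])
tainted = check_records("bank.example.com", ["192.0.2.10", "203.0.113.66"])
print(clean)    # empty set: resolution matches expectations
print(tainted)  # unexpected address worth alerting on
```

Running such a check continuously, from multiple vantage points, is what turns a silent redirection into an actionable alert.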
Hijacking Internet Traffic
The Border Gateway Protocol (BGP) is a critical component of Internet routing. When a user wants to access a website or service, their traffic flows from “Point A” to “Point B” through a chain of service providers and third-party vendors that route the traffic to its intended destination. Fundamentally, the order of the transit is built on trust between the entities involved, and in a BGP route hijack scenario, a malicious actor takes advantage of this trust-based system by diverting traffic from its intended destination to an illegitimate one. In 2018, the popular crypto wallet app MyEtherWallet was attacked by means of a BGP hijack, resulting in hackers gaining access to users’ accounts. BGP incidents are still on the rise, and any company offering online digital services needs to have proper monitoring processes in place to ensure traffic reachability.
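One simple form of BGP monitoring is origin validation: watching route announcements for a prefix and alerting when the origin AS differs from the expected one. The feed below is simulated and every prefix and AS number is hypothetical; real monitoring would consume a public route collector feed (such as RIPE RIS or RouteViews) or a commercial service.

```python
# Sketch: detect a possible BGP origin hijack by checking observed route
# announcements against the expected origin AS for each prefix.
# All prefixes and AS numbers are illustrative.

EXPECTED_ORIGIN = {"192.0.2.0/24": 64500}  # hypothetical prefix -> legit origin AS

def hijack_alerts(announcements):
    """Return (prefix, bad_origin) pairs whose origin AS differs from expected."""
    alerts = []
    for prefix, as_path in announcements:
        origin = as_path[-1]  # the origin AS sits at the end of the AS path
        expected = EXPECTED_ORIGIN.get(prefix)
        if expected is not None and origin != expected:
            alerts.append((prefix, origin))
    return alerts

feed = [
    ("192.0.2.0/24", [64510, 64500]),  # legitimate announcement
    ("192.0.2.0/24", [64511, 64666]),  # unexpected origin: possible hijack
]
print(hijack_alerts(feed))
```

In the MyEtherWallet incident, it was exactly this kind of unexpected origin announcement that diverted user traffic to attacker-controlled infrastructure.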
An Ongoing Constant: Human Error
Lastly, it’s important to acknowledge that sometimes even human error can cause major outages across networks and applications. An internal mistake, like inadvertently taking servers offline, can manifest as network packet loss or lack of service availability for end users. These instances serve as a reminder that no matter how much automation is in place, there is always a chance of operator error.
The unpredictability of the Internet means that outages of all kinds are inevitable.
As cloud migration and digital transformation continue to change the way that FS companies do business, delivering a best-in-class user experience will be dependent on multiple third parties as well as internal systems. For FS organisations, resolving dreaded downtime issues as quickly as possible means implementing network monitoring systems to inform recovery plans and provide a holistic view of Internet performance as well as dependencies across websites, applications, and services. Having an adequate level of visibility allows IT teams to focus on fixing the outage, rather than spending valuable time trying to locate where – across the vastness of today’s modern networks – it took place to begin with.