What happened?

On Tuesday June 8th 2021, at around 10am GMT, dozens of the biggest websites in the world went down at once. The outage affected a huge variety of sites, from PayPal to Shopify and Vimeo to HBO Max, as well as the UK Government website and the New York Times. Sites were down for up to an hour as software engineers at the cloud computing provider Fastly worked furiously to identify and fix the problem.

What caused the outage?

Despite the global scale of the issue, this was not an organised cyber-attack or a major system meltdown. In fact, the problem was traced back to a single customer who had simply changed their Fastly settings. This innocent and perfectly valid change exposed a bug in a recent version of Fastly's own software, causing around 85% of the Fastly network to return errors.

A technical glitch

The problem arose because of the way the Fastly network is constructed. Fastly takes an edge cloud approach, which aims to cut the time it takes data to travel across its network. It does this using what are known as points of presence (POPs): localised distribution points for content, placed close to the end user. Because the same software runs across these POPs, a single bug was able to cause issues worldwide.
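For readers who like to see the idea in code, here is a minimal sketch of how a request might be sent to the nearest POP. It is an illustration only, not Fastly's actual routing logic: the POP names, the coordinates, the `nearest_pop` function and the `healthy` parameter are all assumptions made up for this example.

```python
import math

# Hypothetical POP locations (name -> (latitude, longitude)).
# These are illustrative and are not Fastly's real POP list.
POPS = {
    "london": (51.5074, -0.1278),
    "new_york": (40.7128, -74.0060),
    "tokyo": (35.6762, 139.6503),
    "sydney": (-33.8688, 151.2093),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_pop(user_location, pops=POPS, healthy=None):
    """Return the name of the closest healthy POP, or None if none is healthy.

    `healthy` is an optional set of POP names that are currently serving
    traffic; when omitted, every POP is assumed to be healthy.
    """
    candidates = [name for name in pops if healthy is None or name in healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda name: haversine_km(user_location, pops[name]))


paris = (48.8566, 2.3522)
print(nearest_pop(paris))                     # closest POP overall
print(nearest_pop(paris, healthy={"tokyo"}))  # falls back to a POP further away
print(nearest_pop(paris, healthy=set()))      # nothing left to serve the request
```

The last call shows why the outage was so widespread: spreading content across many locations only helps if a fault stays local. When the same bug affects most POPs at once, there is nowhere left to fall back to.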

All fixed fast

Fortunately, the Fastly engineers were quick to react and had 95% of their network operating as normal in just 49 minutes. While this may be impressive, Fastly were still embarrassed and concerned that the outage had occurred in the first place. They promised a post mortem to figure out why they didn’t detect the bug during their software quality assurance and testing processes.

A bigger issue

Under an hour offline may not seem that important to most people. Having to delay your online shopping, or not being able to use the Twitter emojis, as was the case during the blackout, is inconvenient but hardly catastrophic. For some companies, however, even a short time offline can be very costly, with analysts estimating that the larger businesses may have lost up to $250,000. This could leave Fastly facing significant legal claims from customers who were let down by their services.

The bigger picture

The problem is that one user bringing down huge swathes of the internet is more than just a curious story. It speaks to the way that the whole of the web is structured, and how that structure leaves us all vulnerable to a repeat of this outage, and perhaps much worse. As with so many areas of business these days, the internet and its infrastructure are controlled by just a handful of companies. This not only leaves it vulnerable to failures like the Fastly incident, but it also means that those failures are having a wider and wider impact.

Increasing vulnerability

The advent of cloud computing, where major organisations store and process data remotely via the internet, makes the whole ecosystem larger every year. Complex architecture, such as edge cloud computing and POP distribution, may make our connections faster, but the more complex a system is, the more there is to go wrong. Perhaps outages are something we just have to get used to. They may simply be the price we pay for a superfast, always-on internet connection to the world. But that doesn’t make it any easier when your favourite website disappears for an hour for no apparent reason.