Naturally, the actual question mentioned a specific ISP, but I reworded to
protect the innocent.
And yes, innocent, I do believe they were. While it’s certainly possible, I’d truly
be shocked if any major ISP – and even a few not-so-major ISPs – were silly
enough to put that large a customer base behind a single point of failure.
Heck, I’d be shocked if there were a router powerful enough to single-handedly
handle all of an ISP’s traffic for an entire U.S. state.
However, there are other single points of failure that are much more common,
even though they shouldn’t be, and much more vulnerable than you might think.
I’ll put it this way: never underestimate the power of a backhoe.
You’re quite correct that your ISP should be focusing on what’s called “redundancy”: ensuring that some equipment can break without causing the entire operation to shut down.
At home, our computers all have a single connection to a single router, which in turn has a single connection to our broadband connection – typically a single phone line or cable – which then connects to a single ISP.
Good ISP and data center operators, on the other hand, shudder at that sentence every time they see the word “single” – each represents a single point of failure. If your single router dies, you have no connection to the internet until you replace it or come up with an alternate solution. ISPs don’t have the luxury of waiting “until” anything – they need to handle equipment failure as transparently as possible because, as you’ve seen, when an ISP has a large-scale outage, thousands are affected.
So they typically use at least two. Of everything.
(Two caveats before I go on: 1) I’m grossly over-simplifying; the actual mechanics are sometimes this simple, but often much more complex than I’m outlining here. And 2) I’m assuming a good ISP. This has nothing to do with size or notoriety, and it’s almost impossible to get real information about, but there is overhead and cost associated with redundancy. As a result, some ISPs will make the risk/cost/benefit tradeoffs differently in different areas, sometimes by eliminating some of the redundancy.)
data centers have multiple power sources, uninterruptible power supplies and backup power
computers have multiple drives in RAID arrays
multiple computers handle critical tasks
computers are connected to multiple, often identical, networks
networks are often connected to multiple routers
routers are often connected to multiple internet connections (large data centers often have multiple connections to the internet from different ISPs)
The concept behind all of this redundancy is very simple: any “one” of just about anything can break, and “the other” picks up the load and carries on while the broken component is repaired or replaced.
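That “one breaks, the other carries on” idea can be sketched as a tiny failover loop. To be clear, this is a toy illustration – the link names and functions here are entirely hypothetical, not any ISP’s actual software:

```python
# Toy sketch of active/passive failover: try each redundant link
# in order until one works. Names are hypothetical illustrations.

def send_via(link_name, healthy_links, payload):
    """Pretend to send traffic over one link; fail if it's down."""
    if link_name not in healthy_links:
        raise ConnectionError(f"{link_name} is down")
    return f"sent {payload!r} via {link_name}"

def send_with_failover(links, healthy_links, payload):
    """Try redundant links in order; 'the other' picks up the load."""
    for link in links:
        try:
            return send_via(link, healthy_links, payload)
        except ConnectionError:
            continue  # this link failed; fall through to the next one
    raise ConnectionError("all redundant links are down")

# The primary link has failed; the backup transparently takes over.
print(send_with_failover(["fiber-A", "fiber-B"], {"fiber-B"}, "packet"))
# sent 'packet' via fiber-B
```

Real failover gear works at a very different level (routing protocols, not Python loops), but the decision logic is the same shape: detect the failure, shift the work to the surviving component.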
So, what happened when your ISP failed?
I don’t know, but I’ll hazard a guess, since it’s one of the most common causes of widespread outages.
Remember that backhoe I mentioned?
I’m guessing that somewhere an important cable was cut. Or even two.
Redundant connections to the internet are nice and all that, but if they then both exit your ISP’s data center at the same point, and run down underneath the same street next to each other for a while, a single swipe of the backhoe can take out both at once.
Don’t laugh – it’s happened more often than most would be willing to admit.
I know that Rackspace – the hosting provider at which the Ask Leo! servers are housed – takes care to ensure that their multiple internet connections actually leave from opposite ends of their data center buildings, and then take separate routes as they go to wherever they next connect.
So, an inopportune cable cut is my first guess.
Also, true redundancy isn’t easy – it’s all too easy to overlook a single point of failure.
It’s also easy for a redundancy or backup plan not to work when it’s actually called into service. Perhaps in your case only one of a pair of redundant internet connections was actually cut, but when the entire ISP’s load shifted to the remaining connection, it simply couldn’t handle it. The result is that the first failure “cascades” into the second, shutting down the entire connection.
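The arithmetic behind that cascade is simple enough to put in a few lines. This is a toy model with made-up capacity numbers, purely to illustrate the scenario above:

```python
# Toy model of a cascading failure: two links share the total load,
# but neither one alone can carry all of it. All numbers are made up.

def surviving_capacity(links, cut):
    """Total capacity (in Mbps) left after some links are cut."""
    return sum(cap for name, cap in links.items() if name not in cut)

links = {"fiber-A": 10_000, "fiber-B": 10_000}  # 20,000 Mbps combined
total_load = 12_000  # normal traffic, comfortably under the combined capacity

# A backhoe cuts only ONE of the two "redundant" links...
remaining = surviving_capacity(links, cut={"fiber-A"})

# ...but the survivor can't absorb the whole load, so the first
# failure cascades into a second, total outage.
print(remaining)               # 10000
print(total_load > remaining)  # True -> overloaded
```

The lesson hiding in those made-up numbers: redundancy only protects you if each surviving component can actually carry the full load on its own.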
The good news is that for the most part ISPs and data centers get this stuff.
The bad news is that, yes, as you can probably see, it’s quite difficult.