Last week my ISP had a statewide outage. Is it really true that ALL of this
ISP's internet in my state goes through ONE and only one router? Shouldn't
there be redundant routers in case one fails?
Naturally, the actual question mentioned a specific ISP, but I've reworded it
to protect the innocent.
And yes, innocent, I do believe they were. While it's certainly possible, I'd
truly be shocked if any major ISP - or even most not-so-major ISPs - were silly
enough to put that large a customer base behind a single point of failure.
Heck, I'd be shocked if there were a router powerful enough to single-handedly
handle all of the ISP's traffic for a single U.S. state.
However, there are other single points of failure that are much more common,
even though they shouldn't be, and much more vulnerable than you might
think.
I'll put it this way: never underestimate the power of a backhoe.
You're quite correct: your ISP should be focusing on what's called "redundancy" to ensure that some equipment can break without causing the entire operation to shut down.
At home, our computers all have a single connection to a single router, which in turn has a single connection to our broadband connection - typically a single phone line or cable - which then connects to a single ISP.
Good ISP and data center operators, on the other hand, shudder at that sentence every time they see the word "single" - each one represents a single point of failure. If your single router dies, then you have no connection to the internet until you replace it or come up with an alternate solution. ISPs don't have the luxury of waiting "until" anything - they need to be able to handle equipment failure as transparently as possible, because, as you've seen, if an ISP has a large-scale outage, thousands are affected.
So they typically use at least two. Of everything.
(Two caveats before I go on: 1) I'm grossly over-simplifying; the actual mechanics of this are sometimes this simple, but often much more complex than I'm outlining here. And 2) I'm assuming a good ISP. This has nothing to do with size or notoriety, and it's almost impossible to get real information about, but there is overhead and cost associated with redundancy. As a result, some ISPs will make the risk/cost/benefit tradeoffs differently in different areas, sometimes by eliminating some of the redundancy.)
- data centers have multiple power sources, uninterruptible power supplies, and backup power
- computers have multiple drives in RAID arrays
- multiple computers handle critical tasks
- computers are connected to multiple, often identical, networks
- networks are often connected to multiple routers
- routers are often connected to multiple internet connections (large data centers often have multiple connections to the internet from different ISPs)
The concept behind all of this redundancy is very simple: any "one" of just about anything can break, and "the other" picks up the load and carries on while the broken component is repaired or replaced.
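To make that idea a little more concrete, here's a minimal sketch of the failover logic in Python. It's purely illustrative - real ISPs do this with routing protocols and dedicated hardware, not application code, and every name and value below is made up - but it captures the essence of "one breaks, the other carries on":

```python
# Illustrative sketch only: real failover lives in routing protocols and
# dedicated hardware, not in application code. Names and data are made up.

def is_healthy(component):
    """Stand-in health check; a real system would probe, ping, or monitor."""
    return component.get("up", False)

def pick_active(components):
    """Return the first healthy component: "one" breaks, "the other" carries on."""
    for component in components:
        if is_healthy(component):
            return component
    raise RuntimeError("All redundant components have failed: a full outage.")

# Two of everything: if the primary router dies, traffic shifts to the backup.
routers = [
    {"name": "router-A", "up": False},  # primary has failed
    {"name": "router-B", "up": True},   # backup picks up the load
]

print(pick_active(routers)["name"])  # prints: router-B
```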
So, what happened when your ISP failed?
I don't know, but I'll hazard a guess, since it's one of the most common causes of widespread outages.
Remember that backhoe I mentioned?
I'm guessing that somewhere an important cable was cut. Or even two.
Redundant connections to the internet are nice and all that, but if they then both exit your ISP's data center at the same point, and run down underneath the same street next to each other for a while, a single swipe of the backhoe can take out both at once.
Don't laugh - it's happened more often than most would be willing to admit.
I know that Rackspace - the hosting provider at which the Ask Leo! servers are housed - takes care to ensure that their multiple internet connections actually leave from opposite ends of their data center buildings, and then take separate routes as they go to wherever they next connect.
So, an inopportune cable cut is my first guess.
Also, true redundancy isn't easy - it's easy to overlook single points of failure.
It's also easy for a redundancy or backup plan not to work when it's actually called into service. Perhaps in your case only one of a pair of redundant internet connections was actually cut, but when the load of the entire ISP was shifted to the remaining connection, it simply couldn't handle it. The result is that the first failure "cascades" into a second, shutting down the entire connection.
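To see why that cascade happens, here's a back-of-the-envelope illustration with entirely hypothetical numbers: two links each running at 60% of capacity are fine, but the moment one is cut, the survivor is asked to carry 120% of what it can handle.

```python
# Hypothetical numbers, purely to illustrate the cascade: two links that are
# each comfortably loaded become one link that's hopelessly overloaded.

link_capacity = 10_000        # capacity per link, in Mbit/s (made-up figure)
load_per_link = 6_000         # each link normally runs at 60% of capacity

total_load = 2 * load_per_link    # 12,000: easily shared across two links
load_on_survivor = total_load     # after the cut, one link carries it all

utilization = load_on_survivor / link_capacity
print(f"Surviving link utilization: {utilization:.0%}")  # prints: 120%
```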
The good news is that for the most part ISPs and data centers get this stuff.
The bad news is that, yes, as you can probably see, it's quite difficult.
To my knowledge, at this time the entire state of Arkansas has 1 - one, as in a single - fiber optic line that runs through the state from the east.
It's possible that this has changed in the last year or so, but I haven't heard anything about it.
Single points of failure are a fact, and they're a nightmare for emergency planners and network admins who try to plan ways around them.
Many years ago a backhoe was operating in front of my office building.
The operator hit a trunk telephone line!
Immediately TWELVE telephone trucks appeared and stayed for a couple of days while the cable was fixed.
There were a LOT of businesses that were out for a time.
BACKHOES PREVAIL.
A good friend of mine is a 5th generation family business owner. They've been building bridges since the late 1800s. They recently won a case against –BIG-NAME phone company– because my friend's company dug up fiber lines and the phone company accused them of digging in the wrong place. My friend happened to be at the site at the time of the "oops" and it's just like Leo said: the fiber lines got dug up by a backhoe... all of them (I think there were dozens, but I might be mistaken). I do know they were all in one big bundle. Fortunately they were dark fiber, so no one was affected. However, the reality was this: –BIG NAME phone company– had the lines marked incorrectly on the engineer's map. The surveyors did their part, they found where the lines were supposed to be, flagged the area, and the backhoe did NOT dig there. Ultimately, it was proven that the phone company didn't create the maps correctly. To make a long story even longer, if those had been live fiber cables, the entire area could have been without internet/phone, and it's all because a map was marked incorrectly by just a FEW HUNDRED FEET and all the fiber in that area was in ONE LOCATION! Of course, they might have another redundant bundle somewhere else that didn't pertain to my friend's court case, but that doesn't make for a very exciting story, now does it. :-P
While installing an electrical service to an underground sewer lift station, the contractor responsible for setting the station struck a fiber bundle that was either not located or was incorrectly located by the required locating service. State electrical code requires a single #12 copper conductor to be buried a few inches above the fiber optic cable to make locating the FO cable much easier and more accurate. The phone company presented him with a bill for $1.6 million.