How Can One Service Take Down So Much?

The cloud: you’re soaking in it.

What happens when a major cloud service goes down, and what's the takeaway?

Every once in a while, many websites — including Ask Leo! — suffer failures because of outages at one of the internet’s major cloud service providers: Amazon.

These outages highlight important aspects of the modern internet and raise a few questions along the way.

Cloud service failures

Most of the internet is hosted on computers in data centers belonging not to the services you and I use, but to hosting companies. “Other people’s computers” is the very definition of “the cloud”. It’s important to understand the benefits (mostly cost, scalability, and reliability) and risks (such as unexpected failures) and plan accordingly.

Amazon Web Services

Amazon Web Services, or more commonly AWS, is the cloud-service arm of Amazon.com.

Years ago, as Amazon was ramping up its retail efforts, the company realized it could rent out the computing infrastructure it had built for itself. That became today’s AWS, now arguably the world’s largest cloud-service provider.

AWS provides a wide and complex range of services. The simplest — S3, for Simple Storage Service — is nothing more than online storage. Another relatively simple example is EC2, for Elastic Compute Cloud, AWS’s virtual server product.

AWS also includes domain registration and DNS services, database offerings, email and messaging products, and a wide variety of more esoteric services.

All of it runs in Amazon’s data centers on Amazon’s hardware. Customers pay based on how much they use a service or mix of services.

Benefits

I’ll oversimplify by boiling everything down to two primary benefits: cost and scalability.

Again, using a cloud service is simply using someone else’s computer. Rather than purchasing (or renting) an actual computer to act as a server in a data center somewhere, AWS customers simply create an EC2 virtual computer as needed on Amazon’s hardware. Rather than buying extra hard disks for that computer, they use Amazon’s S3 storage[1] as needed.

It’s often significantly cheaper to purchase services from a cloud-service provider than it is to purchase, manage, inventory, connect, repair, replace, maintain, and otherwise deal with all the issues around having your own hardware. It’s someone else’s computer, and they’re responsible for all those pesky details, allowing you to focus on your website, application, or service.

One of the lesser-known aspects of cloud computing is the ability to respond quickly to changing needs.

Ask Leo! is currently running on an AWS EC2 virtual server. The server is a software simulation of a “smallish” computer running along with several other such simulations on a “really big” computer. If I decide I need my server to be a little less “smallish” — as I have, at times — I can change a couple of settings, reboot the server, and within a few minutes have a new, more powerful server without needing to reconfigure any of the software on the machine and without having to deal with any hardware.
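For readers curious what such a resize looks like when scripted rather than clicked through a web console, here is a minimal sketch. It assumes `ec2` is a boto3 EC2 client; the function name, instance ID, and instance type are placeholders for illustration, not details of the Ask Leo! server. In practice the "reboot" is a stop and a start, because an instance's type can only be changed while it is stopped.

```python
def resize_instance(ec2, instance_id, new_type):
    """Stop an instance, change its type, and start it again.

    `ec2` is assumed to be a boto3 EC2 client, for example
    boto3.client("ec2"). The disk contents and installed software are
    left untouched; only the size of the virtual machine changes.
    """
    # The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Ask for a bigger (or smaller) virtual machine.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    # Bring the server back up and wait until it is running.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

A call might look like `resize_instance(boto3.client("ec2"), "i-0123456789abcdef0", "t3.medium")`, where both the ID and the type are made-up example values.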

Imagine if that could happen in real time.

Properly written applications in Amazon’s AWS infrastructure can do exactly that: respond to usage changes by transparently adding or removing resources as needed. This is a really big deal, and is a common reason very large applications or services host on AWS. They only need — and pay for — a few resources when they’re not busy, and add more as needed. Doing this with your own dedicated hardware is next to impossible and generally impractical, since you’re paying for your hardware whether or not you’re using it.
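To make "adding or removing resources as needed" a little more concrete, here is a simplified sketch of the kind of decision an automatic scaling rule makes. It is an illustration only, not Amazon's actual algorithm; the function name, the 50% CPU target, and the server limits are all assumptions chosen for the example.

```python
import math


def desired_servers(current, cpu_percent, target_percent=50.0,
                    minimum=1, maximum=10):
    """Return how many servers to run so average CPU lands near the target.

    current      -- number of servers running right now
    cpu_percent  -- their average CPU use, 0 to 100
    All defaults are hypothetical values for illustration.
    """
    if current < 1:
        raise ValueError("current must be at least 1")

    # If 2 servers are at 90% and we want 50%, we need 2 * 90 / 50 = 3.6,
    # which rounds up to 4 servers.
    wanted = math.ceil(current * cpu_percent / target_percent)

    # Never drop below the minimum or grow past the maximum.
    return max(minimum, min(maximum, wanted))
```

With these assumed defaults, two busy servers at 90% CPU become four, while four nearly idle servers at 10% shrink to one, which is where the "pay only for what you use" savings come from.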

There are other benefits, but those are the two biggest.

The third, perhaps contrary to occasional experience, is significantly improved reliability. To discuss that, though, we need to discuss risks.

Risks

The biggest risk most people think of is that your service is running on someone else’s computers in someone else’s data center, and thus your fortunes are in someone else’s hands. If they suffer an outage, you suffer along with them. When Amazon AWS has a service-wide problem, all services relying on it have a problem.

It is a risk. But the more persuasive counterargument is that noticeable downtime for large cloud services like AWS is very rare. When something does go wrong, every available resource is brought to bear on the problem until it’s resolved. We never notice the more typical hardware failures because the service is built to handle them transparently.

The other concerning risk is privacy. Your service, your data, and your customers’ data are on someone else’s computers. That’s probably true for 99% of the internet. Websites are rarely implemented on dedicated hardware accessible only by the website owner. At a minimum, data center personnel need access to support it, and hosting providers of various sorts technically have access to the contents of all the websites they host.

What protects you as an online provider are strong privacy policies, the integrity of the companies you choose to use (like AWS), and your own ability to appropriately protect the data you keep online.

Alternatives

There are, of course, alternative cloud-service providers. For example, both Google and Microsoft are growing offerings to compete with AWS and each other. From the basics, including raw storage and virtual servers, to the more esoteric online services, both offer the same general benefits and risks as AWS. They differ in pricing as well as other implementation details.

At a more practical level, there’s really no true alternative to the concept of what we call cloud computing. If you have a website or service accessible over the internet and it’s not on a server in your closet or workplace you can see and touch, it’s running on someone else’s computer: the very definition of cloud computing.

You can choose alternative implementations and providers, but as with any data stored online or service provided online, it’s almost certainly on someone else’s computer and subject to all the risks and benefits I’ve discussed here.

That means service providers need to prepare for those risks.

Risk management

I can almost guarantee you that after every large-scale system failure, there are several high-profile companies asking themselves, “How do we avoid this in the future?” They might elect to switch to a different provider or different architecture, but that’s simply trading one set of risks for a very similar set of risks elsewhere.

A more pragmatic question to ask is, “How do we deal with this when it happens again?”

One answer is, “Live with it.” The cost of a contingency plan may well exceed the cost of a short, widespread failure. It might make sense to rely on the fact that all eyes are on Amazon (or whichever provider experiences a failure), and they are under the gun to get the issue resolved as quickly as possible.

Another answer is to have a backup plan. “When this happens, then do that…” could mean running in a crippled, less feature-rich mode until the issue is resolved. This really depends on the specifics of the online site or application and how it’s impacted by the failure.
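As one illustration of "when this happens, then do that", a site might fall back to a saved copy of a page when the live source is unreachable. The sketch below is a minimal example under that assumption; `fetch_live` and `cached_copy` are hypothetical names, and a real site would need to decide for itself which features can safely degrade.

```python
def fetch_with_fallback(fetch_live, cached_copy):
    """Return (content, degraded).

    fetch_live  -- a function that returns fresh content, and raises
                   OSError (which includes connection and timeout
                   errors) when the underlying service is down
    cached_copy -- older content saved while things were working
    """
    try:
        return fetch_live(), False
    except OSError:
        # The provider is having a bad day: serve the stale copy and
        # flag it so the page can show an "we're aware of the problem"
        # notice.
        return cached_copy, True
```

The `degraded` flag lets the caller show a short notice instead of failing outright.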

Regardless of whether action is taken, being aware of the possibility and having a plan — even if that plan is to do nothing but say, “I’m sorry, we’re aware of the problem” — is a good result.

My AWS exposure

Since I wrote the original version of this article (in response to a specific AWS outage some years ago), I’ve moved most of my computing infrastructure to AWS.[2] I host Ask Leo! and all my other websites on AWS EC2 servers.

So, yes, if AWS experiences a problem, I might be screwed.

I’ve thought about this. A lot. And not just for AWS, but for any hosting service I have used or might use in the future.

Recovery plan #1: do nothing. If there’s a problem at AWS, so many other companies will be similarly affected that the folks at Amazon will scramble to get things resolved as quickly as humanly possible. We’re typically talking hours, not days.

Recovery plan #2: restore from backups[3] elsewhere. Where, I don’t know. That’ll depend on the hosting landscape at the time. The important thing is that I have current backups that are not stored at AWS from which I can rebuild my world. It’ll be painful and a lot of work, but it’ll be possible.
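The two conditions that matter here, "current" and "not stored at AWS", are easy to check automatically. The sketch below is a hypothetical example, not a description of any actual backup setup; the record fields (`location`, `taken_at`), the one-day freshness limit, and the function name are all assumptions.

```python
from datetime import datetime, timedelta


def usable_backups(backups, now, max_age=timedelta(days=1), provider="aws"):
    """Return the backups that are both recent and stored off-provider.

    backups -- list of dicts with hypothetical keys:
               "location" (where the copy lives) and
               "taken_at" (a datetime of when it was made)
    """
    return [
        b for b in backups
        if b["location"] != provider and now - b["taken_at"] <= max_age
    ]
```

If this ever returns an empty list, recovery plan #2 is not really available, and that is worth knowing before an outage rather than during one.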

Cloud services: here to stay

The bottom line is that cloud services and providers like AWS are here for the long haul.

Their popularity — highlighted by the number of sites and services that can be impacted by an outage — has soared. I’d say that cloud-service platforms have helped the internet grow in ways it simply would not have otherwise.

Any service provider, cloud or otherwise, has problems from time to time. The real measure is how they deal with those problems, and what both providers and their customers learn from the experience.

Update: Human Error

The day after the outage spurring the original version of this article, Amazon posted a nicely detailed report of what happened and what steps they took as a result.

The bottom line?

Human error. In fact, something that might feel a little too close to home: a mistyped command deleting more than intended. Whoops.

Amazon’s response stands as a good example of transparency and responsibility. Errors and failures happen. What matters more is the response.

Footnotes & References

1: Or another AWS service, EBS: Elastic Block Store.

2: Mostly for fun. Really. Yes, this kinda stuff is fun for me.

3: You knew I was going to mention backups at some point, right?

5 comments on “How Can One Service Take Down So Much?”

Steven

March 7, 2017 at 7:01 pm

I suspect that in the future you will see switching between cloud services for large players. It won’t become feasible on a wide scale due to the costs, but there are some businesses and government players, for example, who cannot afford to have downtime.
- Leo
  
  March 7, 2017 at 8:15 pm
  
  Indeed, spreading the risk makes a lot of sense.
Judy B.

March 22, 2017 at 12:50 pm

Thank you for posting the news of this unfortunate occurrence. It was very interesting and informative. It was also good to see Amazon’s response letter. In this day and age it is unusual for a business of Amazon’s size to be that transparent. I would put this in your good news column.
Frank

March 21, 2022 at 11:38 am

Proper to ask where to get information on setting up own small server system? Thanks for warning about cloud controlling all my info
- Mark Jacobs (Team Leo)
  
  March 21, 2022 at 12:45 pm
  
  This article isn’t a complete how to, but it might help
  How Should I Set Up My Linux Web Server?