As I type this, a surprisingly large number of web sites – including some aspects of Ask Leo! – are recovering from a massive outage at one of the internet’s major cloud service providers: Amazon.
While the specifics of what caused the downtime have yet to be made public, the outage serves to highlight some important aspects of the modern internet, and raise a few questions along the way.
Amazon Web Services
Amazon Web Services, or more commonly just AWS, is the cloud-service arm of Amazon.com.
Years ago, as Amazon was ramping up its retail efforts, they realized that the computing infrastructure they were creating for themselves could also be rented to others. That was the genesis of today’s AWS, now arguably the world’s largest cloud-service provider.
AWS provides a wide and complex range of services. The simplest – S3, for “Simple Storage Service” – is nothing more than online storage. Another relatively simple example is EC2, for “Elastic Compute Cloud”, AWS’s virtual server product.
AWS also includes domain registration and DNS services, database offerings, email and messaging products, and a wide variety of more esoteric services.
All of it runs in Amazon’s data centers, on Amazon’s hardware. Customers pay based on how much they use a given service or mix of services.
I’ll oversimplify greatly by boiling everything down to two primary benefits: cost and scalability.
As I’ve quoted elsewhere, using a cloud service is simply using someone else’s computer. In the case of AWS, rather than purchasing an actual computer to act as a server in a data center somewhere, AWS customers simply create an EC2 virtual computer as needed on Amazon’s hardware. Rather than buying extra hard disks for that computer, they use Amazon’s S3 storage as needed.
It’s often significantly cheaper to purchase services from a cloud-service provider like AWS than it is to purchase, manage, inventory, connect, repair, replace, maintain, and otherwise “deal with” all the issues around having your own hardware. It’s someone else’s computer, and they’re responsible for all those pesky details, allowing you to focus on your website, application, or service.
One of the lesser-known aspects of cloud computing is the ability to quickly respond to changing needs.
As I wrote in that previous article, Ask Leo! is currently running on a virtual server – a software simulation of a “smallish” computer running along with a number of other such simulations on a “really big” computer. If I decide I need my server to be a little less “smallish”, I can change a couple of settings, reboot the server, and within a couple of hours have a new, more powerful server, without needing to reconfigure any of the software on the machine, and without having to deal with any hardware at all.
Imagine if that could happen in real time instead of in a couple of hours. Properly written applications in Amazon’s AWS infrastructure can do exactly that: respond to changes in usage by automatically adding or removing resources as needed. This is really a big deal, and is perhaps one of the more common reasons that very large applications or services elect to host on AWS. They only need – and pay for – a few resources when they’re not busy, and add more as needed. Doing this with your own dedicated hardware is next to impossible, and generally impractical, since it’s your hardware you’re paying for whether you’re using it or not.
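As a rough illustration of the idea – a sketch of my own, not Amazon's actual scaling algorithm – an auto-scaling policy boils down to a simple rule: size the fleet so that average utilization lands near a target, within fixed bounds.

```python
import math

def desired_instances(current, cpu_utilization, target=0.5, minimum=1, maximum=10):
    """Rough sketch of target-tracking scaling: given the current fleet size
    and its average CPU utilization, return the fleet size that would bring
    utilization near the target, clamped to [minimum, maximum]."""
    needed = math.ceil(current * cpu_utilization / target)
    return max(minimum, min(maximum, needed))

# Quiet period: 4 instances at 10% CPU -> shrink to the minimum of 1.
print(desired_instances(4, 0.10))  # 1
# Traffic spike: 4 instances at 95% CPU -> grow to 8.
print(desired_instances(4, 0.95))  # 8
```

A real service evaluates a rule like this continuously and adds or removes capacity automatically; with your own hardware, "needed = 8" means a purchase order, not a function call.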
There are more benefits, but in my opinion, those are the two biggest.
A third benefit – perhaps contrary to today's experience – is significantly improved reliability. To discuss that, though, we need to discuss risks in general.
The single biggest risk most people think of is exemplified by today’s failure: your service is running on someone else’s computers, in someone else’s data center, and as a result, your fortunes are, to a large degree, in someone else’s hands. If they suffer an outage, you suffer along with them. When Amazon S3 has a service-wide problem – as it just did – any and all services that rely on it have a problem.
That’s why so many web sites and web services weren’t working today – including a couple of my own.
It is a risk, no doubt. But the counterargument, which I find more persuasive, is that actual noticeable downtime for large cloud services like S3 is incredibly rare, and when something happens, all guns are brought to bear on the problem until it’s resolved. More commonly, if a disk fails in Amazon S3 – which I’m certain happens multiple times a day – we never notice, because the service is built to take care of it.
The risk I take running my own computer in a data center somewhere is actually higher, for me at least. For example, if my disk has a problem, then I have a problem, and I have to fix it. Hopefully I’ve prepared, and hopefully my friends at the data center will take it as seriously as I do.
The other risk that concerns people is privacy. Your service, data, and perhaps more importantly, your customers' data is on someone else's computers. As it turns out, that's probably true for 99% of the internet. Websites are rarely implemented on dedicated hardware accessible only by the website owner. At a minimum, data center personnel need access to support it, and more commonly, hosting providers of various sorts technically have access to the contents of all websites they host. AWS is no different in this regard.
What protects you as an online provider are strong privacy policies, the integrity of the companies you choose to use (like AWS), and your own ability to encrypt and otherwise protect the data you keep online.
There are, of course, alternative cloud-service providers. For example, both Google and Microsoft are expanding their offerings to compete with AWS as well as each other. From the basics, including raw storage and virtual servers, to the more esoteric online services, both offer the same general benefits and risks as AWS, and differ in pricing as well as other implementation details.
At a more practical level, there’s really no true alternative to the concept of what we’ve come to call “cloud computing”. If you have a website or service that’s accessible over the internet, and it’s not on a server in your closet or workplace that you can see and touch, it’s running on someone else’s computer – the very definition of cloud computing.
You can choose alternative implementations and providers, but if you have any data stored online or any service provided online, it's almost certainly on someone else's computer, and subject to all of the risks and benefits I've discussed here.
That means service providers need to prepare for those risks.
I can almost guarantee you that this afternoon, there are a number of high-profile companies asking themselves, “How do we avoid this in the future?” They might elect to avoid AWS and switch to a different provider, or different architecture, but as I hope I’ve made clear, that’s simply trading one set of risks for a very similar set of risks elsewhere.
A more pragmatic question to ask is, “How do we deal with this when it happens again?”
One answer might be “live with it”. By that, I mean that the cost of a contingency plan might well exceed the cost of something like today’s failure. It might just make sense to rely on the fact that all eyes were on Amazon, and they were under the gun to get your issue resolved as quickly as possible, because it was so many others’ issue as well.
Another answer might be to have a backup plan. “When this happens, then do that…” could mean running in a crippled, less feature-rich mode until the issue is resolved. This really depends on the specifics of the online site or application, and how it’s impacted by the failure.
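One minimal way to implement that kind of "when this happens, do that" plan – a sketch of my own devising, not any particular service's code – is a fallback chain: try the primary source, and on failure fall through to a degraded alternative.

```python
def first_available(fetchers):
    """Try each content source in order; return the first result that works.
    Each fetcher either returns content or raises an exception."""
    last_error = None
    for fetch in fetchers:
        try:
            return fetch()
        except Exception as err:
            last_error = err
    raise RuntimeError("all sources failed") from last_error

def from_s3():
    # Simulate today's outage: the primary store is unreachable.
    raise ConnectionError("S3 unavailable")

def placeholder():
    # Degraded mode: serve an honest "we know" message instead of the content.
    return "We're aware of the problem; please check back shortly."

print(first_available([from_s3, placeholder]))
```

The same pattern scales from a one-line apology message all the way up to serving cached copies from a second provider; the point is deciding the fallback order before the outage, not during it.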
Regardless of whether or not action is taken, being aware of the possibility and having a plan – even if that plan is simply to empower customer service agents to say, “I’m sorry, we’re aware of the problem” – is a good result.
My AWS exposure
All this is very esoteric and obscure, so let me close by giving you a couple of concrete examples of how my online offerings were impacted – both personally and at Ask Leo! itself – and the decisions I may or may not make as a result of this experience.
One example is http://anexampleisp.com/.
This is a domain I own – a single page hosted on Amazon’s S3 data storage service. While S3 was down, this site was inaccessible. I actually have several domains like this, all of which were completely inaccessible during the outage. My takeaway? Live with it. In the unlikely event S3 has a similar problem in the future, these sites will once again be inaccessible. The “cost” of moving them elsewhere – taking on a different set of risks – isn’t worth it.
An example with more impact would be all the Ask Leo! videos not on YouTube.
Many of my books and other offerings include videos. These videos are hosted on Amazon S3, and were similarly inaccessible during the outage. If you're a registered book owner, for example, and went to members.askleo.com (which, being on my virtual server and not at AWS, continued to work just fine), playback of the videos that came with your book failed.
This is a failure mode for which I’ve actually planned ahead. It would take me about 24 hours, but should the need ever arise, I’ve architected how I display those videos in my web pages such that I could move them elsewhere. It would be a lot of work, but it would be doable. Should I ever face an S3 problem lasting a significant amount of time, I’d bury my head in my servers and start the migration.
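The planning amounts to a single point of configuration: pages never hard-code the S3 hostname, so moving the videos means changing one setting and re-uploading the files. A hypothetical sketch – the names and URLs here are illustrative, not my actual code:

```python
# Single setting controlling where video files live.
VIDEO_BASE_URL = "https://example-bucket.s3.amazonaws.com/videos"

def video_url(filename):
    """Build the full URL for a video; pages call this rather than
    hard-coding the hosting provider's hostname anywhere."""
    return f"{VIDEO_BASE_URL}/{filename}"

print(video_url("lesson1.mp4"))

# Migrating to another host is then a one-line change:
VIDEO_BASE_URL = "https://videos.example.com"
print(video_url("lesson1.mp4"))
```

The migration itself – copying gigabytes of video to the new host – is still the 24 hours of work; the indirection just means no page needs to be edited when it happens.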
My business was impacted only by the S3 failure, and only slightly at that. Ask Leo! itself uses several service providers, but my exposure with AWS is pretty limited. I also use Amazon's DNS service ("Route 53"), which was not impacted today.
Cloud services: here to stay
The bottom line is that cloud services and providers like AWS are here for the long haul.
Their popularity – highlighted by the number of sites and services impacted by the outage – has soared. I’d go so far as to say that cloud-service platforms have helped the internet grow in ways it simply would not have otherwise.
Service providers – any service provider, “cloud” or otherwise – might have problems from time to time. The real measure is how the problems are dealt with, and what lessons are learned from the experience – both for providers and for their customers.
Update: Human Error
The day after the outage, Amazon posted a nicely detailed report of what happened and what steps they're taking as a result. The bottom line?
Human error. In fact, something that might feel a little too close to home: a mistyped command that deleted more than was intended.
Amazon’s response stands as a good example of transparency and responsibility. Errors and failures happen. What matters as much or more is the response.