Opinion on the Amazon S3 Outage; Checklist for Dealing with Outages

My journalist colleagues at Wired.com published some of my comments related to Amazon S3.1 Wired also posted another article titled Customers Shrug Off S3 Service Failure. I agree with the views of many of the customers expressed in the article. Don MacAskill, CEO of the popular photo hosting site Smugmug, wrote an understanding post about it.

My entire career working for media companies, I’ve held firm the belief that the uptime, reliability, performance, scalability, performance and security of commercial Web sites is of paramount importance. When sites that I’ve been responsible for have had issues, my colleagues and I have given our personal time and energy to resolution. With my teams, I spend considerable time on proactive measures. I’ve had the honor of working closely with and learning from some who do an excellent job running technology operations.

Experience has taught that things can and sometimes do go wrong. Sometimes calculated risks don’t pan out. Sometimes mistakes cause problems. We are human. We should strive for perfection; we can get close to it, but not fully attain it. We should be prepared for such scenarios. When they happen, we should work diligently and expeditiously on resolution and have frequent and honest communications with stakeholders and customers. Such communications during the incident should include:

During-Incident Communication Checklist

  • Current status
  • What is the full impact?
  • Estimated time to resolution
  • Any recommended workarounds until resolution, if practical
  • Assurance that it is being worked on
    • It often helps to mention who all are working on it and what they are doing

The post-incident communications to stakeholders and customers should include:

Post-Incident Communication Checklist

  • Summary
  • What happened, how and why it happened?
    • Including full description of all impact
    • Do not blame2 third-parties or say things like “beyond our control”. A technology leader takes responsibility equally for both insourced and outsourced products and services.3
  • How it was resolved
    • If the resolution is temporary or long-term
  • Next steps
  • Plan for eliminating or minimizing this and similar incidents from happening again
  • Thank all those who helped resolve and the customers for their understanding
  • Mention the monetary credits you plan to give as per the Service Level Agreement (SLA)
    • Specify any additional ‘make goods’ or returns you plan to make to the customers above and beyond the credits as per SLA, if appropriate.
  • Double check each recipient’s email address to make sure you are sending this memo which may contain confidential information to the correct person and not someone else with a similar name in your address book. You don’t want your memo published on Gawker.
  • Speaking of Gawker, in the event someone does leak your memo outside the beyond the intended recipients, take care to not say anything in it that would be an embarrassment. That’s another reason to be honest, own the problem and solution, and not pass the blame.

Stakeholders and customers here refer to internal customers of the technology operations team (e.g. the concerned folks in editorial, marketing, sales, finance, legal and other departments). External communications to the public Internet should be handled in consultation with legal and public relations.

S3’s outage (or any outage) isn’t to be taken lightly, but I have faith Amazon and their customers will learn from it.

Disclaimers:

  • As explained in the terms of use of this site, any opinions expressed on my personal Web site do not reflect those of any employer, past or present. My Web site and I in my personal life neither represent nor speak for any corporation.
  • I have no affiliation, financial or otherwise with Amazon.com. I happen to be a user of their products and services, some of which I like and some that I don’t.
  • Personal Web sites like this are exempt from the performance requirements of corporate Web sites :-) My personal Web site is for expressing, learning and R&D. It also happens to be hosted on Amazon EC2 and S3.
  1. Silicon Alley Insider and ValleyWag have amusing spins on it. :-) []
  2. There may be extreme instances, especially when criminal activity or malicious wrongdoing was the cause where it would be appropriate to blame someone. []
  3. It is ok to mention service providers, or describing external events for explaining what happened, but don’t do it in a “it was their fault, not ours” tone. The technology leader should factually describe what happened and take responsibility. []

This Web Site is Now Hosted on Amazon EC2 & S3

This web site, www.rajiv.com is now hosted on Amazon.com’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3) services. They are part of Amazon Web Services offerings. If you are a technologist, I recommend EC2 and S3. To learn more about them, you can follow the links in this article.

Benefits of hosting a Web site on EC2 & S3

  • The hosting management is self-service. Anytime you want, you can provision additional servers yourself and immediately. Unlike with most traditional hosting companies, there is not need to contact their staff and have to wait for them to set up your server. On EC2, once you have signed up for an account and set up one server, you can provision (or decommission) additional servers within minutes. Even the initial setup is self-service.
  • EC2 enables you to increase or decrease capacity within minutes. You can commission one or hundreds of server instances simultaneously. Because this is all controlled with web service APIs, your application can automatically scale itself up and down depending on its needs. Billing is metered by an hour as the unit. This flexibility of EC2 can benefits many use cases:
    • If your web sites get seasonal traffic (e.g. a fashion site during shows) or can temporarily get much higher traffic for a period of time (e.g. a news site), EC2’s business model of pay for what you use by the hour, is cost-effective and convenient.
    • If yours is the R&D or Skunkworks group at a large or medium size organization or a startup company with limited financial resources, renting servers from EC2 can have many benefits. You don’t have to make a capital investment to get a server farm up and running, nor make long-term financial commitments to rent infrastructure. You can even turn off servers when not in use, greatly saving costs.
  • It allows me to use the modern Ubuntu1 GNU/Linux operating system, Server Edition. Among Ubuntu’s many benefits are its user friendliness and ease of use. Software installations and upgrades are a breeze. That means less time is required to maintain the system while retaining the flexibility and power being a systems administrator gives.
  • EC2 has lower total cost ownership for me than most hosting providers’ virtual hosting or dedicated server plans. Shared (non virtual server) hosting is still cheaper, but no longer meets my sites’ requirements.2

Potential drawbacks/caution with EC2 & S3

  • While S3 is persistent storage, EC2 virtual server instances’ storage does not persist across server shutdowns. So if your web site is running a database and storing files on an EC2 instance, you should implement scheduled, automated scripts that regularly back up your database and your files to S3 or other storage.
    • Consistent with what I read in some comments online, my EC2 virtual server instance did not lose its file-system state or settings when I rebooted it. So rebooting seems to be safe.3
    • This potential drawback is arguably a good thing in some ways. It compels you to implement a good backup and recovery system.
    • This also means that after installing all the software on your running Amazon Machine Image (AMI), you should save it by creating a new AMI image of it as explained in the Creating an Image section of the EC2 Getting Started Guide.
      • This is an issue since you may want to do this every time after you update your software, especially with security patches. Until Amazon implements persistent storage for EC2 instances, you could do this monthly. You can script this to be partly or fully automated. Since Amazon’s EC2 instances are quite reliable, this is not a major concern.
  • An EC2 instance’s IP address and public DNS name persists only while that instance is running. This can be worked around as described under the tech specs section below.

Some articles about Amazon’s hosting infrastructure services:

Tech specs of my site:

  1. www.ubuntu.com []
  2. I plan to split rajiv.com into separate sites, The India Comedy site will move to comedy.rajiv.com and the SPV Alumni site will move to spv.rajiv.com. The latter two are community sites and will benefit from a community CMS like Drupal. []
  3. However, please be aware of a known issue that on some occasions caused instance termination on reboots. []
  4. I created my AMI virtual machine by building on top of a public Ubuntu AMI by Eric Hammond. []