Opinion on the Amazon S3 Outage; Checklist for Dealing with Outages

My journalist colleagues at Wired.com published some of my comments related to Amazon S3.1 Wired also posted another article titled Customers Shrug Off S3 Service Failure. I agree with the views of many of the customers expressed in the article. Don MacAskill, CEO of the popular photo hosting site Smugmug, wrote an understanding post about it.

Throughout my career working for media companies, I’ve firmly held the belief that the uptime, reliability, performance, scalability and security of commercial Web sites are of paramount importance. When sites that I’ve been responsible for have had issues, my colleagues and I have given our personal time and energy to resolving them. With my teams, I spend considerable time on proactive measures. I’ve had the honor of working closely with and learning from some who do an excellent job running technology operations.

Experience has taught me that things can and sometimes do go wrong. Sometimes calculated risks don’t pan out. Sometimes mistakes cause problems. We are human. We should strive for perfection; we can get close to it, but not fully attain it. We should be prepared for such scenarios. When they happen, we should work diligently and expeditiously on resolution and communicate frequently and honestly with stakeholders and customers. Such communications during the incident should include:

Update 2010-Jan-24: This checklist is now maintained on the Checklists Wiki Web site at:

www.checklistnow.org/wiki/IT_Incident_Reporting

During-Incident Communication Checklist

  • Current status
  • What is the full impact?
  • Estimated time to resolution
  • Any recommended workarounds until resolution, if practical
  • Assurance that it is being worked on
    • It often helps to mention who is working on it and what they are doing

The post-incident communications to stakeholders and customers should include:

Post-Incident Communication Checklist

  • Summary
  • What happened, how and why it happened?
    • Including full description of all impact
    • Do not blame2 third-parties or say things like “beyond our control”. A technology leader takes responsibility equally for both insourced and outsourced products and services.3
  • How it was resolved
    • Whether the resolution is temporary or long-term
  • Next steps
  • Plan for eliminating or minimizing this and similar incidents from happening again
  • Thank all those who helped resolve the incident, and thank the customers for their understanding
  • Mention the monetary credits you plan to give as per the Service Level Agreement (SLA)
    • Specify any additional ‘make goods’ or returns you plan to make to the customers above and beyond the credits as per SLA, if appropriate.

Stakeholders and customers here refer to internal customers of the technology operations team (e.g. the concerned folks in editorial, marketing, sales, finance, legal and other departments). External communications to the public Internet should be handled in consultation with legal and public relations.

S3’s outage (or any outage) isn’t to be taken lightly, but I have faith Amazon and their customers will learn from it.

Disclaimers:

  • As explained in the terms of use of this site, any opinions expressed on my personal Web site do not reflect those of any employer, past or present. My Web site and I in my personal life neither represent nor speak for any corporation.
  • I have no affiliation, financial or otherwise with Amazon.com. I happen to be a user of their products and services, some of which I like and some that I don’t.
  • Personal Web sites like this are exempt from the performance requirements of corporate Web sites :-) My personal Web site is for expressing, learning and R&D. It also happens to be hosted on Amazon EC2 and S3.
  1. Silicon Alley Insider and ValleyWag have amusing spins on it. :-)
  2. There may be extreme instances, especially when criminal activity or malicious wrongdoing was the cause, where it would be appropriate to blame someone.
  3. It is ok to mention service providers, or to describe external events when explaining what happened, but don’t do it in an “it was their fault, not ours” tone. The technology leader should factually describe what happened and take responsibility.

Social Graphs API: WordPress Plugin: Blogroll Links

If you already know what the Social Graph API and XFN are, you can skip the background information and go directly to the Blogroll Links plugin for WordPress that is designed to work with these.

Update: 2010-Feb-20: Version 2 of the Blogroll Links plugin for WordPress uses the Shortcode API and so introduces a new code-tag format. The new plugin still supports the old (now deprecated) code-tag format for backwards compatibility. See below for examples.

Social Graph API

Google recently announced the Social Graph API.1 From Google’s Code site:

With so many websites to join, users must decide where to invest significant time in adding their same connections over and over. For developers, this means it is difficult to build successful web applications that hinge upon a critical mass of users for content and interaction. With the Social Graph API, developers can now utilize public connections their users have already created in other web services. It makes information about public connections between people easily available and useful.

We (Google) currently index the public Web for XHTML Friends Network (XFN), Friend of a Friend (FOAF) markup and other publicly declared connections. By supporting open Web standards for describing connections between people, web sites can add to the social infrastructure of the web.

The Google Code site also has a video introduction to the open social graph:

[Embedded YouTube video]

The Google Code site has some interesting example applications. To see the power of the open social graph, follow these links:

All I did was enter my home page http://www.rajiv.com/ into these applications and got the results linked to above.

XHTML Friends Network, a component of open social networks

XFN (XHTML Friends Network) is a simple way to represent human relationships using hyperlinks. In recent years, blogs and blogrolls have become the fastest growing area of the Web. XFN enables web authors to indicate their relationship(s) to the people in their blogrolls simply by adding a ‘rel’ attribute to their <a href> tags, e.g.:

<a href="http://www.rajiv.com/" rel="friend met">Home Page: Rajiv Pant</a>

The above link means that the page at http://www.rajiv.com/ belongs to a friend of the person who owns the page this link is placed on. The met value specifies that the two friends have met in real life. The link above would not be placed on a page owned by Rajiv Pant. It would be placed by a friend on their page, for example, on http://www.paradox1x.org/

Here is another example:

<a href="http://photos.rajiv.com/" rel="me">Photo Albums: Rajiv Pant</a>

This link states that the page at the URL http://photos.rajiv.com/ belongs to the same person who owns the page this link is placed on. For example, the above link would be placed on http://www.rajiv.com/ telling the Web that the URLs http://photos.rajiv.com/ and http://www.rajiv.com/ belong to the same person.

To find out how to write and use XFN, or to write a program to generate or spider it, visit the XFN Web site.
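
As a rough illustration of the "spider" side, here is a minimal Python sketch that reads the rel values from the two example links above. It uses only the standard library's html.parser module; the class and function names (XFNLinkParser, extract_xfn_links) are made up for this example and are not part of XFN or of any XFN tool.

```python
from html.parser import HTMLParser


class XFNLinkParser(HTMLParser):
    """Collect (href, [rel values]) pairs from <a> tags that carry a rel attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        rel = attr_map.get("rel")
        if href and rel:
            # rel holds space-separated values, e.g. "friend met"
            self.links.append((href, rel.split()))


def extract_xfn_links(html):
    """Return every (href, rel values) pair found in an HTML string."""
    parser = XFNLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# The two example links from this post
sample = (
    '<a href="http://www.rajiv.com/" rel="friend met">Home Page: Rajiv Pant</a>'
    '<a href="http://photos.rajiv.com/" rel="me">Photo Albums: Rajiv Pant</a>'
)

for href, rels in extract_xfn_links(sample):
    print(href, rels)
```

A real spider would also need to fetch pages and check the rel values against the list defined on the XFN Web site; this sketch only shows the parsing step.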

Blogroll Links Plugin for WordPress

For people who maintain their Web site or blog using the WordPress blog content management system, I created an open source plugin called blogroll-links that uses WordPress’ built-in Blogroll feature2 and presents links to friends’ home pages and own pages on social networking sites using XFN in the links.

Features of this plugin

  • It can show the links by category in blog posts and WordPress Pages.
  • It uses WordPress’ standard built-in Blogroll links database. There is no hassle of another list of links to maintain.
  • It can be used to show only the links assigned to a particular category, by stating the category slug as defined in that category’s setting in WordPress.
  • It honors the Show/Hidden setting as defined for each link in WordPress.
  • It displays the link in the same window or new window, as specified for each link in WordPress.

See this plugin in action

  • http://www.rajiv.com/friends/
    • The two lists, first one of links to my own pages on various social networking sites and the second one of links to some of my friends’ pages are generated by this plugin. Yes, those social networks’ logo pictures are also taken by the plugin from the WordPress standard Blogroll links. Code:
    • <h3>My Pages on Social Networking Sites</h3>
      [blogroll-links categoryslug="rajiv-web" sortby="link_name" sortorder="desc"]
      <h3>Web Sites of Some People I Know</h3>
      [blogroll-links categoryslug="people" sortby="link_name" sortorder="desc"]
  • http://www.rajiv.com/charity/
    • This list of charitable organizations with brief descriptions is generated by the plugin. Code:
    • [blogroll-links categoryslug="charity"]
  • http://www.rajiv.com/blog/2004/08/02/search-engines/
    • This list of search engines is maintained as Blogroll links in WordPress. Code:
    • [blogroll-links categoryslug="search-engines"]
  • http://www.rajiv.com/
    • The links shown under the “What’s featured here?” section are the ones I’ve categorized as featured in WordPress’ Blogroll links. Code:
    • <a title="featured" name="featured"></a>
      <h2>What's featured here?</h2>
      [blogroll-links categoryslug="featured" sortby="link_name" sortorder="desc"]
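
To make the code-tag format above a little more concrete, here is a small Python sketch that pulls the attributes out of a blogroll-links tag. It is only an illustration of the tag's shape, assuming every attribute is written as name="value" as in the examples above. It is not how the plugin or the WordPress Shortcode API parses tags, and the function name parse_blogroll_shortcodes is made up for this example.

```python
import re

# Matches [blogroll-links] with an optional run of attributes before the closing bracket
SHORTCODE_RE = re.compile(r'\[blogroll-links(\s[^\]]*)?\]')
# Matches one name="value" attribute
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_blogroll_shortcodes(text):
    """Return one dict of attributes per [blogroll-links ...] tag found in text."""
    results = []
    for match in SHORTCODE_RE.finditer(text):
        attr_text = match.group(1) or ""
        results.append(dict(ATTR_RE.findall(attr_text)))
    return results


page = (
    '<h3>My Pages on Social Networking Sites</h3>\n'
    '[blogroll-links categoryslug="rajiv-web" sortby="link_name" sortorder="desc"]\n'
    '[blogroll-links categoryslug="charity"]\n'
)

for attrs in parse_blogroll_shortcodes(page):
    print(attrs)
```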

Download & install plugin

  1. Wikipedia article explaining what an API, or application programming interface, is.
  2. It does not make you maintain yet another list of links.

The Kite Runner by Khaled Hosseini (Book Review)

The Kite Runner by Khaled Hosseini was as enjoyable to read on the Amazon Kindle as it would have been in a printed book. I started reading it on the plane during my flight back to NYC from Charlotte after speaking at a conference there. It took two evenings to complete.

The story is gripping and emotional: It makes use of back references and coincidences that fit in well for such a story touching Eastern cultures and societies. The descriptions of Afghanistan make you feel like you can relate to a place that is foreign to many of us. The depiction of the immigrant community in the San Francisco Bay Area feels like a genuine experience. Even though the author relates the storyline to Afghan history, giving the tale a realistic feel, he does not delve much into narrating the actual historical events like a part-history book. Instead, the book focuses on the characters and the plot, making it a thrilling experience to read throughout. The story isn’t a light read: It describes some of life’s gruesome realities. Overall, while he does employ cultural stereotypes, the author has captured the essence of different cultures and represents them well.

In most parts, the story feels real, as if it were someone’s amazing autobiography. Some coincidences do, however, feel too eerie to be true. I recommend it.

Rating: ★★★★☆

Below is an introduction to the book in a video interview with the author.

[Embedded YouTube video]

This Web Site is Now Hosted on Amazon EC2 & S3

This web site, www.rajiv.com, is now hosted on Amazon.com’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3). They are part of the Amazon Web Services offerings. If you are a technologist, I recommend EC2 and S3. To learn more about them, you can follow the links in this article.

Benefits of hosting a Web site on EC2 & S3

  • The hosting management is self-service. Anytime you want, you can provision additional servers yourself, immediately. Unlike with most traditional hosting companies, there is no need to contact their staff and wait for them to set up your server. On EC2, once you have signed up for an account and set up one server, you can provision (or decommission) additional servers within minutes. Even the initial setup is self-service.
  • EC2 enables you to increase or decrease capacity within minutes. You can commission one or hundreds of server instances simultaneously. Because this is all controlled with web service APIs, your application can automatically scale itself up and down depending on its needs. Billing is metered by the hour. This flexibility of EC2 can benefit many use cases:
    • If your web sites get seasonal traffic (e.g. a fashion site during shows) or can temporarily get much higher traffic for a period of time (e.g. a news site), EC2’s business model of paying for what you use by the hour is cost-effective and convenient.
    • If yours is the R&D or Skunkworks group at a large or medium size organization or a startup company with limited financial resources, renting servers from EC2 can have many benefits. You don’t have to make a capital investment to get a server farm up and running, nor make long-term financial commitments to rent infrastructure. You can even turn off servers when not in use, greatly saving costs.
  • It allows me to use the modern Ubuntu1 GNU/Linux operating system, Server Edition. Among Ubuntu’s many benefits are its user friendliness and ease of use. Software installations and upgrades are a breeze. That means less time is required to maintain the system while retaining the flexibility and power being a systems administrator gives.
  • EC2 has a lower total cost of ownership for me than most hosting providers’ virtual hosting or dedicated server plans. Shared (non virtual server) hosting is still cheaper, but no longer meets my sites’ requirements.2
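
To show why billing by the hour suits the seasonal traffic case above, here is a small arithmetic sketch in Python. Every number in it is hypothetical: the rate of 10 cents per instance-hour, the 30-day month, and the traffic pattern are assumptions chosen for easy arithmetic, not Amazon's actual prices.

```python
HOURS_PER_MONTH = 30 * 24   # 720, assuming a 30-day month
RATE_CENTS_PER_HOUR = 10    # hypothetical rate, not Amazon's actual pricing


def metered_cost_cents(baseline_servers, extra_servers, peak_hours,
                       rate=RATE_CENTS_PER_HOUR):
    """Cost when extra servers run only during the peak and are billed by the hour."""
    baseline_hours = baseline_servers * HOURS_PER_MONTH
    burst_hours = extra_servers * peak_hours
    return (baseline_hours + burst_hours) * rate


def always_on_cost_cents(total_servers, rate=RATE_CENTS_PER_HOUR):
    """Cost when every server needed for the peak runs all month."""
    return total_servers * HOURS_PER_MONTH * rate


# One server all month, plus four extra servers during a three-day (72-hour) peak
on_demand = metered_cost_cents(baseline_servers=1, extra_servers=4, peak_hours=72)
# Five servers kept running all month to cover the same peak
fixed = always_on_cost_cents(total_servers=5)

print(f"metered by the hour: ${on_demand / 100:.2f}")
print(f"always on:           ${fixed / 100:.2f}")
```

Under these assumed numbers the metered setup uses 1,008 instance-hours against 3,600 for the always-on setup. The actual savings depend entirely on real prices and on how short the peaks are.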

Potential drawbacks/caution with EC2 & S3

  • While S3 is persistent storage, EC2 virtual server instances’ storage does not persist across server shutdowns. So if your web site is running a database and storing files on an EC2 instance, you should implement scheduled, automated scripts that regularly back up your database and your files to S3 or other storage.
    • Consistent with what I read in some comments online, my EC2 virtual server instance did not lose its file-system state or settings when I rebooted it. So rebooting seems to be safe.3
    • This potential drawback is arguably a good thing in some ways. It compels you to implement a good backup and recovery system.
    • This also means that after installing all the software on your running Amazon Machine Image (AMI), you should save it by creating a new AMI image of it as explained in the Creating an Image section of the EC2 Getting Started Guide.
      • This is an issue since you may want to do this every time after you update your software, especially with security patches. Until Amazon implements persistent storage for EC2 instances, you could do this monthly. You can script this to be partly or fully automated. Since Amazon’s EC2 instances are quite reliable, this is not a major concern.
  • An EC2 instance’s IP address and public DNS name persist only while that instance is running. This can be worked around as described under the tech specs section below.
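
The scheduled backup mentioned in the first point above could be sketched roughly as follows in Python. This is a minimal sketch under stated assumptions, not the script this site uses: the function names are made up, and upload is a hypothetical callable that you supply (for example, a function wrapping your S3 client) so that the sketch itself needs only the standard library. A database would first have to be dumped to a file inside the directory being archived.

```python
import tarfile
import time
from pathlib import Path


def make_backup_archive(source_dir, dest_dir, prefix="site-backup"):
    """Write a timestamped .tar.gz of source_dir into dest_dir and return its path."""
    source = Path(source_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # UTC timestamp in the name keeps successive backups from overwriting each other
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archive_path = dest / f"{prefix}-{stamp}.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(source), arcname=source.name)
    return archive_path


def backup(source_dir, dest_dir, upload):
    """Archive source_dir, then hand the archive to upload (a caller-supplied callable)."""
    archive = make_backup_archive(source_dir, dest_dir)
    upload(archive)  # hypothetical hook: copy the archive to S3 or other storage
    return archive
```

Run from a scheduler such as cron, a script along these lines keeps a copy of the instance's files outside the instance. Keep dest_dir outside source_dir so an archive never ends up inside the next one.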

Some articles about Amazon’s hosting infrastructure services:

Tech specs of my site:

  1. www.ubuntu.com
  2. I plan to split rajiv.com into separate sites. The India Comedy site will move to comedy.rajiv.com and the SPV Alumni site will move to spv.rajiv.com. The latter two are community sites and will benefit from a community CMS like Drupal.
  3. However, please be aware of a known issue that on some occasions caused instance termination on reboots.
  4. I created my AMI virtual machine by building on top of a public Ubuntu AMI by Eric Hammond.