Amazon has suffered a disruption in its EC2 service, sparking furious debate in the community.
On Thursday, April 21st, at 1.41 AM PDT, Amazon’s EC2 service suffered a disruption that took down a string of websites, including Hootsuite, Reddit, Foursquare, Heroku and Quora.
Some have theorised that the outage was caused by an “auto-immune disease”, in which Amazon’s automated processes began re-mirroring a large number of EBS (Elastic Block Store) volumes. This could have significantly degraded EBS and RDS performance and availability, and affected more than one availability zone. Amazon have not yet posted a root cause analysis.
This outage has caused some to question the future of cloud computing. However, Stephen Nelson-Smith, Technical Director of Atalanta Systems, argues that we can benefit from the outage by focusing on what it can teach us and becoming better prepared for future disruptions. He proposes the following:
- Expect, and prepare for, downtime. Nelson-Smith advises Amazon Web Services (AWS) users to make use of autoscaling groups, to deploy in more than two availability zones, and to include headroom for load spikes. He also suggests having a written plan to follow in case of downtime.
- Think about how you use EBS. He warns against expecting EBS to behave in the same way as a NetApp, or to perform effectively when the network is saturated. He also points out that EBS snapshots should not be used as a backup.
- Consider working towards a vendor-neutral architecture.
“Outages are part of life – get used to it,” he summarises, urging the community not to shy away from cloud computing following the Amazon outage:
“One, albeit major, outage in one region of one cloud vendor doesn’t mean the cloud was a big con, a waste of time, a marketing person’s wet dream. The emperor isn’t naked, and the nay-sayers are simply enjoying their day of ‘I told you so’. The cloud is here to stay, and brings with it huge benefits to the IT industry. However, it does require a different approach to building systems. The cloud is not dead – it’s still great.”
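As a rough illustration of the advice above on availability zones and headroom, the arithmetic can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Nelson-Smith's own method: the function name, the 25% headroom figure and the load numbers are all hypothetical, and it assumes load is spread evenly across zones and that every instance handles the same amount of traffic.

```python
import math


def instances_per_zone(peak_load, per_instance_capacity, zones,
                       headroom=0.25, zones_lost=1):
    """Instances to run in each zone so that the zones left standing can
    still carry peak load plus headroom after `zones_lost` zones fail.

    All parameters are illustrative assumptions, not AWS settings.
    """
    if zones <= zones_lost:
        raise ValueError("need more zones than the number you plan to lose")

    surviving = zones - zones_lost
    # Capacity the surviving zones must provide, including spike headroom.
    required_capacity = peak_load * (1 + headroom)
    # Whole instances needed in total across the surviving zones.
    total_needed = math.ceil(required_capacity / per_instance_capacity)
    # Spread them evenly, rounding up so no zone is left short.
    return math.ceil(total_needed / surviving)


if __name__ == "__main__":
    # Hypothetical workload: 9,000 requests/s at peak, 500 requests/s per instance.
    for zone_count in (2, 3, 4):
        per_zone = instances_per_zone(9000, 500, zone_count)
        print(zone_count, "zones:", per_zone, "per zone,",
              per_zone * zone_count, "in total")
```

Under these assumptions, spreading the same workload over more zones lowers the total number of instances needed to survive the loss of one zone, because each remaining zone has to absorb a smaller share of the failed zone's traffic.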
George Reese takes this argument one step further, and claims that the outage exposes the strong points of cloud computing: namely, that it puts the developer in control of application availability. He states that the outage wasn’t Amazon’s fault; those whose systems failed either deemed an outage an acceptable risk, or failed to design for Amazon’s cloud computing model. Reese highlights Netflix as one AWS customer who managed to keep going throughout the outage. “Try doing that in your private IT infrastructure with the complete loss of a data center,” he says.
Software-as-a-service development company Lecere also reportedly kept its cloud-based FIRMS system running throughout the outage, by diverting it to Amazon Web Services’ west coast service.
However, others, such as Klint Finley, place the blame on AWS rather than on customers using EBS: “AWS has been offering the EBS service since 2008. It’s not considered a ‘beta’ product. Why shouldn’t customers be able to rely on it?” he argues.
Amazon have yet to post a statement regarding the disruption.