It's been more than a week after Amazon's Elastic Compute Cloud (EC2) service went down, taking sites like Reddit, Quora and Foursquare with it. Amazon is finally issuing an apology and in-depth explanation on how and why the cloud service failed so spectacularly.
The issues began a 12:47AM PT on April 21st, when an incorrect network change occurred on Amazon's Elastic Block Storage. NewEnterprise's Arik Hesseldahl notes that Amazon said it will "increase automation" to prevent the mistake from happening again. He says, "From that statement I gather that it was a human-caused mistake that was then exacerbated by the way the cloud system was designed to work."
For those affected by the downtime, Amazon will be providing an automatic 10 day credit "equal to 100% of their usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone." Additionally, Amazon promises improved communication in the case of future downtime.
Here's the apology, and check out the full technical explanation here.
Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.