Dear Engine Yard Customers,
As many of you know, we experienced a severe outage at our west coast data center yesterday; many of our customers were affected and experienced several hours of downtime. Our engineers became aware of the problem as soon as it occurred, and began the relevant data center escalation procedures.
Engine Yard customers rely on us to run and support their business-critical applications, and that includes relying on our selection of vendors. In this case, we have failed to meet our service level agreements with our west coast customers, and we will, of course, be providing customers with the appropriate service credits.
In the attached report, I have detailed yesterday’s issues, as well as the swift steps we are taking to ensure that this does not happen again. We sincerely apologize for this outage, and we are more committed than ever to providing the level of service Engine Yard customers have come to expect.
If you have additional questions, members of our technical teams are available to answer them; emails can be sent to email@example.com.
Here at Engine Yard we are major supporters of Ruby and Rails. We understand that in order to grow, our ecosystem needs a network of reliable and professional service providers, and we intend to deliver.
What Happened
Yesterday, March 30th, at 9:00 a.m. (PST), our west coast data center lost internet connectivity. Our support engineers detected the outage immediately and began investigating the cause. Once we confirmed that the cause was connectivity, we posted the first update to our status blog (9:19 a.m.). We continued to post updates for customers as we received new information from Herakles.
We were in touch with Herakles senior management for updates at 15-minute intervals. Connectivity began to be restored at approximately 1:30 p.m., and service was fully restored for all customers by 3:45 p.m. The outage affected about two-thirds of our customer base.
Why It Happened
Our data center provider, Herakles, maintains redundant internet uplinks with redundant equipment. Normally, the failure of a single internet uplink or switch prompts a failover event, with minimal loss of connectivity. In this case, however, the route processor of one of the redundant switches (a Cisco 6509) malfunctioned. As part of the malfunction, the device stopped seeing its BGP peers as active and concluded that they had failed. As a result, the device incorrectly promoted itself to master switch and stopped passing traffic inbound or outbound. Complicating the matter, the alerts from the malfunctioning switch that should have notified Herakles' monitoring systems of the failure were themselves not routed past the switch.
How It Was Repaired
Herakles data center network engineers worked with Cisco on-site engineers and began debugging the failed switch immediately. The first attempt to repair the switch, by replacing its route processor, failed. After additional troubleshooting steps, the support engineers physically disconnected the malfunctioning switch, forcing the redundant switch to take over as master. This fully restored traffic, but it has left the internet uplink without switch-level redundancy for now.
Herakles is currently testing a new redundant switch in its test lab and will install it during a scheduled maintenance window as soon as possible. When we receive notice of the scheduled maintenance window from Herakles, we will immediately communicate it to customers.
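For readers who want to trace the sequence of events step by step, the short Python sketch below models the failure and the repair described above. It is a deliberately simplified illustration, not a description of how the Cisco 6509 or the Herakles network actually implements failover. The `Switch` class and every function name are hypothetical, and the sketch assumes that the faulty switch took the master role away from the healthy one.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    name: str
    healthy: bool = True      # route processor working and forwarding traffic
    connected: bool = True    # physically cabled into the uplink
    is_master: bool = False


def master_of(pair):
    """Return the connected switch that currently claims the master role."""
    return next((s for s in pair if s.connected and s.is_master), None)


def malfunction(pair, switch):
    """Model the fault: the switch loses sight of its peers, assumes they
    have failed, and promotes itself to master while no longer forwarding.
    (Assumption: the healthy switch yields the master role.)"""
    switch.healthy = False
    for s in pair:
        s.is_master = s is switch


def fail_over(pair):
    """A healthy standby takes over only when no connected switch claims master."""
    if master_of(pair) is None:
        standby = next((s for s in pair if s.connected and s.healthy), None)
        if standby is not None:
            standby.is_master = True


def traffic_passes(pair):
    master = master_of(pair)
    return master is not None and master.healthy


def in_band_alert_delivered(pair):
    """Alerts sent through the uplink arrive only if the uplink passes traffic."""
    return traffic_passes(pair)


def has_redundancy(pair):
    return sum(s.connected and s.healthy for s in pair) > 1


def simulate():
    a, b = Switch("switch-a", is_master=True), Switch("switch-b")
    pair = [a, b]
    timeline = [("normal", traffic_passes(pair), has_redundancy(pair))]

    malfunction(pair, b)
    fail_over(pair)  # no effect: the faulty switch still claims master
    timeline.append(("fault", traffic_passes(pair), in_band_alert_delivered(pair)))

    b.connected = False  # the manual fix: unplug the faulty switch
    fail_over(pair)
    timeline.append(("repaired", traffic_passes(pair), has_redundancy(pair)))
    return timeline
```

In this model, `simulate()` reports traffic flowing with redundancy in the normal state, no traffic and no in-band alert during the fault, and traffic restored without redundancy once the faulty switch is unplugged. That last state is the one the uplink is in until the replacement switch is installed.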
Engine Yard Plans
In September 2008, we began the process of adding an alternative connectivity provider to our west coast data center. We chose the provider that already serves our east coast data center.
Since the new provider did not yet have a presence in Herakles, this process has taken several months. By April 15th, we will be able to offer this provider as an alternative. At that time, we will coordinate with customers who wish to move to the new provider.