Cloud Infrastructure Might be Boring, but Data Center Infrastructure Is Hard

I love to point at “converged” infrastructure and call it boring.  I even have a slide that I use to kick off some presentations that says it in 66-point font.  But you know what isn’t boring?  Data center construction and operation.  In fact it’s damn hard, and we’ve seen that fact brought front and center over the last couple weeks.

In the six years I worked at Peak 10, from Ops and Engineering Director for the Charlotte data centers to Director of Managed Services, we certainly saw our share of issues.  Some of them were operational in nature, as we learned to scale and to adapt to data centers with different gear and requirements.  Some were engineering issues with equipment or configurations.  I’ve seen UPS systems and generators catch on fire, and witnessed improper grounding installations, breaker failures, and personnel and planning issues.  When something is very complex and grows incredibly fast, as both the company and the data centers did, things happen.

We certainly won more than we lost over that six-year period, and the availability of our services, cloud and facility alike, certainly rivaled the best in the industry.  We weren’t alone in our successes or our failures, either.  The industry in general saw incredible growth, with data center companies like Digital Realty Trust, Terremark, 365 Main, Switch & Data, Savvis, Hosting.com, Hosted Solutions and many, many others buying or being bought as the smaller players consolidated.  Investment groups like ABRY Partners and Welsh Carson led the charge in seeking out these high-growth companies and making the necessary investments in them.  In 2011 alone, there was over $12 billion in M&A activity in the data center sector.

It is, however, noteworthy that not every acquisition I listed above was precipitated by strong numbers and strong performance.  Some acquisitions happen when a company with good facilities realizes that the building alone isn’t enough to keep customers happy.  Without the operational acumen and discipline to match, even the best of data centers will go off-line.  The 365 Main disaster from 2007 is a representative example, where a facility built to survive anything didn’t survive a simple power outage.  No one was surprised when they were eventually bought by DRT.

So, all this brings us back to the AWS outages in Northern Virginia last week.  Looking at the official Amazon summary, there were both facility and software failures involved, an especially painful fact considering how AWS positions the software as the way to remain resilient even when the facility fails.  I’m not an expert on the AWS software side, so I’ll leave it to others better qualified to comment on how/why the software failures happened.

But on the facilities front, it’s hard to see how the month of June was anything short of a disaster for Amazon on the data center operations side.  Data Center Knowledge has, as always, a fantastic series of recaps and interpretations on their site.  On June 14th, an extended outage occurred when a defective cooling fan on a generator failed and brought the generator off-line.  If that wasn’t bad enough, the failover of the load from the secondary to the tertiary source failed as well: a misconfigured breaker opened when the load transfer happened.  After the incident, Amazon conducted an audit and found another misconfigured breaker (my guess: it’s the one that handles the load transfer in the opposite direction), which was fixed.  Amazon was also quoted as saying:

“We’ve now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes.”
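That last commitment is worth dwelling on, because a breaker-configuration audit is exactly the kind of check you can script once the expected settings from a coordination study live somewhere machine-readable.  Here is a minimal sketch of the idea; the breaker names and settings below are entirely hypothetical and come from me, not from Amazon’s summary:

```python
# Hypothetical sketch of an automated breaker-configuration audit.
# Expected settings would come from the coordination study of record.

EXPECTED = {
    # breaker_id: (trip_setting_amps, auto_transfer_enabled)
    "ATS-1-primary":   (4000, True),
    "ATS-1-secondary": (4000, True),
    "ATS-1-tertiary":  (4000, True),
}

def audit_breakers(as_found):
    """Compare as-found breaker settings from a site walkthrough against
    the coordination study.  `as_found` maps breaker_id -> (amps, auto)."""
    problems = []
    for breaker_id, expected in EXPECTED.items():
        actual = as_found.get(breaker_id)
        if actual is None:
            problems.append(f"{breaker_id}: missing from audit data")
        elif actual != expected:
            problems.append(f"{breaker_id}: expected {expected}, found {actual}")
    return problems

if __name__ == "__main__":
    # A misconfigured tertiary breaker, like the one that opened on transfer.
    site_audit = {
        "ATS-1-primary":   (4000, True),
        "ATS-1-secondary": (4000, True),
        "ATS-1-tertiary":  (3200, False),
    }
    for issue in audit_breakers(site_audit):
        print(issue)
```

The point isn’t the code; it’s that “validated worldwide” should mean a repeatable check that runs as part of regular testing and audits, not a one-time walkthrough with a clipboard.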

Last week, the culprit was “the repeated failure of multiple generators”, which required manual intervention.  Oops.

These outages came after a series of high-profile power outages, some in the same data center in 2010.  An “electrical ground fault and short circuit” took out power on May 8th, 2010.  A botched maintenance procedure caused an outage on May 4th, 2010, compounded by a UPS that didn’t sense an input power failure.  Later that day, “human error” caused a generator that was supporting critical load to shut down.

In each case, AWS was quick to point out that users always have the ability to deploy instances across multiple availability zones, advice which didn’t exactly work as planned with the EBS issues seen last week (a quick sketch of that multi-AZ approach follows the list below).  They also say that the power incidents are unrelated.  I strenuously disagree, since all of them can be traced back to a lack of discipline in the operation of the data centers in question.  I think this quote proves that out perfectly:

In the meantime, Amazon said it would adjust several settings in the process that switches the electrical load to the generators, making it easier to transfer power in the event the generators start slowly or experience uneven power quality as they come online. The company will also have additional staff available to start the generators manually if needed.

  • What kind of enterprise data center doesn’t have a breaker coordination study completed and on file, both at the time of commissioning and every time the load of the data center crosses a pre-set threshold?
  • We never, ever did a power maintenance of any kind without notifying customers as far in advance as we could, if for no other reason than so they could have staff available just in case.  I’m not aware that Amazon provides its customers any kind of notice of maintenance.
  • “Electrical ground fault and short circuit” is data center code for “poorly wired but it hadn’t bitten us yet.”
  • When a human being has access to a generator that is supporting critical load and is able to bring that generator off-line accidentally, you have an operational problem.
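As an aside, here is what that multi-AZ advice looks like in practice, using the boto library of that era.  This is only a sketch: the AMI ID, key pair, and security group are placeholders of my own, and as last week showed, spreading instances across zones doesn’t help much if the EBS layer underneath them misbehaves.

```python
# A minimal multi-AZ deployment sketch with boto (placeholder values throughout).
import boto.ec2

REGION = "us-east-1"
ZONES = ["us-east-1a", "us-east-1b"]   # spread instances across availability zones

conn = boto.ec2.connect_to_region(REGION)

for zone in ZONES:
    # One instance per zone; a real deployment would also spread volumes,
    # load balancers, and state, which is where the advice fell short.
    conn.run_instances(
        "ami-00000000",            # placeholder AMI
        key_name="example-key",    # placeholder key pair
        instance_type="m1.small",
        placement=zone,
        security_groups=["default"],
    )
```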

When generators and switch gear that were installed and commissioned less than 18 months earlier fail spectacularly, and when the resolution is to start generators manually and adjust the ATS failover process, you have an operational issue.  My guess is that the “full load test” that Amazon discusses in their summary was done under controlled conditions, not by cutting input power to the UPS and letting the ATS move the load.  If you don’t test under the conditions you expect to see when there’s an actual emergency, that’s not a full load test.  Sure, manually opening up input breakers has risks as well, but that’s what you are expecting to happen when the power goes out, right?  Awfully nice of Amazon to throw the gear under the bus, and then immediately follow that up with this quote:

Therefore, prior to completing the engineering work mentioned above, we will lengthen the amount of time the electrical switching equipment gives the generators to reach stable power before the switch board assesses whether the generators are ready to accept the full power load. Additionally, we will expand the power quality tolerances allowed when evaluating whether to switch the load to generator power. We will expand the size of the onsite 24×7 engineering staff to ensure that if there is a repeat event, the switch to generator will be completed manually (if necessary) before UPSs discharge and there is any customer impact.

So was this an equipment issue or an operational issue?  If it’s equipment-related, why are you changing ATS parameters?  My guess is that the original parameters were set and tested under a specific load, and as the draw of the data center increased the settings were never adjusted.  Generators, UPS systems, ATS gear, distribution panels, floor PDUs, breaker settings, EVERYTHING behaves differently under load, and if you aren’t keeping the data center processes in sync with your load growth you are asking for these kinds of issues.
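To make those two knobs concrete, here is a purely conceptual sketch of the transfer decision: one parameter for how long the generators get to reach stable power, and one for how wide the acceptable power-quality window is.  The nominal values and tolerances are made up, and real transfer-switch logic lives in relays and firmware, not Python, but it shows why both settings depend on the load and have to be revisited as the facility fills up.

```python
# Conceptual sketch only -- illustrative values, not Amazon's settings.
import time

STABILIZE_TIMEOUT_S = 30      # how long the generators get to reach stable power
NOMINAL_VOLTS = 480.0
NOMINAL_HZ = 60.0
VOLTAGE_TOLERANCE = 0.05      # +/- 5% of nominal voltage
FREQUENCY_TOLERANCE = 0.5     # +/- 0.5 Hz

def within_tolerance(volts, hz):
    """Is the generator output inside the acceptable power-quality window?"""
    return (abs(volts - NOMINAL_VOLTS) <= NOMINAL_VOLTS * VOLTAGE_TOLERANCE
            and abs(hz - NOMINAL_HZ) <= FREQUENCY_TOLERANCE)

def transfer_to_generator(read_output, transfer_load, alert_operators):
    """Wait for generator power quality to settle, then transfer the load.
    If it never settles before the timeout, call for manual intervention
    (hopefully before the UPS batteries run down)."""
    deadline = time.time() + STABILIZE_TIMEOUT_S
    while time.time() < deadline:
        volts, hz = read_output()
        if within_tolerance(volts, hz):
            transfer_load()
            return True
        time.sleep(0.5)
    alert_operators("Generators never reached stable power; transfer manually")
    return False

if __name__ == "__main__":
    # Simulate generators that come up slowly with sagging voltage,
    # the failure mode Amazon describes, then stabilize.
    readings = iter([(430.0, 57.5), (452.0, 58.8), (466.0, 59.4)]
                    + [(478.0, 59.9)] * 100)
    transfer_to_generator(
        read_output=lambda: next(readings),
        transfer_load=lambda: print("Load transferred to generator"),
        alert_operators=print,
    )
```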

Of course, it’s possible that the data center and the staff running it don’t belong to Amazon at all.  I know that Amazon leases a significant amount of floor space from DRT, although that doesn’t appear to be the case in NoVA, according to this release.  Even if it were, however, it would only make Amazon more culpable, not less.  It’s one thing if you own the facility and need to salvage the investment, but if you are leasing the space and stay after failures like this, you have to ask why.  That being said, DRT has a pretty good track record for data center operations, and I’d be very surprised if this was a DRT issue.

So what’s next for Amazon?  I wish they would just ditch the US East-1 data center that keeps giving them problems.  Of course, the vast, vast majority of AWS instances are located there, so a move may involve acquiring more floor space.  Whatever the costs, it can’t be as expensive as having large customers move away from the service in part because of the data center’s availability problems.  It can’t be more expensive than the miserable PR that they have received (including all their competitors piling on in ways large and small).  Sometimes you have to burn the data center to save your reputation.

Bigger picture, this complexity is the reason that companies that need uptime continue to find outsourcing partners.  It’s understandable that most of the people in the industry are infatuated with IaaS, PaaS and SaaS, but remember that none of those services are worth a damn if the data center has no power.  Do everyone a favor and go find someone who is excellent at keeping the lights on!