Broken Clouds and the Promise of PaaS

Right up front, I’ll let you know I don’t have any additional information on why AWS failed last week, nor do I have any opinion on Amazon’s communication (or lack thereof) during the outage.  I’m not an AWS customer, I’m not a security analyst and I don’t really have a dog in the IaaS fight as it stands today.  If you are looking for any of these things, move along; there’s nothing to see here…

What I do have is an appreciation for, and a perspective on, what both the provider and the customer are feeling right now.  So we can talk about that.  I think this event also lays bare some of the fundamental truths that both providers and customers have forgotten in this era of on-demand, black-box, dirt-cheap compute, where you can swipe a credit card to run the infrastructure your multi-million-dollar company requires.  The most central of those truths is also the most basic:

Shit always happens.   

The goal is not to run your infrastructure as cheaply as possible.  It’s not to embrace a service model defined in a NIST paper.  It’s not to leverage local data centers so you can always touch your servers.  It’s certainly not to take your own journey to the cloud.  It’s to run a business.  Define a need in the market, service that need, earn the trust of and grow a customer base and drive whatever metric it is that your business uses to measure success.  NONE of that has anything to do with servers, virtualization, cloud or any of the stuff that broke last week.  So why are there so many people blaming Amazon for the effect that last week’s outage had on their businesses?  Second fundamental rule that’s been forgotten:

Your customer doesn’t care about your infrastructure.

First, let’s stipulate that last week’s failure isn’t anything special.  It’s happened before, at Amazon and elsewhere.  365 Main, Rackspace and others have all had their turn in the spotlight for different reasons.  Here’s a good recap of the carnage from 2009.  In every case we’ve heard the same screams/rants/well-articulated thoughts from the customers affected:

1) Why did I find out from my customers before I found out from my provider?
2) Why wasn’t support more responsive, especially since I paid for the “premium” level?
3) Why didn’t my provider communicate better?
4) Why was my provider overly optimistic about their estimated recovery time?
5) You mean this is all I get based on my SLA?

Someone should print up a cheat-sheet, fill-in-the-blank style so we can save everyone some cycles the next time a major hosting provider fails.  Third fundamental rule that’s been forgotten:

Blaming your provider for not being able to service your customers is like blaming the rain for getting your car wet when you leave the sunroof open. 

The general movement towards Infrastructure as a Service (please, let’s not use the “C” word here, mmkay?) has had some very useful byproducts.  Because the people running the infrastructure are focused on it as their primary business, the overall state of that infrastructure has gotten better.  Economies of scale have allowed companies who would never have been able to afford it on their own to run their compute workloads on enterprise-class hardware sitting in world-class data centers managed by some of the best in the business.  The reality is that there are dozens of failures every day inside every large hosting company, but because of the design of the infrastructure the customers never see it.  The customer is buying an SLA, usually dealing with availability of the service itself.  Rarely does the customer have direct control over any specific piece of hardware, and in most cases the customer will never see the actual hardware itself.

The downside to this is that because customers don’t see all of the complexity and the failure rate (even when it doesn’t affect their service), they get lulled into a false sense of security, thinking that simply hosting the workload with one of these providers makes them safe.  Another fundamental truth forgotten:

The fact that your provider does such a good job with the infrastructure day-to-day means that when they have a bad day it’s going to be a VERY bad day.  Outages may be less frequent, but when they happen it’s going to be painful.

Think about the complexity of an issue that can take down a huge portion of the AWS storage system.  I bet they (literally) have rocket scientists working on the design of that platform, so when it fails it’s going to be catastrophic.  Hell, if it were something simple, chances are that it would have been caught early on and planned for.  It’s a lot like the space shuttle in that it’s built by some of the smartest people on the planet, and when something goes wrong it REALLY goes wrong.

So what about the customer?  I saw a tweet from @storagezilla this morning about a company that was begging for help on the Amazon forums because their servers were off-line.  You know what their business was? THEY PROVIDED CARDIAC MONITORING TO IN-HOME PATIENTS!!  Seriously?  Here’s a business, most likely funded by Medicaid/Medicare, that is providing vital healthcare to people, and they have an application that is hosted in a black box with no redundancy.  Do you think the doctors who rely on getting those monitoring stats care about AWS storage?  Do you think the patients care about where the service is hosted?  I promise they don’t.  Fundamental truth forgotten #5:

Your provider owns the infrastructure.  YOU own the application.

Denis Guyadeen and I had a quick back-and-forth on Twitter this morning and he made the following comment: “If your [sic] an experienced AWS Cloud Architect you design for fail as per best practices.”  I 100% agree with him.  Honestly, AWS gives their customers more resources to protect against failure than most providers, with multiple zones for DR planning.  The problem is that the cheaper and more accessible you make a service, the lower the lowest common denominator goes.  There are possibly tens of thousands of small businesses that have put their workloads on AWS not because it’s the most resilient or best performing, but because there’s $0 in capital outlay and the workloads are generally cheap to run.  I promise none of those businesses have ever seen an AWS architecture document.  Losing sight of the customer has long-lasting repercussions, and the sad truth is that there will be businesses that never recover from this.  I can’t find it in me to feel sorry for them any more than I feel sorry for the guy in Vegas who bets his house on a spin of the wheel.  You place your bets and you take your chances.  You can’t be jealous of the winners, and you can’t feel sorry for the losers.
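
For anyone who hasn’t seen what “design for fail” looks like in practice, here is a minimal sketch of the idea, assuming an application that can reach the same service in more than one zone.  Everything in it is hypothetical: the zone names, call_with_failover, AllZonesFailed and fetch_reading are made up for illustration and are not part of any AWS API.  The only point is that the application, not the provider, decides what happens when one zone goes dark.

```python
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


class AllZonesFailed(Exception):
    """Raised when the request failed in every zone we tried (hypothetical)."""

    def __init__(self, errors: Dict[str, Exception]):
        super().__init__(f"request failed in all zones: {sorted(errors)}")
        self.errors = errors


def call_with_failover(zones: Iterable[str], request: Callable[[str], T]) -> T:
    """Try the request in each zone in order and return the first success."""
    errors: Dict[str, Exception] = {}
    for zone in zones:
        try:
            return request(zone)
        except Exception as exc:  # sketch only: real code would catch narrower errors
            errors[zone] = exc
    raise AllZonesFailed(errors)


# Usage: a stand-in request function where the first zone is having a bad day.
def fetch_reading(zone: str) -> str:
    if zone == "zone-a":
        raise ConnectionError("storage unavailable")
    return f"reading served from {zone}"


result = call_with_failover(["zone-a", "zone-b"], fetch_reading)
print(result)  # reading served from zone-b
```

A real deployment also needs the data to exist in the second zone, which is the hard and expensive part; the sketch only shows that the failover decision lives in code the customer owns.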

So how does PaaS figure into this?  The move towards outsourcing not just the infrastructure but the application platform itself is something I’m both scared of and optimistic about.  I’m scared because it’s one more level of visibility that the customer is abdicating, which will inevitably lead to them forgetting more truths about that layer.  If there are customers who assume that their workloads are protected just because they are running on a VMware-based IaaS product, what are they going to think when they are programming directly against, and consuming, a VMware-based PaaS environment?  How does that affect coding practices?  How does that affect code security?  These are the things I worry about.

The good news is that I can also see where PaaS can have the same effect on code that IaaS has had on infrastructure.  If the frameworks are being maintained by people who live in that world every day, maybe some of the basics can be built in.  Since IaaS can’t be the silver bullet for application-level redundancy, simply because of the number of workloads out there with different personalities, maybe we can start to build some of that functionality into the framework one level higher.  Imagine if you deployed an app to your provider and had the ability to choose whether it was protected in multiple geographic locations in a way that’s transparent to the end-user.  The data center and the hypervisor can’t make that happen alone, but would it be possible if the application framework were a willing accomplice?  I don’t know, but I’m hopeful.
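
To make that last idea a little more concrete, here is a rough sketch of what “choose your protection at deploy time” might look like from the developer’s side.  This is not any real PaaS API; Protection, Deployment, plan_placement and the region names are all invented for illustration, and an actual platform would also have to handle data replication and request routing behind the scenes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Tuple


class Protection(Enum):
    """Hypothetical protection levels a developer could pick at deploy time."""
    SINGLE_SITE = "single-site"
    MULTI_REGION = "multi-region"


@dataclass(frozen=True)
class Deployment:
    app: str
    protection: Protection
    regions: Tuple[str, ...]


def plan_placement(app: str, protection: Protection,
                   available_regions: Sequence[str], copies: int = 2) -> Deployment:
    """Pick the regions an app should run in for the requested protection level."""
    if not available_regions:
        raise ValueError("no regions available")
    if protection is Protection.SINGLE_SITE:
        chosen = tuple(available_regions[:1])
    else:
        if len(available_regions) < copies:
            raise ValueError("not enough regions for the requested protection")
        chosen = tuple(available_regions[:copies])
    return Deployment(app=app, protection=protection, regions=chosen)


# Usage: the developer states intent; the platform decides where the copies go.
deployment = plan_placement("cardiac-monitor", Protection.MULTI_REGION,
                            ["us-east", "us-west", "eu-west"])
print(deployment.regions)  # ('us-east', 'us-west')
```

The design choice worth noticing is that the developer only declares an intent, and the framework owns the placement.  That is the “willing accomplice” part.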