Ouch

Today was brutal, and my head hurts.

"We choose to [do these] things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win…"

I thought of this quote from JFK today as we were beating our heads against another series of issues and challenges that seem to come about only because of our multi-tenant/service provider business model.  I have discovered that one of my pet peeves is hearing someone say "Man, if this were an enterprise environment we'd be done already!"  The reality is that we continue to run into things, large and small, that just take time to work through.

Some of our hurdles over the last couple of days include:

  • Network, network, network.  We knew coming in that the legacy network, in place for 10+ years, was going to be an issue.  What I didn't realize was how much of an issue.  What appeared to me to be just an overly simple network design (completely separate physical networks for every service we deliver) has become our single biggest challenge.  Getting everything from the Nexus 7k down to the single pair of 10Gb vNICs has been an exercise in engineering flexibility upstream.  We've identified five different provisioning scenarios that require different network layouts, and trying to fit that into our internal processes and deliver it in real time to the customer is going to be difficult at best (there's a rough sketch of the problem after this list).  A lot of thinking is going to go into this over the next two days in an effort to bring as much operational efficiency to the network as we can.
  • Storage.  The VMax has been…uncooperative.  We still don't have the VMFS LUNs presented, or more correctly we do have them presented and we can see them, but the devices show as inactive and inaccessible.  Boot from SAN has been a challenge, and we've struggled to determine whether the issues are on the UCS/scripting side or on the storage array itself.  The team is busting it to get this behind us, and hopefully we'll have it solved early in the day today.
  • Capacity vs. IO.  As we sat with David Gadwah yesterday and went through the low-watermark and high-watermark builds for the Type 1 vBlock, we came to the conclusion that with the standards as they are now, we might not be able to generate enough revenue out of the platform to afford it!  Talk about taking the wind out of my sails.  The default storage build for the Type 1 includes about 40TB of usable storage, split across EFD, FC, and SATA disk.  That buildout achieves an aggregate of about 35,720 IO/s, which is just incredibly high for our real-world load.  We are pushing to maintain the IO but reconfigure how we get there.  Hopefully we can increase the amount of storage we get to have, since that's going to track pretty directly to the revenue we can generate (the second sketch below walks through the rough math).
  • Compute node limitations.  The basic Type 1 vBlock includes one pair of Nexus 6120 switches and four UCS chassis.  We've been getting some pushback on whether we can add additional compute nodes to the same Nexus 7k/CX480 stack.  Being able to scale compute nodes would definitely help the revenue generation model, but no one seems to know how far we can push the vBlock reference architecture before it's not a vBlock anymore.
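
To make the network problem a little more concrete, here's a minimal sketch in Python of the kind of mapping we're trying to automate.  The scenario names, network roles, and VLAN numbering are all made up for illustration, not our actual layouts; the point is that each provisioning scenario has to collapse what used to be separate physical networks into a set of VLANs trunked down a single pair of 10Gb vNICs, and the process has to spit that out per tenant in real time.

```python
# Hypothetical sketch: scenario names, network roles, and VLAN IDs are
# invented for illustration. The idea is collapsing separate physical
# networks into VLANs trunked over one pair of 10Gb vNICs per blade.

# Each provisioning scenario maps to the set of networks a tenant needs.
SCENARIOS = {
    "dedicated-compute":    ["mgmt", "vmotion", "tenant-data"],
    "shared-compute":       ["mgmt", "vmotion", "tenant-data", "backup"],
    "dmz-hosting":          ["mgmt", "tenant-data", "dmz"],
    "managed-storage-only": ["mgmt", "nfs"],
    "full-stack":           ["mgmt", "vmotion", "tenant-data", "backup", "dmz", "nfs"],
}

# Base VLAN offset per network role; a tenant's VLANs are carved out of a
# per-tenant block so everything can ride the same physical uplinks.
VLAN_BASE = {"mgmt": 100, "vmotion": 200, "tenant-data": 300,
             "backup": 400, "dmz": 500, "nfs": 600}


def vnic_allowed_vlans(scenario: str, tenant_id: int) -> list[int]:
    """Return the VLAN list to trunk on the tenant's vNIC pair."""
    try:
        networks = SCENARIOS[scenario]
    except KeyError:
        raise ValueError(f"unknown provisioning scenario: {scenario}")
    return sorted(VLAN_BASE[net] + tenant_id for net in networks)


if __name__ == "__main__":
    # Tenant 42 ordering the "dmz-hosting" scenario gets three VLANs
    # trunked down its vNIC pair instead of three physical networks.
    print(vnic_allowed_vlans("dmz-hosting", 42))  # [142, 342, 542]
```

The lookup itself is trivial; the hard part is that each scenario also implies different configuration upstream on the Nexus 7k, which is exactly where the operational-efficiency work over the next two days has to land.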

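For the capacity-versus-IO question, the back-of-the-napkin math looks something like the sketch below.  The drive counts, usable capacity per drive, and per-spindle IOPS are illustrative guesses on my part, not the actual Type 1 build sheet, but they show how an EFD/FC/SATA mix lands near the ~40TB usable / ~35,720 IO/s numbers, and how shifting the spindle mix can hold the IO while adding the sellable terabytes that actually drive revenue.

```python
# Illustrative only: drive counts, usable TB per drive, and IOPS per
# drive are rough rules of thumb assumed for this sketch, not the
# actual Type 1 vBlock storage build.

TIERS = {
    # tier:   (drive count, usable TB per drive, IOPS per drive)
    "EFD":    (8,  0.20, 2500),
    "FC 15k": (80, 0.34, 180),
    "SATA":   (16, 0.70, 80),
}


def totals(tiers):
    """Aggregate usable capacity (TB) and IOPS across all tiers."""
    capacity = sum(count * tb for count, tb, _ in tiers.values())
    iops = sum(count * per_disk for count, _, per_disk in tiers.values())
    return capacity, iops


cap, io = totals(TIERS)
print(f"default-ish build:    ~{cap:.0f} TB usable, ~{io:,.0f} IO/s")

# Same idea, reconfigured for capacity: a little more EFD to hold the IO
# number, fewer FC spindles, a lot more SATA for sellable terabytes.
RECONFIGURED = dict(TIERS)
RECONFIGURED["EFD"] = (10, 0.20, 2500)
RECONFIGURED["FC 15k"] = (48, 0.34, 180)
RECONFIGURED["SATA"] = (64, 0.70, 80)

cap2, io2 = totals(RECONFIGURED)
print(f"capacity-heavy build: ~{cap2:.0f} TB usable, ~{io2:,.0f} IO/s")
```

The exact numbers aren't the point; the point is that the spindle mix, not the raw terabytes, is what sets the IO figure, and the terabytes are what we bill against.
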
Tomorrow is a new day, and we are going to start tackling these one at a time.  It's ironic that the technology is the least of the concerns at this point, and that we are hyper-focused on the business and configuration side.  Luckily we continue to have an incredible amount of resources provided to us by the VCE team, and I'm confident that we can make this work.