Disk Sizing Dilemma

My company offers a number of different "enterprise cloud" products for our customers, and one of them is a Hosted VM offering.  Customers can choose the level of availability and SLA they want (from 99%, non-HA, up to 100%, FT-protected), they get an included amount of VPU, RAM and production storage, and then they can build out the server they need from there.  It's a $0 capital alternative to buying a 1U or 2U server, it can be integrated into an existing co-lo environment, and we can replicate them between three different markets.  Customers love it, the growth of the product is great and so everything is peachy, right?

The trouble I'm running into as I go through the exercise of scaling out the infrastructure to support our growth over the next 12 months is that there seems to be a huge delta between the amount of space we present to the VM (the .vmdk) and the raw storage needed to support it.

For example:

Let's say (completely hypothetically, of course) that my customers had an average VM disk size of 85 GB.  If that's true, you'd expect that I'd be able to support 824 VMs out of 70 TB of usable space.
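Just to show the baseline math (nothing fancy here, and I'm using decimal TB so the numbers line up with the ones above):

```python
# Back-of-the-envelope: how many average-sized VMs fit if every GB of .vmdk
# maps one-to-one onto a GB of usable space on the array.

AVG_VMDK_GB = 85     # hypothetical average VM disk size from above
USABLE_TB = 70       # usable space on the array (decimal TB)

usable_gb = USABLE_TB * 1000
naive_vm_count = round(usable_gb / AVG_VMDK_GB)

print(f"Naive VM count: {naive_vm_count}")   # ~824
```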

However, you have overhead at every step of the way, don't you?  If you (again, completely hypothetically) use 512 GB as your standard LUN size, each of those LUNs is going to consume somewhere between 7% and 10% more actual space on the SAN.  While the VMFS partitioning is going to be pretty efficient, you are going to spend space inside the datastore not only on the individual .vmdk files, but also on the swap file that corresponds to the RAM each VM is using.  You also need some breathing room on the datastores, since you don't want those to fill up, so even if you hold that headroom to 10%, you are giving up 10% of your capacity.

By my math, that original 85 GB .vmdk file ends up being more than 133 GB of raw space required, or nearly 60% more than the actual size of the .vmdk.  That 70 TB of space now only holds 526 VMs, and I have a whole different business challenge to deal with.  The cost of the 70 TB didn't change, that's for sure.
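For the curious, here's roughly how I think about the layers stacking up.  The swap size and the exact overhead percentages below are placeholders rather than our real numbers, so the output won't land exactly on 133 GB; the point is how quickly the multipliers compound:

```python
# Sketch of how the per-layer overheads stack between a .vmdk and raw SAN space.
# The swap size and overhead percentages are illustrative assumptions; only the
# 85 GB disk and 70 TB usable figures come from the discussion above.

AVG_VMDK_GB = 85
USABLE_TB = 70

vswap_gb = 8          # assumed RAM per VM; its swap file lives in the datastore
headroom = 0.10       # keep ~10% of each datastore free
san_overhead = 0.10   # each LUN eats 7-10% more actual space on the SAN

# Space the VM occupies inside the datastore: its disk plus its swap file
in_datastore_gb = AVG_VMDK_GB + vswap_gb

# Gross that up for the free-space headroom, then for the SAN-level overhead
raw_per_vm_gb = in_datastore_gb / (1 - headroom) * (1 + san_overhead)

vm_count = int(USABLE_TB * 1000 // raw_per_vm_gb)

print(f"Raw space per VM: {raw_per_vm_gb:.1f} GB")
print(f"VMs that fit in {USABLE_TB} TB: {vm_count}")
```

(For reference, the 133 GB figure works out to roughly a 1.56x raw-to-provisioned multiplier, so the real environment clearly has more piling on than this simplified chain captures.)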

There's got to be a better way, right?  Thin provisioning might be part of the answer, but I only start to see real gains there by increasing the standard size of the VMFS datastore (thereby increasing the number of VMs I have in each container…), which then means I have to start worrying about how many VMs I have per LUN.  Any ideas out there on how I can be more efficient and bring the actual storage needed closer to the storage I'm generating revenue from?
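To put a number on the thin provisioning idea, here's the kind of what-if I've been sketching.  Both the utilization and the safety margin are pure guesses on my part; thin provisioning only pays off to the extent that customers aren't actually filling the disks they've bought:

```python
# What-if: with thin provisioning the datastore only holds the blocks guests have
# actually written, so the win depends on real utilization inside each 85 GB disk.
# Both the utilization and the safety margin below are assumed values.

AVG_VMDK_GB = 85
USABLE_TB = 70

avg_utilization = 0.50   # assumed: guests have written about half of their disk
safety_margin = 0.20     # assumed: extra buffer because thin disks keep growing

consumed_per_vm_gb = AVG_VMDK_GB * avg_utilization
planned_per_vm_gb = consumed_per_vm_gb * (1 + safety_margin)

vm_count = int(USABLE_TB * 1000 // planned_per_vm_gb)
print(f"Planning ~{planned_per_vm_gb:.0f} GB per VM -> roughly {vm_count} VMs in {USABLE_TB} TB")
```

Of course, that's exactly where the VMs-per-LUN worry comes back in: the more thin disks you stack on a datastore, the more closely you have to watch them grow.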