Elephant in the room: most of the cloud resources are under-utilisedWe almost universally underestimate how long it takes to build a software feature. Not sure it is because our time is felt more precious than money, but for hardware almost always the reverse is true: we always overestimate hardware requirements of our systems. Historically this could have been useful since commissioning hardware in enterprises usually a long and painful process and on the other hand, this included business growth over the years and planned contingency for spikes.
But in an elastic environment such as cloud? Well it seems we still do that. In UK alone £1bn is wasted on unused or under-utilised cloud resource.
Some of this is avoidable, by using elasticity of the cloud and scaling up and down as needed. Many cloud vendors provide such functionality out of the box with little or no coding. But many companies already do that, so why waste is so high?
From personal experience I can give you a few reasons why my systems do that...
Instance RedundancyRedundancy is one of the biggest killers in the computing costs. And things do not change a lot being in the cloud: vendors' availability SLAs usually are defined in a context of redundancy and to be frank, some of it purely cloud related. For example, on Azure you need to have your VMs in an "availability set" to qualify for VM SLAs. In other words, at least 2 or more VMs are needed since your VMs could be taken for patching at any time but within an availability zone this is guaranteed not to happen on all machines in the same availability zone at the same time.
The problem is, unless you are company with massive number of customers, even a small instance VM could suffice for your needs - or even for a big company with many internal services, some services might not need big resource allocation.
Looking from another angle, adopting Microservices will mean you can iterate your services more quickly releasing more often. The catch is, the clients will not be able to upgrade at the same time and you have to be prepared to run multiple versions of the same service/microservice. Old versions of the API cannot be decommissioned until all clients are weaned off the old one and moved to the newer versions. Translation? Well some of your versions will have to run on the shoestring budget to justify their existence.
Containerisation helps you to tap into this resource, reducing the cost by running multiple services on the same VM. A system usually requires at least 2 or 3 active instances - allowing for redundancy. Small services loaded into containers can be co-located on the same instances allowing for higher utilisation of the resources and reduction of cost.
|Improved utilisation by service co-location
This ain't rocket science...
Resource RedundancyMost services have different resource requirements. Whether Network, Disk, CPU or memory, some resources are used more heavily that others. A service encapsulating an algorithm will be mainly CPU-heavy while an HTTP API could benefit from local caching of resources. While cloud vendors provide different VM setups that can be geared for memory, Disk IO or CPU, a system still usually leaves a lot of redundant resources.
Possible best explained in the pictures below. No rocket science here either but mixing services that have different resource allocation profiles gives us best utilisation.
|Co-location of Microservices having different resource allocation profile
And what's that got to do with Microservices?Didn't you just see it?! Building smaller services pushes you towards building ad deploying more services many of which need the High Availability provided by the redundancy but not the price tag associated with it.
Docker is absolutely a must-have if you are doing Microservices or you are paying through the nose for your cloud costs. In QCon London 2015, John Wilkes from Google explained how they "start over 2 billion containers per week". In fact, to be able to take advantage of the spare resources on the VMs, they tend to mix their Production and Batch processes. One difference here is that the Live processes require locked allocated resources while the Batch processes take whatever is left. They analysed the optimum percentages minimising the errors while keeping utilisation high.
Containerisation and availabilityAs we discussed, optimising utilisation becomes a big problem when you have many many services - and their multiple versions - to run. But what would that mean in terms of Availability? Does containerisation improve or hinder your availability metrics? I have not been able to find much in the literature but as I will explain below, even if you do not have small services requiring VM co-location, you are better off co-locating and spreading the service onto more machines. And it even helps you achieve higher utilisation.
By having spreading your architecture to more Microservices, availability of your overall service (the one the customer sees) is a factor of availability of each Microservice. For instance, if you have 10 Microservices with availability of 4 9s (99.99%), the overall availability drops to 3 9s (99.9%). And if you have 100 Microservice, which is not uncommon, obviously this drops to only two 9s (99%). In this term, you would need to strive for a very high Microservice availability.
Hardware failure is very common and for many components it goes above 1% (Annualised Failure Rate). Defining hardware and platform availability in respect to system availability is not very easy. But for simplicity and the purpose of this study, let's assume failure risk of 1% - at the end of the day our resultant downtime will scale accordingly.
If service A is deployed onto 3 VMs, and one VM goes down (1%), other two instances will have to bear the extra load until another instance is spawned - which will take some time. The capacity planning can leave enough spare resources to deal with this situation but if two VMs go down (0.01%), it will most likely bring down the service as it would not be able cope with the extra load. If the Mean Time to Recovery is 20 minutes, this alone will dent your service Microservice availability by around half of 4 9s! If you have worked hard in this field, you would know how difficult it is to gain those 9s and losing them like that is not an option.
So what's the solution? This diagram should speak for more words:
|Service A and B co-located in containers, can tolerate more VM failures
By using containers and co-locating services, we spread instance more thinly and can tolerate more failures. In the example above, our services can tolerate 2 or maybe even 3 VM failures at the same time.