Friday, 20 March 2015

Spirit of Open Source, are you ready for it?

[Level C1]

Companies and their trademarks have an existence in society. They exist with a unique identity more or less similar to an individual - and companies try to associate their identity with a positive picture. Like individuals, companies get our respect - or disrespect - mainly by their behaviours and actions. Similar to an individual, they build relationships, trust and rapport. This can make the individual react in a way that is beyond the usual contract (e.g. buying a product or keeping a subscription) - it can make us react with loyalty, evangelism and, as we know, fanboyism. Or it can make us do all the same in the negative fashion: pick up a banner and dedicate our time and resources against them.

This stuff fascinates me - really really fascinates me.  The public profile of a company might not exactly correlate to its good or evil nature - company-product-consumer is a very complex relationship. Now in the software world, this gets somehow even more serious - mainly because of the era we live in. And I am sure there is a wealth of literature (as well as marketing mumbo jumbo) but let's bring this home to a tangible example: Microsoft.

Microsoft had been viewed as a company with occasional predatory behaviour that makes software that is often buggy and sometimes badly designed - and I suppose that is putting it mildly from the community's point of view. I am personally paid for working mainly on the Microsoft stack, so although I do not have an affiliation, I would like Microsoft to thrive. And I have my gripes too, hence I decided to diversify and be able to work on the Linux/Python stack as well. The reality is that Microsoft has changed a great deal over the last 1-2 years. Adoption of Open Source, open sourcing technologies, supporting its main cash cow (aka Office) on Mac and Linux are all completely unbelievable even for those close to Microsoft. And there is the promise of more change to come.

Has the public profile of the company changed since? I think the answer is No, but why? What else should Microsoft do? This is the question that has been bugging me and that I asked on HackerNews. Time will tell, but I think one major problem is the internal culture - as Satya Nadella rightly puts it: "... you have to have new capability and culture to go after those new concepts". Indeed, culture eats strategy for breakfast.

Now, open sourcing is not the same as making the code available publicly. In fact, distributing the code publicly is the trivial first step. I am not going to lecture on what Open Source is or is not - and I do not think I quite qualify for that - but obviously interaction with the community and taking meaningful contribution from the community are the next steps. And then there is the willingness and empathy to solve issues community is dealing with.

But perhaps above all, open sourcing makes you vulnerable. It exposes you. And you have got to have the guts to face the criticism you get from the community. And there is spamming and trolling, you have got to deal with it gracefully and intelligently.

The .NET community had traditionally been quite behind other communities in Open Source. A lot has changed over the last 5 years and there are numerous vibrant and healthy open source projects out there - and I am eternally grateful that I have been part of a few over the last 3 years. While we are catching up with regard to the repos, those edgy sides of Open Sourcing are also catching up with us.

You might have heard that .NET's fairly unpopular (again, to put it rather mildly) build tool called msbuild hit GitHub and was open sourced by Microsoft. Some welcomed the move, while for others, who during their career had to battle with the tool and wasted countless hours trying to get the build working, it only brought back bad memories. Deficiencies in msbuild made some build their own build tools, and for many more it contributed to their leaving .NET altogether (TFS, I am also talking to you!). And as I mentioned, it made Rob Ashton, a colourful yet prolific character in the .NET/polyglot community, submit a pull request that replaced msbuild with make!

Honestly, I am not a drama guy and have never met Rob personally. Judge for yourself if this is going too far or is just light humour - personally, considering it is msbuild, I take it as humour. But I know this: if you have built a tool that has made many people cry over the years, either you do not open source it or you accept the consequences and apologetically work with the community to gain back their trust.

As for the countless insults to Rob, they are not acceptable - the comment from the repo admin himself (Microsoft staff) was gentle and graceful. I have heard that Rob has deleted his Twitter account after numerous insults, with people calling him names and harassing him on HackerNews. If it is true, it is a real loss. And I feel Microsoft, if it wants, can win a lot of trust brownie points by publicly commenting on the issue. It should also be careful with what it open sources. Don't you even think about open sourcing TFS...

Tuesday, 10 March 2015

QCon London 2015: from hype to trendsetting - Part 1

Level [C3]

This year I could make it to QCon London and I felt it might be useful to write up a summary for those who would have liked to be there but could not make it for any reason. This will also be an opportunity to get my head together and summarise a couple of themes inspired by the conference.

Quality of the talks was varied: initially pretty disappointing on the first day, but rising to a real high on the last day. Not surprisingly, Microservices and Docker were the buzzwords of the conference and many of the talks had one or the other in their title. It was as if the hungry folks were being presented Microservices with ketchup, next it would be with mayonnaise, and yet nothing was as good as Docker with salsa. In fact it is very easy to be sceptical and sarcastic about Microservices or Docker and disregard them as pure hype.

After listening to the talks, especially the ones on the last day, I was convinced that with or without me, this train is set to take the industry forward. Yes, the granularity of Microservices (MS) has not been crisply defined yet, and there is a stampede to download and install Microservices on old systems and fix the world. The industry will abuse it, as it reduced SOA to Web Services by adding just a P to the end. Yes, there are very few people talking about the cost of moving to MS and exploring the cases where you should stay put. But if your Monolith (even though it pays lip service to SOA) has ground the development cycle to a halt and is killing you and your company, there is a thing or two to learn here.

Disclaimer: This post by no means is a comprehensive account of the conference. This is my personal take on QCon London 2015 and topics discussed, peppered with some of my own views, presented as a technical writing.


Yeah I know you are fed up with hearing the word - but bear with me for a few minutes. Microservices reminded me of my past life: it is a syndrome. A medical syndrome when it is first being described, does not have to have the Aetiology and Pathophysiology all clear and explained - it is just a syndrome, a collection of signs and symptoms that occur together. In the medical world, there could be years between describing a syndrome and finding what and why.

And this is what we are dealing with here, in a different discipline: Microservice is an emerging pattern, a solution to a contextual problem that has indeed occurred. It is a phenomenon that we are still trying to figure out - a lot of head scratching is going on. So bear with it, and I think we are in for a good ride beyond all the hype.

Its two basic benefits are mainly: smaller deployment granularity enabling you to iterate faster and smaller domain to focus, understand and improve. For me the first is the key.

So here is a breakdown of a few key aspects of Microservices.

Conway, Conway, Where Art Thou

A recurring theme (at points, ad nauseam) was that MS is the result of reversing cause and effect in Conway's law and using it to your advantage: build smaller teams and your software will take their shape. So in essence, you turn Conway's law on its head and use it as a tool to naturally arrive at a more loosely coupled architecture.

This is by no means new: Amazon has been doing this for a decade. The size of the teams is nicely defined by Jeff Bezos as "Two Pizza Teams". But what is the makeup of these teams and how do they operate? As again described by Amazon, they are made up of the elements of a small company, a start-up, including developers, testers, BA, business representative and, more importantly, operations, aka Devops.

Another point stressed by Yoni Goldberg from Gilt and Randy Shoup was that the teams charge other teams for using their services and need to look after their finances. They found that doing this reduced the costs of the team by 90% - mainly due to optimising cloud and computing costs.

Granularity: "fits in my head" (does it?)

One of the key challenges of Microservices was to define the granularity of a Microservice, differentiating it from traditional SOA. And it seems we have now come up with a definition: "its complexity fits one's head".

What? This to me is a non-definition and, on any account, a poor definition (sorry Dan). After all, there is nothing more subjective than what fits one's head, is there? And whose head, by the way? If it is mine, I cannot keep track of what I ate for breakfast and lunch at the same time (if you know me personally, you must have noticed my small head), and then we get those giants who can master several disciplines or understand the whole of an uber-domain.

One of the key properties of a good definition is that it is tangible, unambiguous and objectively prescriptive. Jeff Bezos was not necessarily a Pizza lover to use it to define Amazon team sizes.

In the absence of any tangible definition, I am going to bring my own - why not? This is really how I feel like the granularity of a MS should be, having built one or two, and I am using tangible metrics to define it.

Granularity of Microservices - my definition

As it is evident, Cross-cutting concerns of a Microservice are numerous. From security, availability, performance to routing, versioning, discovery, logging and monitoring. For a lot of these concerns, you can rely on the existing platform or common platform-wide guidelines, tools and infrastructure. So the crux of the sizing of the Microservice is its core business functionality, otherwise with regard to non-functional requirements, it would share the same concerns as traditional services.

When not to Microservice

Yoni Goldberg from Gilt covered this subject to some level. He basically said do not start with Microservices; build them when your domain complexity warrants it. He went through his own experience and how they improved upon the ball of mud to nice discrete services and then how they exploded the number of services when their
So the takeaways (with some personal salt and pepper), I would say, are: do NOT consider Microservices if:
  • you do not have the organisation structure (small cross functional teams)
  • you are not practising Devops, automated build and deployment
  • you do not have (or cannot have) an uber monitoring system telling you exactly what is happening
  • you have to carry along a legacy database
  • your domain is not too big

Microservices is an evolutionary process

Randy Shoup explained how the process towards Microservice has been an evolutionary one, usually starting with the Monolith. So he stressed "Evolution, not intelligent design" and how in such an environment, Governance (oh yeah, all ye Enterprise Architects listen up) is not the same as traditional SOA and is decentralised with its adoption purely based on how useful a practice/ is.

Optimised message protocols now a must

Mentioned in more than a couple of talks, moving to ProtoBuf, Avro, Thrift or similar seems to be a must in all but trivial Microservice setups. One of the main performance challenges in MS scenarios is network latency and the cost of serialisation/deserialisation over and over across multiple hops, and JSON simply does not cut it anymore.

Source: Thrift vs Protobuf comparison (
Be ready to move your APIs to these message protocols - yes, you lose some of the benefits of simplicity, but trading them off for performance is a necessary evil. Rest assured nothing stops you from using JSON while developing and testing, but if your game is serious, start changing your protocols now - I am too, with an item already added to the technical backlog.
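To get a feel for why the wire format matters, here is a minimal sketch using only the Python standard library. It does not use ProtoBuf, Avro or Thrift: the fixed `struct` layout is just a stand-in I am assuming for a schema-based binary protocol, and the record and its field names are made up for illustration.

```python
import json
import struct

# Hypothetical record, for illustration only.
record = {"user_id": 123456, "item_id": 987654, "score": 0.75}

# Text encoding: the field names travel with every single message.
json_bytes = json.dumps(record).encode("utf-8")

# Fixed binary layout: two unsigned 32-bit ints and one 64-bit float,
# little-endian, no padding. The schema lives in code, not on the wire.
LAYOUT = struct.Struct("<IId")
binary_bytes = LAYOUT.pack(record["user_id"], record["item_id"], record["score"])


def decode(payload: bytes) -> dict:
    """Rebuild the record from the fixed binary layout."""
    user_id, item_id, score = LAYOUT.unpack(payload)
    return {"user_id": user_id, "item_id": item_id, "score": score}


# The binary payload is 16 bytes; the JSON payload is several times larger.
print(len(json_bytes), len(binary_bytes))
```

The real protocols add schema evolution, optional fields and code generation on top of this, which is exactly the simplicity you trade away. The size difference is multiplied by every hop a request makes.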

What I was hoping to hear about and did not

Microservice registry and versioning best practices were not mentioned at all. I tried to quiz a few speakers on these but did not quite get a good answer. I suppose the space is up for grabs.

Need for Composition Services/APIs

As experienced personally, in an MS environment you end up with two different types of services: Functional Microservices, which own their data and are the authority in their business domain, and Composition APIs, which do not normally own any data and bring value by composing data from several other services - normally involving some level of business logic affecting the end user. In DDD terms, you could find some similarity with Facade services, and Yoni used the word "mid-tier services".

Composition services can bring a lot of value when it comes to caching, pagination of data and enriching the information. They practically scatter the requests and gather the results back and compose the result - Fan-out is another term used here.

By inherently depending on many services, they are notoriously susceptible to performance outliers (to be discussed in the second post) and to failure scenarios, which might warrant a layered cache backed by soft storage with a longer expiry, as a fallback in case a dependent service is down.
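The scatter/gather idea with a fallback can be sketched in a few lines. This is only an illustration under my own assumptions: the two downstream functions, the `call_with_fallback` and `compose` helpers and the dictionary standing in for the soft storage are all hypothetical, and a real composition API would make HTTP calls with timeouts.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical downstream calls; real ones would be network requests.
def fetch_profile(user_id):
    return {"name": "Alice"}


def fetch_orders(user_id):
    raise ConnectionError("orders service is down")


# Stand-in for the fallback cache with a long expiry: a plain dict
# pre-populated with a stale copy of the orders data.
fallback_cache = {("orders", 42): {"orders": ["stale-order-1"]}}


def call_with_fallback(name, func, user_id):
    """Call one dependency; on failure serve the last known good copy."""
    try:
        result = func(user_id)
    except Exception:
        cached = fallback_cache.get((name, user_id))
        source = "fallback" if cached is not None else "missing"
        return cached, source
    fallback_cache[(name, user_id)] = result  # refresh the fallback copy
    return result, "live"


def compose(user_id, services):
    """Fan out to all dependencies in parallel and gather the results."""
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        futures = {
            name: pool.submit(call_with_fallback, name, func, user_id)
            for name, func in services.items()
        }
        return {name: future.result() for name, future in futures.items()}


SERVICES = {"profile": fetch_profile, "orders": fetch_orders}
page = compose(42, SERVICES)
print(page)
```

Tagging each part with where it came from ("live", "fallback" or "missing") lets the caller decide whether a stale answer is good enough for the end user.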

In the next post, we will look into topics below. We will discover why Docker in fact is closely related to the Microservices - and it is not what you think! [Do I qualify now to become a BusinessInsider journalist?]
  • Those pesky performance outliers
  • Containers, containers
  • Don't beat the dead Agile
  • Extra large memory computing is now a thing

Thursday, 1 January 2015

Future of Programming - Rise of the Scientific Programmer (and fall of the craftsman)

Level [C3]

[Disclaimer: I am by no means a Scientific Programmer but I just want to become one] It is the turn of yet another year and the time is ripe for the last year reviews, predictions for the new year and its resolutions. Last year I made some bold statements and made some radical decisions to start transitioning. I picked up a Mac, learnt some Python and Bash and a year on, I think it was good and really enjoyed it. Still (as I predicted), I spent most of my time writing C#. [working on a Reactive Cloud Actor micro-Framework, in case for any reason it interests you]. Now a year on, Microsoft is a different company: new CEO, moving towards Open Source and embracing non-Windows operating systems. So how it is going to shift the innovation imbalance is a wait-and-see. But anyway, that was last year and is behind us.

Now let's talk about 2015. And perhaps programming in general. Are you sick of hearing Big Data buzzwords? Do you believe Data Science is a pile of mumbo jumbo to bamboozle us and actually used by a teeny tiny number of companies, and producing value even less? IoT is just another hype? I hope by reading the below, I would have been able to answer you. Sorry, no TL;DR

*     *     *

It was a warm, sunny and all around really nice day in June. The year is 2007 and I am on a University day trip (and punting) to Cambridge along with my classmates many of whom are at least 15 years younger than me. Punting is fun but as a part time student this is one of the few times I have a leisurely access to our Image Processing lecturer - a bright and young guy - again younger than me. And I open the discussion with how we have not moved much since the 80s in the field of Artificial Intelligence. We improve and optimise algorithms but there is no game-changing giant leap. And he argues the state of the art usually improves little by little.

"Day out punting in cambridge"

The following year, we work on a project involving some machine learning to recognise road markings. I spend a lot of time on feature extraction and use a 2-layer Neural Network since I get better results out of it than with 3 layers. I am told not to use many layers of neurons as training usually gets stuck in a local minimum - I actually tried and saw it. Overall the result was OK but it involved many pre- and post-processing techniques to achieve acceptable recognition.

*     *     *

I wake up and it is 2014. Many Universities, research organisations (and companies) across the world have successfully implemented Deep Learning using Deep Neural Networks - which have many layers of neurons. Watson answers all the questions in Double Jeopardy. Object Recognition from image is almost a solved case - with essentially no feature extraction.

A Deep Neural Network
Perhaps my lecturer was right: with improved training algorithms and lots and lots of labelled data, we suddenly have a big leap in science (or was I right?!). It seems that for the first time implementation has got ahead of the mathematics: we do not fully understand why Deep Learning works - but it works. And when these networks fail, we still don't know why they fail.

And guess what, industry and the academia have not been this close for a long time.

And what has all this got to do with us? Rise of the machine intelligence is going to change programming. Forever.

*     *     *

Honestly, I am sick of the amount of bickering and fanboyism that goes on today in the programming world. The culture of "nah... I don't like this" or "ahhh... that is s..t" or "ah that is a killer" is what has plagued our community. One day Angular is super hot, next week it is the worst thing. Be it zsh or Bash. Be it vim vs. Emacs vs. Sublime Text vs. Visual Studio. Be it Ruby, Node.js, Scala, Java, C#, you name it. And the same goes for technologies such as MongoDB, Redis... subjectivism instead of facts. As if we forgot we came from a line of scientists.

Like children we get attached to new toys and with the attention span of a goldfish, instead of solving real world problems, ruminate over on how we can improve our coding experience. We are ninjas and what we do no one can do. And we can do whatever we want to do.

"I have got power"

Yes, we are lucky. A 23-year old kid with a couple of years of programming experience can earn double what a 45-year old retail manager with 20 years of experience earns annually. And what do we do with that money? Spend all of it on booze, specialty burgers, travelling and conferences, gadgets - basically whatever we want to.

But those who remember the first .com crash, can tell you it has not always been like this. In fact, back in 2001-2002 it was really hard to get a job. And the problem was, there were many really good candidates. IT industry became almost impenetrable since there was this catch-22 of requiring job experience to get the job experience. But anyway, the good ones, the stubborn ones and those with little talent but a lot of passion (includes me) stayed on for the good days that we have now. Reality was many programmers of the time had read "Access in 24 hours" and landed a fat salary in a big company. And on the other hand, projects were failing since we spent most of our time writing documentation. The industry had to weed out bad coders and inefficient practices.

And we have software craftsmanship movement and agile practices.

*     *     *

The opposition has already started. You might have seen the discussions DHH has had with Kent Beck and Martin Fowler on TDD. I do not agree 100% with what Erik Meijer says here (only 90%) but there is a lot of truth in it. We have replaced a fact-based, data-backed attitude with a faith-based wishy-washy peace-hug-freedom hippie agile way, forcing us mechanically to follow some steps and believe that it will be good for us. Agile has taken us a long way from where we started at the turn of the century, but there are problems. From personal experience, I see no difference in quality between developers who do TDD and those who do not. And to be frank, I actually see a negative effect: people who do TDD do not think hard enough about the consequences of the code they write - I know this could be inflammatory but hand on heart, that is my experience. I think TDD and agile have given us a safety net and, like a tightrope walker who improves the safety net instead of focusing on the walking technique, we keep improving the net. As long as we go through the motions, we are safe. Unit tests, coverage, planning poker, retrospective, definition of done, Story, task, creating tickets, moving tickets. How many bad programmers have you seen that are masters of agile?

You know what? It is the mediocrity we have been against all the time. Mediocre developers who in the first .com boom got into the market by taking a class or reading a book are back in a different shape: those who know how to be opinionated, look cool, play the game and take the paycheck. We are in another .com boom now, and if there is a crash, sadly they are out - even if it includes me.

*     *     *

I think we have neglected the scientific side of our jobs. Our maths is rusty and those who did study ComSci do not remember a lot of what they read. We cannot calculate the complexity of our code and fall into the trap of thinking that machines are fast now - yes, it didn't matter for a time, but what about when you are dealing with petabytes of data and pay by processing hours? When our team first started working on recommendations, the naive implementation took 1000 nodes for 2 days; now the implementation uses 24 nodes for a few hours, and perhaps this is still way way too much.

"we are craftsmen and craftswomen"

But really, since when did our job look like that of a craftsman (a carpenter)? We are Ninjas? And we do code Kata to keep our skills/swords sharp. This has all gone too far into the world of fantasy. The world of warcraft. This is now a New Age full-blown religion.

What utter rubbish.

*     *     *

Now back on earth, languages of the 90s and early 2000 are on the decline. Java, C#, C++ all on the decline. But they are being replaced by other languages such as Scala right? I leave that to you to decide based on the diagram below. 
Google trends of "Java", "Scala", "C#" and "Python Programming" (so that it does not get mixed up with Python the snake) - source: google
The only counter trend is Python. The recent rise in Python popularity is what I call "rise of the scientific programmer" - and that is just one of the signs. Python is a very popular language in the academic space. It is easy to pick up, works everywhere and has some functional aspects making it terse. But that is not all: it sits on top of a huge wealth of scientific libraries and it can talk to Java and C as well. Industry innovations have started to come straight from the Universities. We have gone from the early 2000s, when academia seemed completely irrelevant, to now, when it leads the innovation. PySpark has come fully from the heart of the University of California, Berkeley. Many of the contributors to the Hadoop code and its wide ecosystem are in academia.

We are now in need of people who can scientifically argue about algorithms and data (is coding anything but code+data?) and most of them could implement an algorithm given the paper or mathematical notation. And guess what, this is the trend for jobs with "Machine Learning":
Trend of jobs containing "Machine Learning" - Source: ITJobsWatch

And this is really not just Hadoop. According to the source above, Machine Learning jobs had a 41% rise from 2013 to 2014 while Hadoop jobs had only 16%.

This Deep Learning thing is real. It is already here. All those existing algorithms need to be polished and integrated with the new concepts and some will be just replaced. If you can give the interactions of a person with a site to a deep network, it can predict with high confidence whether they are gonna buy, leave or stay undecided. It can find patterns in diseases that we as humans cannot. This is what we were waiting for (and we were afraid of?). Machine intelligence is here.

The scientific Programmer [And yes, it has to know more]

Now one might say that the answer is the Data Scientists. True. But first, we don't have enough of them and second, based on first hand experience, we need people with engineering rigour to produce production ready software - something that certainly some Data Scientists have but not all. So I feel that a programmer turned Statistician can build more robust software than the other way around. We need people who understand what it takes to build software that you can put in front of millions of customers to use. People who understand linear scalability, SLA, monitoring and architectural constraints.

*     *     *

Horizon is shifting.

We can pick a new language (be it Go, Haskell, Julia, Rust, Elixir or Erlang) and start re-inventing the wheel and start from pretty much the same scratch again because hey, this is easy now, we have done it before and don't have to think. We can pick a new albeit cleaner abstraction and re-implement thousands of hours of hard work and sweat we and the community have suffered - since hey, we can. We can rewrite the same HTTP pipeline in 1000s of different ways and never be happy with what we have achieved, be it Ruby on Rails, Sinatra, Nancy, ASP.NET Web API, Flask, etc. And keep happy that we are striving for that perfection, that unicorn. We can argue about how to version APIs and how one service is so RESTful and another is not. We can mull over the pettiest of things such as a semicolon or the gender of a pronoun and let insanely clever people leave our community. We can exchange the worst of words over "females in the industry" while we more or less are saying the same thing. Too much drama.

But soon this will be no good. Not good enough. We got to grow up and go back to school, relearn all about Maths, statistics, and generally scientific reasoning. We need to man up and re-learn that being a good coder has nothing to do with the number of stickers you have at the back of your Mac. It is all scientific - we come from a long line of scientists, we have got to live up to our heritage.

We need to go and build novelties for the second half of the decade. This is what I hope to be able to do.

Saturday, 29 November 2014

Health Endpoint in API Design: slippery slope that it is

Level [C3]

Health Endpoint is a common practice in building APIs. Such an endpoint, unlike other resources of a REST API, does not achieve a business activity; instead it returns the status of the service, and while it can gather and return some data, it is the HTTP status that defines whether the service is "Up or Down". These endpoints commonly go and check a bunch of configurations and connectivity with the dependent services, and even make a few calls for a "Test Customer" to make sure business activity can be achieved.

There is something above that just doesn't feel right to me - and this post is an exercise to define what I mean by it. I will explain what are the problems with the Health API and I am going to suggest how to "fix" it.

What is the health of an API anyway? The server up and running and capable of returning the status 200? Server and all its dependencies running and returning 200? Server and all its dependencies running and capable of returning 200 in a reasonable amount of time? API able to accomplish some business activity? Or API able to accomplish a certain activity for a test user? API able to accomplish all activities within reasonable time? API able to accomplish all activities with its 95th percentile falling within an agreed SLA?

A Service is a complex beast. While its complexity would be nowhere near a living organism, it is useful to draw a parallel with a living organism. I remember from my previous medical life that the definition of health - provided by none other than WHO - would go like this:
"Health is a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity."
In other words, defining the health of an organism is a complex and involved process requiring deep understanding of the organism and how it functions. [Well, we are lucky that we are only dealing with distributed systems and their services (or MicroServices if you like) and not living organisms.] For services, instead of health, we define the Quality of Service as a quantitative measure of a service's health.

Quality of Service is normally a bunch of orthogonal SLAs, each defining a measurement for one aspect of the service. In terms of monitoring, Availability of a service is the most important aspect to gauge and closely follow. Availability of the service cannot simply be measured by the amount of time the servers dedicated to a service have been up. Apart from being reachable, the service needs to respond within acceptable time (Low Latency) and has to be able to achieve its business activity (Functional) - there is no point in the server being reachable and returning a 503 error within milliseconds. So the number of error responses (as a deviation from the baseline, which can be normal validation and business rule errors) also comes into play.

So the question is: how can we expose an endpoint inside a service that can aggregate all the above facets and report the health of the service? The simple answer is that we cannot, and should not commit ourselves to doing it. Why? Let's take some simple help from algebra.
API/Service maps an input domain to an output domain (codomain). Also availability is a function of the output domain.

A service (f) is basically a function that maps the input domain (I) to an output domain (O). So:
O = f(I)
The output domain is a set of all possible responses with their status codes and latencies. Availability (A) is a function (a) of the output domain since it has to aggregate errors, latencies, etc:
A = a(O)
So in other words:
A = a(f(I))
This means A cannot be measured without I - which for a real service is a very large set. It also needs all of f - not your subset bypass-authentication-use-test-customer method.
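To make A = a(f(I)) tangible, here is a toy sketch. Everything in it is assumed for illustration: the service `f` that fails for one region, the 200 ms latency budget inside `a`, and the made-up input domain `I` of 100 customers.

```python
def f(request):
    """Hypothetical service: fails for customers in one region."""
    if request["region"] == "eu":
        return {"status": 503, "latency_ms": 5}
    return {"status": 200, "latency_ms": 40}


def a(outputs, max_latency_ms=200):
    """Availability: share of responses that succeed within the budget."""
    good = [
        o for o in outputs
        if o["status"] < 500 and o["latency_ms"] <= max_latency_ms
    ]
    return len(good) / len(outputs)


# The input domain I: 100 customers, every fourth one in the failing region.
I = [{"customer": i, "region": "eu" if i % 4 == 0 else "us"} for i in range(100)]

# The subset a "Test Customer" health check exercises.
test_customer = [{"customer": "test", "region": "us"}]

health_check_view = a([f(r) for r in test_customer])  # 1.0
real_availability = a([f(r) for r in I])              # 0.75

print(health_check_view, real_availability)
```

The test customer reports a perfectly healthy service while a quarter of the real input domain gets a 503 within milliseconds.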

So one approach is to sit outside the service and only deal with the output domain, in a sort of proxy or by monitoring server logs. Netflix have done a ton of work on this and have open sourced it as Hystrix, and no wonder I have not heard anything about the magical Health Endpoint in there (now there is an alternative endpoint which I will explain later). But if you want to do it within the service, you need all of the input domain and not just your "Test Customer" to make assertions about the health of your service. And this kind of assertion is not just wrong, it is dangerous, as I am going to explain.

First of all, gradually - especially as far as the ops are concerned - that green line on the dashboard that checks your endpoint becomes your availability. People get used to trusting it, and when things go wrong out there and customers jump and shout, you will not believe it for quite a while because your eye sees that green line and trusts it.

And guess what happens when you have such an incident? There will be a post-mortem meeting and all the tie-and-suits will be there, and they will identify the root cause as the faulty health-check, and you will be asked to go back and fix your Health Check endpoint. And then you start building more and more complexity into your endpoint. Your endpoint gets to know about each and every dependency and all their intricacies. And before you know it, you have built a complete application beside your main service. And you know what, you have to do it for each and every service, as they are all different.

So don't do it. Don't commit yourself to what you cannot achieve.

So is there no point in having a simplistic endpoint which tells us basic information about the status of the service? Of course there is. Such information is useful and many load balancers or web proxies require such an endpoint.

But first we need to make absolutely clear what the responsibility of such an endpoint is.

Canary Endpoint

A canary endpoint (the name is courtesy of Jamie Beaumont) is a simplistic endpoint which gathers connectivity status and latency of all dependencies of a service. It absolutely does not trigger any business activity, there is no "Test Customer" of any kind and is not a "Health Endpoint". If it is green, it does not mean your service is available. But if it is red (your canary is dead) then you definitely have a problem.

So how does a canary endpoint work? It basically checks connectivity with its immediate dependencies - including but not limited to:
  • External services
  • SQL Databases
  • NoSQL Stores
  • External distributed caches
  • Service brokers (RabbitMQ, Azure Service Bus)
A canary result contains the name of the dependency, the latency and the status code. If any of the results has a non-success code, the endpoint returns a non-success code. The status code returned is used by simple callers such as load balancers. In all cases, we also return a payload, which is the aggregated canary result. Such results can be used to feed various charts and to draw heuristics about the significance of variability in the latencies.

External services are a bit different from the other items in the list. The reason is that if an external service has a canary endpoint itself, then instead of just a connectivity check we call its canary endpoint and add its aggregated result to the result we are returning. So usually the entry point API will generate a cascade of canary chirps that will tell us how things are.
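To make the shape of a canary result a bit more concrete, here is a minimal sketch in Python (chosen only for brevity). The names `CanaryResult` and `aggregate`, the field names and the use of 200/500 as the success and failure codes are all my assumptions for illustration, not part of any existing library.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class CanaryResult:
    """One connectivity check: which dependency, how long it took, how it ended."""
    name: str
    latency_ms: float
    status_code: int
    # Results returned by the dependency's own canary endpoint, if it has one.
    children: List["CanaryResult"] = field(default_factory=list)

    def is_success(self) -> bool:
        # Treat any 2xx code as success; a failing child fails the parent too.
        own_ok = 200 <= self.status_code < 300
        return own_ok and all(child.is_success() for child in self.children)


def aggregate(service_name: str, results: List[CanaryResult]) -> dict:
    """Build the response: a status code for simple callers plus the full payload."""
    healthy = all(result.is_success() for result in results)
    return {
        "service": service_name,
        "status_code": 200 if healthy else 500,
        "results": [asdict(result) for result in results],
    }


# Hypothetical service with one healthy and one unreachable dependency.
response = aggregate("orders-api", [
    CanaryResult("sql-database", 3.2, 200),
    CanaryResult("cache", 150.0, 503),
])
print(response["status_code"])  # 500, because the cache check failed
```

A load balancer would only look at the status code, while a dashboard could chart the latencies in the payload.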

Implementation of the connectivity check is generally dependent on the underlying technology. For a Cache service, it suffices to Set a constant value and see it succeed. For a SQL Database a SELECT 1; query is all that is needed. For an Azure Storage account, it would be enough to connect and get the list of tables. The point here is that none of these are anywhere near a business activity, so that you could not - in the remotest sense - think that its success means your business is up and running.
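As a rough illustration of how trivial these checks are meant to be, here is a small Python sketch. It uses an in-memory sqlite3 database and a plain dict as stand-ins for a real SQL database and a distributed cache, and the helper names (`timed_check`, `sql_check`, `cache_check`) are made up for this example.

```python
import sqlite3
import time


def timed_check(name, check):
    """Run a connectivity check and record its latency and outcome."""
    started = time.perf_counter()
    try:
        check()
        status = 200
    except Exception:
        # Any failure to reach the dependency counts as a dead canary.
        status = 500
    latency_ms = (time.perf_counter() - started) * 1000
    return {"name": name, "latency_ms": latency_ms, "status_code": status}


def sql_check(connection):
    """SELECT 1 proves the database answers without touching business data."""
    def check():
        row = connection.execute("SELECT 1").fetchone()
        if row != (1,):
            raise RuntimeError("unexpected reply from database")
    return check


def cache_check(cache):
    """Set a constant value and see it succeed."""
    def check():
        cache["canary"] = "chirp"
    return check


# sqlite3 and a dict stand in for a real SQL database and distributed cache.
db = sqlite3.connect(":memory:")
results = [
    timed_check("sql-database", sql_check(db)),
    timed_check("cache", cache_check({})),
]
for result in results:
    print(result["name"], result["status_code"])
```

Nothing here knows about customers, orders or any other business concept, which is exactly the point.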

So there you have it. Don't do health endpoints, do canary instead.

Canary Endpoint implementation

A canary endpoint normally gets implemented as an HTTP GET call which returns a collection of connectivity check metrics. You can abstract the logic of checking various dependencies in a library and allow API developers to implement the endpoint by just declaring the dependencies.

We are currently working on an implementation in ASOS (C# and ASP.NET Web API) and there is a possibility of open sourcing it.

Security of the Canary Endpoint

I am in favour of securing the Canary Endpoint with a constant API key - normally under SSL. This does not provide the highest level of security but it is enough to make it much more difficult to break into. At the end of the day, a canary endpoint lists all internal dependencies, components and potentially technologies of a system, which hackers can use to target components.
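A minimal sketch of such a key check in Python could look like the following. The header name `X-Canary-Key` and the function name are assumptions of mine; the only real library call is `hmac.compare_digest` from the standard library, which is designed to avoid leaking information through comparison timing.

```python
import hmac


def is_authorised(headers: dict, expected_key: str) -> bool:
    """Check the caller's key against the configured constant key."""
    # "X-Canary-Key" is a hypothetical header name used for illustration.
    supplied = headers.get("X-Canary-Key", "")
    # compare_digest is meant to resist timing attacks, unlike a plain ==.
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))


print(is_authorised({"X-Canary-Key": "s3cret"}, "s3cret"))  # True
print(is_authorised({}, "s3cret"))                          # False
```

The key itself would live in configuration and only ever travel over SSL.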

Performance impact of Canary Endpoint

Since a canary endpoint does not trigger any business activity, its performance footprint should be minimal. However, since calling the canary endpoint generates a cascade of calls, it might not be wise to iterate through all canary endpoints and just call them every few seconds, since deeper canary endpoints in a highly layered architecture get called multiple times in each round.
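To see how quickly the cascade adds up, here is a toy Python sketch. The four-service dependency graph is entirely hypothetical; the code simply counts how many times each canary endpoint is hit in one monitoring round.

```python
from collections import Counter

# Hypothetical layered architecture: each service lists the services
# whose canary endpoints its own canary endpoint calls.
DEPENDENCIES = {
    "web": ["orders", "customers"],
    "orders": ["inventory"],
    "customers": ["inventory"],
    "inventory": [],
}


def cascade(service, hits):
    """Record one canary call to `service` and the cascade it triggers."""
    hits[service] += 1
    for dependency in DEPENDENCIES[service]:
        cascade(dependency, hits)


def hits_per_round(entry_points):
    """Count canary calls per service when a monitor calls each entry point once."""
    hits = Counter()
    for service in entry_points:
        cascade(service, hits)
    return hits


# Calling only the top-level endpoint versus iterating over every endpoint.
print(hits_per_round(["web"]))
print(hits_per_round(list(DEPENDENCIES)))
```

In this toy graph the deepest service is hit five times per round when every endpoint is polled, against twice when only the entry point is called.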

Sunday, 19 October 2014

Performance Series - How poor performance of HttpContent.ReadAsAsync can affect your API/site

Level [T2]

This has been a revelation - what I am about to reveal here deeply surprised me, and it might surprise you too. This post is mainly about consuming RESTful APIs using HttpClient when the payload is JSON.

UPDATE: I got in touch with the ASP.NET team and they confirmed this as a performance bug which has now been fixed, but the fix is not yet available.

As you probably know, performance and benchmarking are very close to my heart, and I have recently been focusing on benchmarking a few APIs at work. One of my observations was that Web APIs/Web Sites which have historically been IO-bound now show signs of CPU strain and have become CPU-bound.

When you think logically about it, there is no magic here: by using async/await, you end up putting your CPU to some use, unlike the old times when the threads were blocked waiting for the IO to return and the CPU would be twiddling its thumbs. However, I found the CPU overhead of the operations excessive so I set out to benchmark a few different scenarios.

Test Setup

Two APIs were created where one was using the other. These two APIs were part of the same cloud service, which was deployed to two separate Medium (A2) web roles. I used 2 different deployments of the same code, one dependent upon version 4.0.30506.0 of the Web API and the other one on the latest version, which was 5.2.2. The difference between the two versions of the Web API is the topic of another post, but the differences were not huge, although the newer version showed improved performance.

The API being called returns a customer with its orders. Every customer has between 1 and 3 orders and each order between 1 and 3 items. In the long run, this randomisation evens out. Each document returned is between 1 and 2 KB. So the more superficial API makes one call to get the customer and then separately calls the deeper API once for each of that customer's orders. Then it combines the results and sends back the response. Both APIs are deployed in the same Azure Data Centre.

You can find the whole code at GitHub. The code takes 4 different approaches as below:

using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using System.Web.Http;
using Newtonsoft.Json;

public class CustomerController : ApiController
{
    // Scenario 1: WebClient, synchronous
    public FullCustomer GetSync(int id)
    {
        var webClient = new WebClient();
        var customerString = webClient.DownloadString(BuildUrl(id));
        var customer = JsonConvert.DeserializeObject<Customer>(customerString);
        var fullCustomer = new FullCustomer(customer);
        var orders = new List<Order>();
        foreach (var orderId in customer.OrderIds)
        {
            var orderString = webClient.DownloadString(BuildUrl(id, orderId));
            var order = JsonConvert.DeserializeObject<Order>(orderString);
            orders.Add(order);
        }
        fullCustomer.Orders = orders;
        return fullCustomer;
    }

    // Scenario 2: WebClient, asynchronous
    public async Task<FullCustomer> GetASync(int id)
    {
        var webClient = new WebClient();
        var customerString = await webClient.DownloadStringTaskAsync(BuildUrl(id));
        var customer = JsonConvert.DeserializeObject<Customer>(customerString);
        var fullCustomer = new FullCustomer(customer);
        var orders = new List<Order>();
        foreach (var orderId in customer.OrderIds)
        {
            var orderString = await webClient.DownloadStringTaskAsync(BuildUrl(id, orderId));
            var order = JsonConvert.DeserializeObject<Order>(orderString);
            orders.Add(order);
        }
        fullCustomer.Orders = orders;
        return fullCustomer;
    }

    // Scenario 3: HttpClient, asynchronous, deserialising with ReadAsAsync<T>
    public async Task<FullCustomer> GetASyncWebApi(int id)
    {
        var httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.Add("Accept", "application/json");
        var responseMessage = await httpClient.GetAsync(BuildUrl(id));
        var customer = await responseMessage.Content.ReadAsAsync<Customer>();
        var fullCustomer = new FullCustomer(customer);
        var orders = new List<Order>();
        foreach (var orderId in customer.OrderIds)
        {
            responseMessage = await httpClient.GetAsync(BuildUrl(id, orderId));
            var order = await responseMessage.Content.ReadAsAsync<Order>();
            orders.Add(order);
        }
        fullCustomer.Orders = orders;
        return fullCustomer;
    }

    // Scenario 4: HttpClient, asynchronous, reading a string and using JsonConvert
    public async Task<FullCustomer> GetASyncWebApiString(int id)
    {
        var httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.Add("Accept", "application/json");
        var responseMessage = await httpClient.GetAsync(BuildUrl(id));
        var customerString = await responseMessage.Content.ReadAsStringAsync();
        var customer = JsonConvert.DeserializeObject<Customer>(customerString);
        var fullCustomer = new FullCustomer(customer);
        var orders = new List<Order>();
        foreach (var orderId in customer.OrderIds)
        {
            responseMessage = await httpClient.GetAsync(BuildUrl(id, orderId));
            var orderString = await responseMessage.Content.ReadAsStringAsync();
            var order = JsonConvert.DeserializeObject<Order>(orderString);
            orders.Add(order);
        }
        fullCustomer.Orders = orders;
        return fullCustomer;
    }

    // Builds the URL of the deeper API for a customer or one of its orders
    private string BuildUrl(int customerId, int? orderId = null)
    {
        string baseUrl = string.Format("http://{0}:8080/api/customer/{1}", Request.RequestUri.Host, customerId);
        return orderId.HasValue
            ? string.Format("{0}/order/{1}", baseUrl, orderId.Value)
            : baseUrl;
    }
}

So as you can see, we use 4 different methods:

1) Using WebClient in the sync fashion
2) Using WebClient in the async fashion
3) Using HttpClient in the async fashion with ReadAsAsync on HttpContent
4) Using HttpClient in the async fashion with reading content as string and then using JsonConvert to deserialise

I used SuperBenchmarker to invoke the main API, which gathers the data from the other API. I ran the tool within the same Azure Data Centre from another machine (not one hosting either API) to make the tests more realistic yet eliminate network idiosyncrasies.

I used 5000 requests with concurrency of 10 - although I tried other numbers as well, which did not make any material difference in the results.


Here is the result for scenario 1 (sync using WebClient):

TPS:    394 (requests/second)
Max:    199ms
Min:    8ms
Avg:    25ms

50%     below 24ms
60%     below 25ms
70%     below 27ms
80%     below 28ms
90%     below 30ms
95%     below 32ms
98%     below 36ms
99%     below 55ms
99.9%   below 185ms

The result for scenario 2 (async using WebClient) usually shows better throughput but higher CPU:

TPS:    485 (requests/second)
Max:    291ms
Min:    5ms
Avg:    20ms

50%     below 19ms
60%     below 21ms
70%     below 23ms
80%     below 25ms
90%     below 27ms
95%     below 29ms
98%     below 32ms
99%     below 36ms
99.9%   below 284ms

The CPU difference is not huge and can be explained by the increased throughput:

CPU usage during Scenario 1 and 2

Now what surprised me greatly was the result of the third scenario (using HttpContent.ReadAsAsync<T>). Apart from CPU of 100% and signs of queueing, here is the result:

TPS:    41 (requests/second)
Max:    12656ms
Min:    26ms
Avg:    240ms

50%     below 170ms
60%     below 178ms
70%     below 187ms
80%     below 205ms
90%     below 256ms
95%     below 296ms
98%     below 370ms
99%     below 3181ms
99.9%   below 12573ms

Yeah, shocking. The diagram below compares CPU usage between scenario 1 and 3:

CPU usage in scenario 1 (arrow) and 3 (box)

Scenario 4 is definitely better and is not too far from scenario 1 and 2:

TPS:    230 (requests/second)
Max:    7068ms
Min:    7ms
Avg:    43ms

50%     below 20ms
60%     below 22ms
70%     below 24ms
80%     below 26ms
90%     below 29ms
95%     below 34ms
98%     below 110ms
99%     below 144ms
99.9%   below 7036ms

The CPU usage is around 80% and definitely worse than scenarios 1 and 2 (which requires further analysis).


Where is the problem? It appears that JSON Deserialization when reading from a stream is not efficient. It is possible that the JSON Deserialization has to optimise for memory efficiency rather than CPU efficiency since when the whole string is passed, it is surely much faster. 

Profiling proves that the problem is indeed JSON Deserialization:

Profiling scenario 3 shows that most of the CPU time is spent in JSON deserialisation

So in order to prove that, we do not have to invoke an API. The whole operation can be done inside a Console application. So I used the same code that was generating customers and orders. Here I am comparing ReadAsAsync<T> against reading the content as a string and deserialising it with JsonConvert:

private static void Main(string[] args)
{
    const int TotalRun = 10*1000;

    var customerController = new CustomerController();
    var orderController = new OrderController();
    var customer = customerController.Get(1);

    var orders = new List<Order>();
    foreach (var orderId in customer.OrderIds)
        orders.Add(orderController.Get(1, orderId));

    var fullCustomer = new FullCustomer(customer)
    {
        Orders = orders
    };

    var s = JsonConvert.SerializeObject(fullCustomer);
    var bytes = Encoding.UTF8.GetBytes(s);
    var stream = new MemoryStream(bytes);
    var content = new StreamContent(stream);

    content.Headers.ContentType = new MediaTypeHeaderValue("application/json");

    // Approach 1: let ReadAsAsync<T> deserialise from the content
    var stopwatch = Stopwatch.StartNew();
    for (int i = 1; i < TotalRun+1; i++)
    {
        var a = content.ReadAsAsync<FullCustomer>().Result;
        if(i % 100 == 0)
            Console.Write("\r" + i);
    }
    Console.WriteLine();
    Console.WriteLine(stopwatch.Elapsed);

    // Approach 2: read the content as a string, then use JsonConvert
    stopwatch.Restart();
    for (int i = 1; i < TotalRun+1; i++)
    {
        var sa = content.ReadAsStringAsync().Result;
        var a = JsonConvert.DeserializeObject<FullCustomer>(sa);
        if (i % 100 == 0)
            Console.Write("\r" + i);
    }
    Console.WriteLine();
    Console.WriteLine(stopwatch.Elapsed);
}



As expected, the result shows an incomparable difference, a factor of roughly 120.


So this result basically confirms what we have seen. I will get in touch with James Newton-King and try to shed more light on the subject.


HttpContent.ReadAsAsync on JSON payloads is really slow - in the order of 120x slower than JsonConvert. I guess it might have to do with the memory efficiency of reading from streams (keeping memory footprint at zero), but that is a guess, and I have been in touch with James Newton-King (creator of Json.NET) to get to the bottom of it.

In the meantime, if you know your content is not going to be huge and is always JSON, you might as well forget about content negotiation, read it as a string and then use JsonConvert to deserialise.

Thursday, 16 October 2014

SuperBenchmarker v0.4 released

Level [T2]

This is a quick shoutout on the release of version 0.4 of SuperBenchmarker, a Web and/or Web API performance benchmarking command line tool for Windows.

You might have heard about and used Apache Benchmark (ab.exe) in the past, which is a very useful tool, but on Windows it is very limited (e.g. it only supports GET and cannot make POST, PUT, etc. requests). SuperBenchmarker (sb.exe) supports PUT, DELETE, POST or any arbitrary method and allows you to parameterise the URL and headers using a data file or a .NET DLL plugin. The new feature is randomisation, which removes the need for any setup when all that is needed is random data.

Getting started

The best way to get SuperBenchmarker is to use the awesome Chocolatey, which is Windows' equivalent of the apt-get tool on Linux.

To get Chocolatey, just run this command in your PowerShell console (in Administrative mode):
iex ((new-object net.webclient).DownloadString(''))
And then install SuperBenchmarker in the command line shell:
c:\> cinst SuperBenchmarker
And now you are ready to load test:
c:\> sb -u
Note: if you are using Visual Studio's command line shell, you cannot use ampersand character (&) and you have to escape it using hat (^).

Using SuperBenchmarker

Normally you would define total number of requests and concurrency:
c:\> sb -u -c 10 -n 2000
The statement above runs 2000 requests with concurrency of 10. At the end, you are shown important metrics of the test:
Status 503:    1768
Status 200:    232

TPS: 98 (requests/second)
Max: 11271.1890515802ms
Min: 3.15724613377097ms
Avg: 497.181240820346ms

50% below 34.0499543844287ms
60% below 41.8178295863705ms
70% below 48.7612961478952ms
80% below 87.4385213898198ms
90% below 490.947293319644ms
So you get the breakdown of the statuses returned, TPS (transactions per second), and the minimum, maximum and average of the time taken. But more importantly, you get the percentiles, which really should be driving your performance SLAs (90% or 99%). [Never use the average for anything.]
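If you wonder why I am so harsh on the average, here is a small Python sketch with made-up response times. It uses the nearest-rank definition of a percentile, which is my choice for the example and not necessarily how SuperBenchmarker computes its figures.

```python
import math


def percentile(sorted_values, percent):
    """Nearest-rank percentile: smallest value with at least `percent`% of samples at or below it."""
    rank = math.ceil(percent * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


# Made-up response times in ms: most requests are fast, a few are very slow.
timings = sorted([30] * 90 + [500] * 8 + [11000] * 2)

average = sum(timings) / len(timings)
print("avg:", average)
for p in (50, 90, 99):
    print(f"{p}% below", percentile(timings, p))
```

The average comes out at 287ms, a figure that describes none of the requests: 90% of them finished within 30ms and the slowest ones took 11 seconds.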

In case you need to dig deeper, a log file gets created in the current directory with the name run.log which you can change using -l parameter:
c:\> sb -u -c 10 -n 2000 -l c:\temp\mylog.txt
The log file is a tab separated file which contains these columns: order number (based on the time started, not the time ended), status code, time taken in ms and then any custom parameters that you might have had - see below.
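Since the log is plain tab separated text, you can slice it with a few lines of script. Below is a sketch in Python that assumes the column order described above, a numeric status code and no header row (rows that do not parse are skipped); the sample content is made up.

```python
import csv
import io
from collections import Counter


def summarise_log(log_file):
    """Count status codes and collect timings from a tab separated run log."""
    statuses = Counter()
    timings = []
    for row in csv.reader(log_file, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        try:
            status, taken_ms = int(row[1]), float(row[2])
        except ValueError:
            continue  # e.g. a header row, if the file has one
        statuses[status] += 1
        timings.append(taken_ms)
    return statuses, timings


# Hypothetical log content: order number, status code, time taken in ms.
sample = io.StringIO("1\t200\t34.05\n2\t503\t490.95\n3\t200\t41.82\n")
statuses, timings = summarise_log(sample)
print(statuses[200], statuses[503], max(timings))
```

For a real run you would pass an open file instead, e.g. `open("run.log", newline="")`.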

Sometimes when running a test for the first time, something might not have been quite right in which case you can make a dry run/debug using -d parameter that makes a single request and the body of the response will be shown at the end. If you want to see the headers as well, use -h parameter.
c:\> sb -u -c 10 -n 2000 -d -h

Supplying request headers or a payload for POST, PUT and DELETE

In order to pass your tailored request headers, a template file needs to be defined which is basically the HTTP request template (minus the first line defining verb and URL and version):
c:\> sb -u -t template.txt
And the template.txt contains our custom headers (from the second line of the HTTP request):
User-Agent: SuperBenchmarker
MyCustomHeader: foo-bar;baz=biz
Please note that you don't have to provide headers such as Host and Content-Length - in fact it will raise errors. These headers will be automatically added by the underlying framework.

For using POST, PUT and DELETE we need to supply the verb parameter:
c:\> sb -u -v POST
But this request would require a payload as well which we need to supply. We use the template file to supply HTTP payload as well as any headers. Similar to an HTTP request, there must be an empty line between headers and body:
User-Agent: WhateverValueIWant
Content-Type: application/x-www-form-urlencoded


Parameterising your requests

Basically you can parameterise your requests using a CSV file containing values, your plugin DLL or by specifying randomisation.

You would define parameters in URL and headers (payload not yet supported but coming soon in 0.5) using SuperBenchmarker's syntax:
As you can see, we use three curly brackets to denote a parameter. For example the statement below defines a customerId parameter:
c:\> sb -u "{{{customerId}}}^&ignore=false"
Please note quoting the URL and use of ^ to escape & character - if you are using Visual Studio command prompt. To run the test successfully, you need to provide a CSV file containing customerId:
and use -f option to run the test:
c:\> sb -u "{{{customerId}}}&ignore=false" -f c:\mypath\values.csv
Alternatively, you can use a plugin DLL to provide values:
c:\> sb -u "{{{customerId}}}&ignore=false" -p c:\mypath\myplugin.dll
This DLL must have a single public class implementing the IValueProvider interface, which has a single method:
public interface IValueProvider
{
    IDictionary<string, object> GetValues(int index);
}
For every request, the implementation of the interface is called with the index of the request, and it returns a dictionary of field names with their respective values.

Now we have a new feature that in most cases removes the need for a CSV file or plugin, and that is the ability to set up a random value provider in the definition of the parameter itself:
c:\> sb -u "{{{customerId:RAND_INTEGER:[1000:2000]}}}&ignore=false"
The parameter above is set up to be filled by a random integer between 1000 and 2000.
Possible value types are:
  • String: using RAND_STRING. Will output random words
  • Date: using RAND_DATE (accepts range)
  • DateTime: using RAND_DATETIME (accepts range)
  • DateTimeOffset: using RAND_DATETIMEOFFSET which outputs ISO dates (accepts range)
  • Double: using RAND_DOUBLE (accepts range)
  • Name: using RAND_NAME. Will output random names
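To illustrate how the triple-curly-bracket syntax reads, here is a toy Python sketch that expands `{{{name}}}` and `{{{name:RAND_INTEGER:[min:max]}}}` placeholders. This is not how SuperBenchmarker is implemented; the `expand` function, the URL paths and the assumption that both ends of the range are inclusive are mine.

```python
import random
import re

# Matches {{{name}}} or {{{name:RAND_INTEGER:[min:max]}}}
PLACEHOLDER = re.compile(r"\{\{\{(\w+)(?::RAND_INTEGER:\[(\d+):(\d+)\])?\}\}\}")


def expand(template, values, rng=random):
    """Replace each placeholder with a supplied value or a random integer."""
    def replace(match):
        name, low, high = match.groups()
        if low is not None:
            # Assumption: both ends of the range are inclusive.
            return str(rng.randint(int(low), int(high)))
        return str(values[name])
    return PLACEHOLDER.sub(replace, template)


# Hypothetical URL paths, first filled from supplied values, then randomised.
print(expand("/api/customer/{{{customerId}}}", {"customerId": 42}))
print(expand("/api/customer/{{{customerId:RAND_INTEGER:[1000:2000]}}}", {}))
```

The first call is what a CSV file or plugin would drive, the second needs no data at all.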


Don't forget to give feedback with issues and feature requests on the GitHub page. Happy load testing!

Sunday, 5 October 2014

What should I do?

Level [C1]

TL;DR: I was charged for a huge egress on one of my VMs and I have no way of knowing what caused it, or whether it was an infrastructure glitch that had nothing to do with the VM.

OK, here is the snippet of the last email I received back:

"I understand what you’re saying. Because this involves a non-windows VM, we wouldn’t be able to determine what caused this. we can only validate the usage, and as you already know, the data usage seems quite appropriate, comparing to our logs. Had this been a Windows machine, we could have engaged another team(s) to have this matter looked into. As of now, I am afraid, this is all we have. You might want to check with Ubuntu support to see what has caused this."

The story started two weeks ago. I have, you know, an MSDN account courtesy of my work, which provides around £95/mo of free Windows Azure credit - for which I am really grateful. It has allowed me to run some kinda pre-startup stuff on a shoestring. I recently realised my free credit can only take me so far, so I started using Azure services more liberally, knowing that I am going to be charged. At the end of the day, nothing valuable comes out of nothing. But before doing that, I also registered for AWS and, as you know, it provides some level of free services which I again took advantage of.

But I have not said anything about the problem yet. It was around the end of the month and I knew my remaining credit would be enough to carry me to the next month. Then I noticed my credit panel turning orange from green (this is quite handy, telling you that at the current rate of usage you will soon run out of credit), which I thought was bizarre, and then the next day I realised all my services had disappeared. Totally gone! Bang! I had run out of credit...

This was a Saturday and I spent Saturday and Sunday reinstating my services. So I learnt the lesson that I need to remove the spending cap, which is not the reason why you are reading this. The reason I ran out of credit was egress (= data out) from one of my Linux boxes... This box used to have an egress of a few MB to at most a few hundred MB a day, and it suddenly shot up to 175GB and 183GB! OK, either there is a mistake or my box has been hacked into - with the latter more likely.

Here is the egress from that "renegade" Linux box:
8/30/2014 "Data Transfer Out (GB)" "GB" 0.004967
8/31/2014 "Data Transfer Out (GB)" "GB" 0.006748
9/1/2014 "Data Transfer Out (GB)" "GB" 0.001735
9/2/2014 "Data Transfer Out (GB)" "GB" 0.17618
9/3/2014 "Data Transfer Out (GB)" "GB" 0.003499
9/4/2014 "Data Transfer Out (GB)" "GB" 0.013394
9/5/2014 "Data Transfer Out (GB)" "GB" 0.016147
9/6/2014 "Data Transfer Out (GB)" "GB" 0.005412
9/7/2014 "Data Transfer Out (GB)" "GB" 0.005803
9/8/2014 "Data Transfer Out (GB)" "GB" 0.001547
9/9/2014 "Data Transfer Out (GB)" "GB" 0.003044
9/10/2014 "Data Transfer Out (GB)" "GB" 0.002179
9/11/2014 "Data Transfer Out (GB)" "GB" 0.02876
9/12/2014 "Data Transfer Out (GB)" "GB" 0.008922
9/13/2014 "Data Transfer Out (GB)" "GB" 0.28983
9/14/2014 "Data Transfer Out (GB)" "GB" 0.099229
9/15/2014 "Data Transfer Out (GB)" "GB" 0.002653
9/16/2014 "Data Transfer Out (GB)" "GB" 0.00191
9/17/2014 "Data Transfer Out (GB)" "GB" 0.00182
9/18/2014 "Data Transfer Out (GB)" "GB" 175.69292
9/19/2014 "Data Transfer Out (GB)" "GB" 182.974478

This box was running an ElasticSearch instance which had barely 1GB of data. And yes, it was not protected so it could have been hacked into. So what I did, with a bunch of bash commands which I conveniently copied and pasted from Google searches, was to create a list of files that were changed on the box, ordered by date, and send it to support. There was nothing suspicious there - and the support team did not find it any more useful [in fact the comment was that it was "poorly formatted", I assume due to the difference in the new line character on Linux :) ].

So it seemed less likely that it was hacked but maybe someone has been running queries against the ElasticSearch which had been secured only by its obscurity. But hang on! If that were the case, the ingress should somehow correspond:
8/30/2014 "Data Transfer In (GB)" "GB" 0.004335
8/31/2014 "Data Transfer In (GB)" "GB" 0.005579
9/1/2014 "Data Transfer In (GB)" "GB" 0.000744
9/2/2014 "Data Transfer In (GB)" "GB" 0.021571
9/3/2014 "Data Transfer In (GB)" "GB" 0.002983
9/4/2014 "Data Transfer In (GB)" "GB" 0.002571
9/5/2014 "Data Transfer In (GB)" "GB" 0.002961
9/6/2014 "Data Transfer In (GB)" "GB" 0.001994
9/7/2014 "Data Transfer In (GB)" "GB" 0.001642
9/8/2014 "Data Transfer In (GB)" "GB" 0.000483
9/9/2014 "Data Transfer In (GB)" "GB" 0.001879
9/10/2014 "Data Transfer In (GB)" "GB" 0.002022
9/11/2014 "Data Transfer In (GB)" "GB" 0.017067
9/12/2014 "Data Transfer In (GB)" "GB" 0.002644
9/13/2014 "Data Transfer In (GB)" "GB" 0.347959
9/14/2014 "Data Transfer In (GB)" "GB" 0.089146
9/15/2014 "Data Transfer In (GB)" "GB" 0.000404
9/16/2014 "Data Transfer In (GB)" "GB" 0.001912
9/17/2014 "Data Transfer In (GB)" "GB" 0.001733
9/18/2014 "Data Transfer In (GB)" "GB" 0.012967
9/19/2014 "Data Transfer In (GB)" "GB" 0.021446

which it does on all days other than the 18th and 19th. Which made me think it was perhaps all a mistake, and maybe an Azure infrastructure agent or something had gone out of control and started doing this.

So I asked support to start investigating the issue. It took a week to get back to me, and the investigation provided only the hourly breakdown (I was hoping for more, perhaps some kind of explanation or the IP address all this egress was going to). The pattern is also bizarre. For example, on the 18th (the day before my credit ran out):
2014-09-18T00:00:00 2014-09-18T01:00:00 DataTrOut 166428 External
2014-09-18T01:00:00 2014-09-18T02:00:00 DataTrOut 374040 External
2014-09-18T02:00:00 2014-09-18T03:00:00 DataTrOut 2588121384 External
2014-09-18T03:00:00 2014-09-18T04:00:00 DataTrOut 539993671 External
2014-09-18T04:00:00 2014-09-18T05:00:00 DataTrOut 1128216 External
2014-09-18T05:00:00 2014-09-18T06:00:00 DataTrOut 25462 External
2014-09-18T06:00:00 2014-09-18T07:00:00 DataTrOut 18308 AM2
2014-09-18T06:00:00 2014-09-18T07:00:00 DataTrOut 63250 External
2014-09-18T07:00:00 2014-09-18T08:00:00 DataTrOut 24588 External
2014-09-18T08:00:00 2014-09-18T09:00:00 DataTrOut 82296 External
2014-09-18T09:00:00 2014-09-18T10:00:00 DataTrOut 59362 External
2014-09-18T10:00:00 2014-09-18T11:00:00 DataTrOut 10573316727 External
2014-09-18T11:00:00 2014-09-18T12:00:00 DataTrOut 11443247791 External
2014-09-18T12:00:00 2014-09-18T13:00:00 DataTrOut 13854724048 External
2014-09-18T13:00:00 2014-09-18T14:00:00 DataTrOut 8115190263 External
2014-09-18T14:00:00 2014-09-18T15:00:00 DataTrOut 13748807057 External
2014-09-18T15:00:00 2014-09-18T16:00:00 DataTrOut 10389478694 External
2014-09-18T16:00:00 2014-09-18T17:00:00 DataTrOut 19979259451 External
2014-09-18T17:00:00 2014-09-18T18:00:00 DataTrOut 21398993891 External
2014-09-18T18:00:00 2014-09-18T19:00:00 DataTrOut 22843598777 External
2014-09-18T19:00:00 2014-09-18T20:00:00 DataTrOut 23087199863 External
2014-09-18T20:00:00 2014-09-18T21:00:00 DataTrOut 16958070173 External
2014-09-18T21:00:00 2014-09-18T22:00:00 DataTrOut 13126214430 External
2014-09-18T22:00:00 2014-09-18T23:00:00 DataTrOut 352327 External
2014-09-18T23:00:00 2014-09-19T00:00:00 DataTrOut 358377 External
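As a sanity check, the hourly figures can be added up and compared with the daily bill. The Python sketch below does that, assuming the hourly values are in bytes and that the bill counts 1 GB as 2^30 bytes (both are my reading of the data, not something support confirmed).

```python
# Hourly egress for 18 September in bytes, copied from the breakdown above
# (the 06:00 hour has two rows, so it contributes two values).
hourly_bytes = [
    166428, 374040, 2588121384, 539993671, 1128216, 25462,
    18308, 63250, 24588, 82296, 59362,
    10573316727, 11443247791, 13854724048, 8115190263,
    13748807057, 10389478694, 19979259451, 21398993891,
    22843598777, 23087199863, 16958070173, 13126214430,
    352327, 358377,
]

total_bytes = sum(hourly_bytes)
total_gib = total_bytes / 2**30  # assumes the bill counts 1 GB as 2^30 bytes
peak_mbit_per_s = max(hourly_bytes) * 8 / 3600 / 1e6

print(f"total: {total_gib:.2f} GB")
print(f"busiest hour averaged {peak_mbit_per_s:.0f} Mbit/s")
```

The total comes to about 175.69 GB, which matches the daily figure for 9/18 in the billing table, so the hourly breakdown is at least consistent with the bill even if it explains nothing.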

So what should I do?

So first of all, I have now put the ElasticSearch box behind a proxy, and access to it requires authentication with the proxy. Better to do it now rather than later. The ES box is now also protected by IPSec.

But really the big question is, when you are on cloud and you don't own any of the infrastructure or its monitoring, how can you make sure you are being charged fairly? My £40 bill for the egress is not huge but makes me wonder, what if it happens again? What would I do?

There are also other questions: would that have been different on another provider? I am not really sure [although at least they could have opened a file with Linux line ending :) ] but the usage of a cloud platform requires building a trust relationship which is essential. I really appreciate the general attitude of Azure (and Microsoft) towards Open Source in embracing everything non-Windows and I think it is the right direction, but I think the support model should be also developed in line with that. AWS is a more mature platform but have you seen anything like this there? And if yes, how was your experience?