Saturday 29 November 2014

Health Endpoint in API Design: slippery slope that it is

Level [C3]

Health Endpoint is a common practice in building APIs. Such an endpoint, unlike other resources of a REST API, instead of achieving a business activity, returns the status of the service and while it can gather and return some data, it is the HTTP status that defines whether the service is "Up or Down". These endpoints commonly go and check a bunch configurations and connectivity with the dependent services, and even make a few calls for a "Test Customer" to make sure business activity can be achieved.

There is something above that just doesn't feel right to me - and this post is an exercise to define what I mean by it. I will explain what are the problems with the Health API and I am going to suggest how to "fix" it.

What is the health of an API anyway? The server up and running and capable of returning the status 200? Server and all its dependencies running and returning 200? Server and all its dependencies running capable of returning 200 in a reasonable amount of time? API able to accomplish some business activity? Or API able to accomplish a certain activity for a test user? API able to accomplish all activities within reasonable time? API able to accomplish all activities with its 95% percentile falling within an agreed SLA?

A Service is a complex beast. While its complexity would be nowhere near a living organism, it is useful to draw a parallel with a living organism. I remember from my previous medical life that the definition of health - provided by none other than WHO - would go like this:
"Health is a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity."
In other words, defining health of an organism is a complex and involved process requiring deep understanding of the organism and how it functions. [Well, we are lucky that we are only dealing with distributed systems and their services (or MicroServices if you like) and not living organisms.] For servies, instead of health, we define the Quality of Service as a quantitative measure of a service's health.

Quality Of Servie is normally a bunch of orthogonal SLAs each defining a measurement for one aspect of the service. In terms of monitoring, Availability of a service is the most important aspect of the service to guage and closely follow. Availability of the service cannot simply be measured by the amount of time the servers dedicated to a service have been up. Apart from being reachable, service needs to respond within acceptable time (Low Latency) and has to be able to achieve its business activity (Functional) - no point server being reachable and return 503 error within milliseconds. So the number of error responses (as a deviation from the baseline which can be normal validation and business rule errors) also come into play.

So the question is how can we, expose an endpoint inside a service that can aggregate all above facets and report the health of a service. Simple answer is we cannot and should not commit ourselves to do it. Why? Let's take some simple help from algebra.
API/Service maps an input domain to an output domain (codomain). Also availability is a function of the output domain.

A service (f) is basically a function that maps the input domain (I) to an output domain (O). So:
O = f(I)
The output domain is a set of all possible responses with their status codes and latencies. Availability (A) is a function (a) of the output domain since it has to aggregate errors, latencies, etc:
A = a(O)
So in other words:
A = a(f(I))
So in other words, A cannot be measured without I - which for a real service is a very large set. And also it needs all of f - not your subset bypass-authentication-use-test-customer method.

So one approach is to sit outside the service and only deal with the output domain in a sort of proxy or monitoring server logs. Netflix have done a ton of work on this and have open sourced it as Hysterix) and no wonder I have not heard anything about the magical Health Endpoint in there (now there is an alternative endpoint which I will explain later). But if you want to do it within the service you need all the input domain and not just your "Test Customer" to make assertions about the health of your service. And this kind of assertion is not just wrong, it is dangerous as I am going to explain.

First of all, gradually - especially as far as the ops are concerned - that green line on the dashboard that checks your endpoint becomes your availability. People get used to trust it and when things go wrong out there and customers jump and shout, you will not believe it for quite a while because your eye sees that green line and trusts it.

And guess what happens when you have such an incident? There will be a post-mortem meeting and all tie-and-suits will be there and they identify the root cause as the faulty health-check and you will be asked to go back and fix your Health Check endpoint. And then you start building more and more complexity into your endpoint. Your endpoint gets to know about each and every dependency, all their intricacies. And before your know it, you could build a complete application beside your main service. And you know what, you have to do it for each and every service, as they are all different.

So don't do it. Don't commit yourself to what you cannot achieve.

So is there no point in having a simplistic endpoint which tells us basic information about the status of the service? Of course there is. Such information are useful and many load balancers or web proxies require such an endpoint.

But first we need to make absolutely clear what the responsibility of such an endpoint is.

Canary Endpoint

A canary endpoint (the name is courtesy of Jamie Beaumont) is a simplistic endpoint which gathers connectivity status and latency of all dependencies of a service. It absolutely does not trigger any business activity, there is no "Test Customer" of any kind and is not a "Health Endpoint". If it is green, it does not mean your service is available. But if it is red (your canary is dead) then you definitely have a problem.



So how does a canary endpoint work? It basically checks connectivity with its immediate dependencies - including but not limited to:
  • External services
  • SQL Databases
  • NoSQL Stores
  • External distributed caches
  • Service brokers (Azure RabbitMQ, Service Bus)
A canary result contains name of the dependency, latency and the status code. If any of the results has non-success code, endpoint returns a non-success code. Status code returned is used by simple callers such as load balancers. Also in all cases, we return a payload which is aggregated canary result. Such results can be used to feed various charts and draw heuristics into significance of variability of the latencies.

You probably noticed that External Services appear in Italic i.e. it is a bit different. Reason is if an external service has a canary endpoint itself, instead of just a connectivity check, we call its canary endpoint and add its aggregated result to the result we are returning. So usually the entry point API will generate a cascade of canary chirps that will tell us how things are.

Implementation of the connectivity check is generally dependent on the underlying technology. For a Cache service, it suffices to Set a constant value and see it succeeding. For a SQL Database a SELECT 1; query is all that is needed. For an Azure Storage account, it would be enough to connect and get the list of tables. The point being here is that none of these are anywhere near a business activity, so that you could not - in the remotest sense - think that its success means your business is up and running.

So there you have it. Don't do health endpoints, do canary instead.

Canary Endpoint implementation

A canary endpoint normally gets implemented as an HTTP GET call which returns a collection of connectivity check metrics. You can abstract the logic of checking various dependencies in a library and allow API developers to implement the endpoint by just declaring the dependencies.

We are currently working on an implementation in ASOS (C# and ASP.NET Web API) and there is possibility of open sourcing it.

Security of the Canary Endpoint

I am in favour of securing Canary Endpoint with a constant API key - normally under SSL. This does not provide highest level of security but it is enough to make it much more difficult to break into. At the end of the day, a canay endpoint lists all internal dependencies, components and potentially technologies of a system that can be used by hackers to target components.

Performance impact of Canary Endpoint

Since canary endpoint does not trigger any business activity, its performance footprint should be minimal. However, since calling the canary endpoint generates a cascade of calls, it might not be wise to iterate through all canary endpoints and just call them every few seconds since deeper canary endpoints in a highly layered architecture get called multiple times in each round. 


Sunday 19 October 2014

Performance Series - How poor performance of HttpContent.ReadAsAsync can affect your API/site

Level [T2]

This has been a revelation - what I am about to reveal here, deeply surprised me - it might surprise you too. This post is mainly about consuming restful APIs using HttpClient and when the payload is JSON.

UPDATE: I got in touch with the ASP.NET team and they confirmed this as a performance bug which has now been fixed but the fix yet not available.

As you probably know performance and benchmarking is very close to my heart and I have been recently focusing on benchmarking a few APIs at work. One of my observations was that the Web APIs/Web Sites which have historically been IO-bound, they show sign of CPU strain and have become CPU-bound.

When you think logically about it, there is no magic here: by using async/await, you end up putting your CPU into some use unlike the old times when the threads are blocked waiting for the IO to return and CPU would be twiddling its thumb. However, I found the CPU overhead of the operations excessive so I set out to benchmark a few different scenarios.

Test Setup

Two APIs were created where one was using the other. These two APIs were part of the same cloud service which was deployed to two separate Medium (A2) web roles. I used 2 different deployments of the same code, one dependent upon version 4.0.30506.0 of the API and the ther one with the latest version which was 5.2.2. Difference between two versions of the Web API is the topic of another post, but the differences were not huge although newer versions showed improved performance.

API being called returns a customer with its orders. Every customer has between 1 to 3 orders and each order between 1-3 items. On the long run, these randomisation gets evened out. Each document returned is between 1-2 KB. So the more superficial API, for every customer, makes one call to get the customer and for each customer will separately call the deeper API once for each order. Then it combines the result and sends back the response. Both APIs are deployed in the same Azure Data Centre.

You can find the whole code at GitHub. The code takes 4 different approaches as below:

public class CustomerController : ApiController
{
    public FullCustomer GetSync(int id)
    {
        var webClient = new WebClient();
        var customerString = webClient.DownloadString(BuildUrl(id));
        var customer = JsonConvert.DeserializeObject<Customer>(customerString);
        var fullCustomer = new FullCustomer(customer);
        var orders = new List<Order>();
        foreach (var orderId in customer.OrderIds)
        {
            var orderString = webClient.DownloadString(BuildUrl(id, orderId));
            var order = JsonConvert.DeserializeObject<Order>(orderString);
            orders.Add(order);
        }
        fullCustomer.Orders = orders;
        return fullCustomer;
    }

    public async Task<FullCustomer> GetASync(int id)
    {
        var webClient = new WebClient();
        var customerString = await webClient.DownloadStringTaskAsync(BuildUrl(id));
        var customer = JsonConvert.DeserializeObject<Customer>(customerString);
        var fullCustomer = new FullCustomer(customer);
        var orders = new List<Order>();
        foreach (var orderId in customer.OrderIds)
        {
            var orderString = await webClient.DownloadStringTaskAsync(BuildUrl(id, orderId));
            var order = JsonConvert.DeserializeObject<Order>(orderString);
            orders.Add(order);
        }
        fullCustomer.Orders = orders;
        return fullCustomer;
    }

    public async Task<FullCustomer> GetASyncWebApi(int id)
    {
        var httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.Add("Accept", "application/json"); 
        var responseMessage = await httpClient.GetAsync(BuildUrl(id));
        var customer = await responseMessage.Content.ReadAsAsync<Customer>();
        var fullCustomer = new FullCustomer(customer);
        var orders = new List<Order>();
        foreach (var orderId in customer.OrderIds)
        {
            responseMessage = await httpClient.GetAsync(BuildUrl(id, orderId));
            var order = await responseMessage.Content.ReadAsAsync<Order>();
            orders.Add(order);
        }
        fullCustomer.Orders = orders;
        return fullCustomer;
    }

    public async Task<FullCustomer> GetASyncWebApiString(int id)
    {
        var httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.Add("Accept", "application/json"); 
        var responseMessage = await httpClient.GetAsync(BuildUrl(id));
        var customerString = await responseMessage.Content.ReadAsStringAsync();
        var customer = JsonConvert.DeserializeObject<Customer>(customerString);
        var fullCustomer = new FullCustomer(customer);
        var orders = new List<Order>();
        foreach (var orderId in customer.OrderIds)
        {
            responseMessage = await httpClient.GetAsync(BuildUrl(id, orderId));
            var orderString = await responseMessage.Content.ReadAsStringAsync();
            var order = JsonConvert.DeserializeObject<Order>(orderString);
            orders.Add(order);
        }
        fullCustomer.Orders = orders;
        return fullCustomer;
    }

    private string BuildUrl(int customerId, int? orderId = null)
    {
        string baseUrl = string.Format("http://{0}:8080/api/customer/{1}", Request.RequestUri.Host, customerId);
        return orderId.HasValue
            ? string.Format("{0}/order/{1}", baseUrl, orderId.Value)
            : baseUrl;
    }

}
So as you can see, we use 4 different methods:

1) Using WebClient in the sync fashion
2) Using WebClient in the async fashion
3) Using HttpClient in the async fashion with ReadAsAsync on HttpContent
4) Using HttpClient in the async fashion with reading content as string and then using JsonConvert to deserialise

I used SuperBenchmarker to invoke the main API which gathers the data from the other API. I used the tool within the same Azure Data Centre from another machine (none of the APIs) to make the tests more realistic yet eliminate network idiosyncrasies.

I used 5000 requests with concurrency of 10 - although I tried other number as well which did not make any material difference in the results.

Results

Here is the result for scenario 1 (sync using WebClient):

TPS:    394 (requests/second)
Max:    199ms
Min:    8ms
Avg:    25ms

50%     below 24ms
60%     below 25ms
70%     below 27ms
80%     below 28ms
90%     below 30ms
95%     below 32ms
98%     below 36ms
99%     below 55ms
99.9%   below 185ms


The result for scenario 2 (Async using WebClient) usually shows better throughput but higher CPU

TPS:    485 (requests/second)
Max:    291ms
Min:    5ms
Avg:    20ms

50%     below 19ms
60%     below 21ms
70%     below 23ms
80%     below 25ms
90%     below 27ms
95%     below 29ms
98%     below 32ms
99%     below 36ms
99.9%   below 284ms

The CPU difference is not huge and can be explained by the increase throughput:

CPU usage during Scenario 1 and 2

Now what surprised me greatly was the result of the third scenario (using HttpContent.ReadAsAsync<T>). Apart from CPU of 100% and signs of queueing, here is the result:

TPS:    41 (requests/second)
Max:    12656ms
Min:    26ms
Avg:    240ms

50%     below 170ms
60%     below 178ms
70%     below 187ms
80%     below 205ms
90%     below 256ms
95%     below 296ms
98%     below 370ms
99%     below 3181ms
99.9%   below 12573ms

Yeah, shocking. The diagram below compares CPU usage between scenario 1 and 3:

CPU usage in scenario 1 (arrow) and 3 (box)

Scenario 4 is definitely better and is not too far from scenario 1 and 2:

TPS:    230 (requests/second)
Max:    7068ms
Min:    7ms
Avg:    43ms

50%     below 20ms
60%     below 22ms
70%     below 24ms
80%     below 26ms
90%     below 29ms
95%     below 34ms
98%     below 110ms
99%     below 144ms
99.9%   below 7036ms

The CPU usage is around 80% and definitely worse that scenario 1 and 2 (which requires further analysis).

Analysis

Where is the problem? It appears that JSON Deserialization when reading from a stream is not efficient. It is possible that the JSON Deserialization has to optimise for memory efficiency rather than CPU efficiency since when the whole string is passed, it is surely much faster. 

Profiling proves that the problem is indeed JSON Deserialization:

Profiling scenario 3 is showing that the most of the CPU time is spent in JSON Deserialisation

So in order to prove that, we do not have to invoke an API. The whole operation can be done inside a Console application. So I used the same code that was generating customers and orders. Here I am comparing

private static void Main(string[] args)
{
    const int TotalRun = 10*1000;

    var customerController = new CustomerController();
    var orderController = new OrderController();
    var customer = customerController.Get(1);

    var orders = new List<Order>();
    foreach (var orderId in customer.OrderIds)
    {
        orders.Add(orderController.Get(1, orderId));
    }

    var fullCustomer = new FullCustomer(customer)
    {
        Orders = orders
    };

    var s = JsonConvert.SerializeObject(fullCustomer);
    var bytes = Encoding.UTF8.GetBytes(s);
    var stream = new MemoryStream(bytes);
    var content = new StreamContent(stream);

    content.Headers.ContentType = new MediaTypeHeaderValue("application/json");

            
    var stopwatch = Stopwatch.StartNew();
    for (int i = 1; i < TotalRun+1; i++)
    {
        var a = content.ReadAsAsync<FullCustomer>().Result;
        if(i % 100 == 0)
            Console.Write("\r" + i);
    }
    Console.WriteLine();
    Console.WriteLine(stopwatch.Elapsed);

    stopwatch.Restart();
    for (int i = 1; i < TotalRun+1; i++)
    {
        var sa = content.ReadAsStringAsync().Result;
        var a = JsonConvert.DeserializeObject<FullCustomer>(sa);
        if (i % 100 == 0)
            Console.Write("\r" + i);
    }
    Console.WriteLine();
    Console.WriteLine(stopwatch.Elapsed);

    Console.Read();

}

As expected, the result shows uncomparable difference, in the order of ~120:

10000
00:00:06.2345493
10000
00:00:00.0509763

So this result basically confirms what we have seen. I will get in touch with James Newton King and try to shed more light on the subject.

Conclusion

HttpContent.ReadAsAsync on JSON payloads is really slow - in the order of 120x compared to JsonConvert. I guess it might to do with the memory efficiency of reading from streams (keeping memory footprint at zero)  but that is a guess and I have been in touch with James Newton King (creator of Json.Net) to get to the bottom of it.

For the meantime, if you know your content is not going to be huge and always in JSON, you might as well forget about content negotiation and read it as a string and then use JsonConvert to deserialize.

Thursday 16 October 2014

SuperBenchmarker v0.4 released

Level [T2]

This is a quick shoutout on the release of version 0.4 of SuperBenchmarker, a Web and/or Web API performance benchmarking command line tool for Windows.

You might have heard about and used Apache Benchmark (ab.exe) in the past which is a very useful tool but on Windows it is very limited (e.g cannot make POST, PUT, etc requests and only supports GET). SuperBenchmarker (sb.exe) supports PUT, DELETE, POST or any arbitrary method and allows you to parameterise the URL and headers using a data file, a .NET DLL plugin and the new feature is the randomisation feature which removes the need for any setup when all needed is random data.


Getting started

The best way to get SuperBenchmarker is to use awesome Chocolatey which is Windows' equivalent of apt-get tool on Linux.

To get Chocolatey, just run this command in your Powershell console (in Administrative mode):
iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))
And then install SuperBenchmarker in the command line shell:
c:\> cinst SuperBenchmarker
And now you are ready to load test:
c:\> sb -u http://google.com
Note: if you are using Visual Studio's command line shell, you cannot use ampersand character (&) and you have to escape it using hat (^).

Using SuperBenchmarker

Normally you would define total number of requests and concurrency:
c:\> sb -u http://google.com -c 10 -n 2000
Statement above runs 2000 requests with concurrency of 10. At the end, you are shown important metrics of the test:
Status 503:    1768
Status 200:    232

TPS: 98 (requests/second)
Max: 11271.1890515802ms
Min: 3.15724613377097ms
Avg: 497.181240820346ms

50% below 34.0499543844287ms
60% below 41.8178295863705ms
70% below 48.7612961478952ms
80% below 87.4385213898198ms
90% below 490.947293319644ms
So the breakdown of the statuses returned, TPS (transaction per second), minimum, maximum and average of the time taken. But more importantly, your percentiles that really should be driving your performance SLAs (90% or 99%). [Never use the average for anything].

In case you need to dig deeper, a log file gets created in the current directory with the name run.log which you can change using -l parameter:
c:\> sb -u http://google.com -c 10 -n 2000 -l c:\temp\mylog.txt
log file is a tab separated file which contains these columns: order number (based on the time started not the time ended), status code, time taken in ms and then any custom parameters that you might have had - see below.

Sometimes when running a test for the first time, something might not have been quite right in which case you can make a dry run/debug using -d parameter that makes a single request and the body of the response will be shown at the end. If you want to see the headers as well, use -h parameter.
c:\> sb -u http://google.com -c 10 -n 2000 -d -h

Supplying request headers or a payload for POST, PUT and DELETE

In order to pass your tailored request headers, a template file needs to be defined which is basically the HTTP request template (minus the first line defining verb and URL and version):
c:\> sb -u http://google.com -t template.txt
And the template.txt contains our custom headers (from the second line of the HTTP request):
User-Agent: SuperBenchmarker
MyCustomHeader: foo-bar;baz=biz
Please note that you don't have to provide headers such as Host and Content-Length - in fact it will raise errors. These headers will be automatically added by the underlying framework.

For using POST, PUT and DELETE we need to supply the verb parameter:
c:\> sb -u http://google.com -v POST
But this request would require a payload as well which we need to supply. We use the template file to supply HTTP payload as well as any headers. Similar to an HTTP request, there must be an empty line between headers and body:
User-Agent: WhateverValueIWant
Content-Type: x-www-formurlencoded

name=value&age=25

Parameterising your requests

Basically you can parameterise your requests using a CSV file containing values, your plugin DLL or by specifying randomisation.

You would define parameters in URL and headers (payload not yet supported but coming soon in 0.5) using SuperBenchmarker's syntax:
{{{MyParameter}}}
As you can see, we use three curly brackets to denote a parameter. For example the statement below defines a customerId parameter:
c:\> sb -u "http://myserver.com/api/customer?customerid={{{customerId}}}^&ignore=false"
Please note quoting the URL and use of ^ to escape & character - if you are using Visual Studio command prompt. To run the test successfully, you need to provide a CSV file containing customerId:
customerId
123,
245,
and use -f option to run the test:
c:\> sb -u "http://myserver.com/api/customer?customerid={{{customerId}}}&ignore=false" -f c:\mypath\values.csv
Alternatively, you can use a plugin DLL to provide values:
c:\> sb -u "http://myserver.com/api/customer?customerid={{{customerId}}}&ignore=false" -p c:\mypath\myplugin.dll
This DLL must have a single public class implementing IValueProvider interface which has a single method:
public interface IValueProvider
{
    IDictionary<string, object> GetValues(int index);
}
For every request implementation of the interface is called and the index of the request is passed to and in return a dictionary of field names with their respective values is passed back.

Now we have a new feature that in most cases alleviates the need for CSV file or plugin and that is the ability to setup random value provider in the definition of the parameter itself:
c:\> sb -u "http://myserver.com/api/customer?customerid={{{customerId:RAND_INTEGER:[1000:2000]}}}&ignore=false"
The parameter above is set up to be filled by a random integer between 1000 and 2000.
Possible value types are:
  • String: using RAND_STRING. Will output random words
  • Date: using RAND_DATE (accepts range)
  • DateTime: using RAND_DATETIME (accepts range)
  • DateTimeOffset: using RAND_DATETIMEOFFSET which outputs ISO dates (accepts range)
  • Double: using RAND_DOUBLE (accepts range)
  • Name: using RAND_NAME. Will output random names

Feedback

Don't forget to feedback with issues and feature requests in the GitHub page. Happy load testing!

Sunday 5 October 2014

What should I do?

Level [C1]

TLDR; : I was charged for a huge egress on one of my VMs and I have no way of knowing what caused it or whether it was an infrastructure glitch nothing to do with VM.

OK, here is the snippet of the last email I received back:

"I understand what you’re saying. Because this involves a non-windows VM, we wouldn’t be able to determine what caused this. we can only validate the usage, and as you already know, the data usage seems quite appropriate, comparing to our logs. Had this been a Windows machine, we could have engaged another team(s) to have this matter looked into. As of now, I am afraid, this is all we have. You might want to check with Ubuntu support to see what has caused this."

The story started two weeks ago. I have, you know, MSDN account courtesy of my work which provides around £95/mo free Windows Azure credit - for which I am really grateful. It has allowed me to run some kinda pre-startup stuff on a shoestring. I recently realised my free credit can take you so far so started using Azure services more liberally knowing that I am going to be charged. At the end of the day, nothing valuable comes out of nothing. But before doing that, I also registered for AWS and as you know, it provides some level of free services which I again took advantage of.

But I have not said anything about the problem yet. It was around the end of the month and I knew my remaining credit would be enough to carry me to the next month. Then I noticed my credit panel turning orange from green (this is quite handy, telling you with the rate of usage you will soon run out of credit) which I thought was bizarre and then next day I realised all my services had disappeared. Totally gone! Bang! I had run out of credit...

This was a Saturday and I spent Saturday and Sunday reinstating my services. So I learnt the lesson that I need remove spending cap, which is not the reason why you read this. The reason I ran out of credit was due to egress (=data out) from one of my Linux boxes... so this box used to have an egress of a few MB to max few hundred MB a day and suddenly shoot up to 175GB and 186GB! OK, either there is a mistake or my box has been hacked into - with the latter more likely.

Here is the egress from that "renegade" Linux box:
8/30/2014 "Data Transfer Out (GB)" "GB" 0.004967
8/31/2014 "Data Transfer Out (GB)" "GB" 0.006748
9/1/2014 "Data Transfer Out (GB)" "GB" 0.001735
9/2/2014 "Data Transfer Out (GB)" "GB" 0.17618
9/3/2014 "Data Transfer Out (GB)" "GB" 0.003499
9/4/2014 "Data Transfer Out (GB)" "GB" 0.013394
9/5/2014 "Data Transfer Out (GB)" "GB" 0.016147
9/6/2014 "Data Transfer Out (GB)" "GB" 0.005412
9/7/2014 "Data Transfer Out (GB)" "GB" 0.005803
9/8/2014 "Data Transfer Out (GB)" "GB" 0.001547
9/9/2014 "Data Transfer Out (GB)" "GB" 0.003044
9/10/2014 "Data Transfer Out (GB)" "GB" 0.002179
9/11/2014 "Data Transfer Out (GB)" "GB" 0.02876
9/12/2014 "Data Transfer Out (GB)" "GB" 0.008922
9/13/2014 "Data Transfer Out (GB)" "GB" 0.28983
9/14/2014 "Data Transfer Out (GB)" "GB" 0.099229
9/15/2014 "Data Transfer Out (GB)" "GB" 0.002653
9/16/2014 "Data Transfer Out (GB)" "GB" 0.00191
9/17/2014 "Data Transfer Out (GB)" "GB" 0.00182
9/18/2014 "Data Transfer Out (GB)" "GB" 175.69292
9/19/2014 "Data Transfer Out (GB)" "GB" 182.974478

This box was running an ElasticSearch instance which had barely 1GB of data. And yes, it was not protected so it could have been hacked into. So what I did, with a bunch of bash commands which I conveniently copied and pasted from google searches, was to create a list files that were changed on the box ordered by the date and send to the support. There was nothing suspicious there - and the support team did not find it any more useful [in fact the comment was that it was "poorly formatted", I assume due to the difference in new line character in linux :) ].

So it seemed less likely that it was hacked but maybe someone has been running queries against the ElasticSearch which had been secured only by its obscurity. But hang on! If that were the case, the ingress should somehow correspond:
8/30/2014 "Data Transfer In (GB)" "GB" 0.004335
8/31/2014 "Data Transfer In (GB)" "GB" 0.005579
9/1/2014 "Data Transfer In (GB)" "GB" 0.000744
9/2/2014 "Data Transfer In (GB)" "GB" 0.021571
9/3/2014 "Data Transfer In (GB)" "GB" 0.002983
9/4/2014 "Data Transfer In (GB)" "GB" 0.002571
9/5/2014 "Data Transfer In (GB)" "GB" 0.002961
9/6/2014 "Data Transfer In (GB)" "GB" 0.001994
9/7/2014 "Data Transfer In (GB)" "GB" 0.001642
9/8/2014 "Data Transfer In (GB)" "GB" 0.000483
9/9/2014 "Data Transfer In (GB)" "GB" 0.001879
9/10/2014 "Data Transfer In (GB)" "GB" 0.002022
9/11/2014 "Data Transfer In (GB)" "GB" 0.017067
9/12/2014 "Data Transfer In (GB)" "GB" 0.002644
9/13/2014 "Data Transfer In (GB)" "GB" 0.347959
9/14/2014 "Data Transfer In (GB)" "GB" 0.089146
9/15/2014 "Data Transfer In (GB)" "GB" 0.000404
9/16/2014 "Data Transfer In (GB)" "GB" 0.001912
9/17/2014 "Data Transfer In (GB)" "GB" 0.001733
9/18/2014 "Data Transfer In (GB)" "GB" 0.012967
9/19/2014 "Data Transfer In (GB)" "GB" 0.021446

which it does in all days other than 18th and 19th. Which made me think, it was perhaps all a mistake and maybe an Azure infrastructure agent or something has gone out of control and started doing this.

So I asked the support to start investigating the issue. And it took a week to get back to me and the investigation provided only the hourly breakdown (and I was hoping for more, perhaps some kind of explanation or identifying the IP address all this egress was going). The pattern is also bizarre. For example on 19th (at the end of which my credit ran out):
2014-09-18T00:00:00 2014-09-18T01:00:00 DataTrOut 166428 External
2014-09-18T01:00:00 2014-09-18T02:00:00 DataTrOut 374040 External
2014-09-18T02:00:00 2014-09-18T03:00:00 DataTrOut 2588121384 External
2014-09-18T03:00:00 2014-09-18T04:00:00 DataTrOut 539993671 External
2014-09-18T04:00:00 2014-09-18T05:00:00 DataTrOut 1128216 External
2014-09-18T05:00:00 2014-09-18T06:00:00 DataTrOut 25462 External
2014-09-18T06:00:00 2014-09-18T07:00:00 DataTrOut 18308 AM2
2014-09-18T06:00:00 2014-09-18T07:00:00 DataTrOut 63250 External
2014-09-18T07:00:00 2014-09-18T08:00:00 DataTrOut 24588 External
2014-09-18T08:00:00 2014-09-18T09:00:00 DataTrOut 82296 External
2014-09-18T09:00:00 2014-09-18T10:00:00 DataTrOut 59362 External
2014-09-18T10:00:00 2014-09-18T11:00:00 DataTrOut 10573316727 External
2014-09-18T11:00:00 2014-09-18T12:00:00 DataTrOut 11443247791 External
2014-09-18T12:00:00 2014-09-18T13:00:00 DataTrOut 13854724048 External
2014-09-18T13:00:00 2014-09-18T14:00:00 DataTrOut 8115190263 External
2014-09-18T14:00:00 2014-09-18T15:00:00 DataTrOut 13748807057 External
2014-09-18T15:00:00 2014-09-18T16:00:00 DataTrOut 10389478694 External
2014-09-18T16:00:00 2014-09-18T17:00:00 DataTrOut 19979259451 External
2014-09-18T17:00:00 2014-09-18T18:00:00 DataTrOut 21398993891 External
2014-09-18T18:00:00 2014-09-18T19:00:00 DataTrOut 22843598777 External
2014-09-18T19:00:00 2014-09-18T20:00:00 DataTrOut 23087199863 External
2014-09-18T20:00:00 2014-09-18T21:00:00 DataTrOut 16958070173 External
2014-09-18T21:00:00 2014-09-18T22:00:00 DataTrOut 13126214430 External
2014-09-18T22:00:00 2014-09-18T23:00:00 DataTrOut 352327 External
2014-09-18T23:00:00 2014-09-19T00:00:00 DataTrOut 358377 External

So what should I do?

So first of all, I have now put the ElasticSearch box behind a proxy and access to it requires authentication with the proxy. And better to do it now rather than later. And the ES box now is protected by IPSec.

But really the big question is, when you are on cloud and you don't own any of the infrastructure or its monitoring, how can you make sure you are being charged fairly. My £40 bill for the egress is not huge but makes me wonder, what if it happens again? What would I do?

There are also other questions: would that have been different on another provider? I am not really sure [although at least they could have opened a file with Linux line ending :) ] but the usage of a cloud platform requires building a trust relationship which is essential. I really appreciate the general attitude of Azure (and Microsoft) towards Open Source in embracing everything non-Windows and I think it is the right direction, but I think the support model should be also developed in line with that. AWS is a more mature platform but have you seen anything like this there? And if yes, how was your experience?


Monday 29 September 2014

Performance Counters for your HttpClient

Level [T2]

Pure HTTP APIs (aka REST APIs) are very popular at the moment. If you are building/maintaining one, you have probably learnt (perhaps the hard way) that having a monitoring on your API is one of your top cross-cutting concerns. This monitoring involves different aspects of the API, one of which is the performance.

There has been many approaches to solving cross-cutting concerns on APIs. Proxying has been a popular one and there has been proliferation of proxy type services such as Mashery or Apigee which basically sit in front of your API and provide an abstraction which can solve your security and access control, monetizing or performance monitoring.

This is a popular approach but comes with its headaches. One of these problems is that if you already have a security mechanism in place, it can clash with the one provided by the service. Also the geographic distribution of these services, although are getting better, is not as good as the one provided by many cloud vendors. This can mean that your traffic could be bouncing across the Atlantic ocean a couple of times before getting to your end users - and this is bad, really really bad. On the other hand, these will not tell you what is happening inside your application which you have to solve differently using classic monitoring approaches. So in a sense, I would say you might as well just byte! the bullet and just implement it yourself.

PerfIt was a library I built a couple of years ago to provide performance counters for ASP.NET Web API. Creating and managing performance counters for windows is not a rocket science but is clumsy and once you do it over and over and for every service, this is just a bit too much overhead. So this was designed to make it really simple for you... well, actually for myself :) I have been using PerfIt over the last two years - in production - and it serves the purpose.

Now, it is all well and good to know what is the performance characteristics of your API. But this gets really more complicated when you have taken a dependency on other APIs and degradation of your API is the result of performance issues in your dependent APIs.




This is really a blame game: considering the fact that each MicroServie is managed by a single team and in an ideal DevOps world, the developers must support their services and you would love to blame another team rather than yourself especially if this is truly the direct result of performance degradation in a dependent service.

One solution is have access to performance metrics of the dependent APIs but really this might not be possible and kinda goes against the DevOps model of operations. On the other hand, what if this is due to an issue in an intermediary - such as a Proxy?

The real solution is to benchmark and monitor the calls you are making out of your API. And I have implemented a new DelegatingHandler to do that measurement for you!


PerfIt! for HttpClient

So HttpClient is the de-facto class for accessing HTTP APIs. If you are using something else then either you have a really really good reason to do so or you are just doing it wrong.

PerfIt for client provides 4 standard counters out of the box:

  • Total # of operations
  • Average time taken (in seconds)
  • Time taken for the last operation (in ms)
  • # of operations per second

These are the 4 counters that you would normally need. If you need another, just get in touch with me but remember these counters must be business-independent.

First step is to install PerfIt using NuGet:
PM> Install-Package PerfIt
And then you just need to install the counters for your application. This can be done by running this simple code (categoryName is the performance counter grouping):
PerfItRuntime.InstallStandardCounters("<categoryName>");
Or by using an installer class as explained on the GitHub page and then running InstallUtil.exe.

Now, just add PerfitClientDelegatingHandler to your HttpClient and make some requests against a couple of websites:
using System;
using System.Net.Http;
using PerfIt;
using RandomGen;

namespace PerfitClientTest
{
    class Program
    {
        static void Main(string[] args)
        {
            var httpClient = new HttpClient(new PerfitClientDelegatingHandler("ClientTest")
            {
                InnerHandler = new HttpClientHandler()
            });

            var randomSites = Gen.Random.Items(new[]
            {
                "http://google.com",
                "http://yahoo.com",
                "http://github.com"
            });
            for (int i = 0; i < 100; i++)
            {
                var httpResponseMessage = httpClient.GetAsync(randomSites()).Result;  
                Console.Write("\r" + i);  
            }
           
        }
    }
}
And now you should be seeing this (we have chosen "ClientTest" for the category name):

So as you can see, instance names are the host names of the APIs and this should provide you with enough information for you to monitor your dependencies. Any deeper information than this and then you really need tracing rather than monitoring - which is a completely different thing...



So as you can see, it is extremely easy to set this up and run it. I might expose the part of the code that defines the instance name which will probably be coming in the next versions.

Please use the GitHub page to ask questions or provide feedbacks.


Wednesday 6 August 2014

Thank you Microsoft, nine months on ...

Level [C1]

I felt that almost a year after my blog post Thank you Microsoft and so long, it is a right time to look back and contemplate on the decision I made back then. If you have not read the post, well in a nutshell, I decided to gradually move towards non-Microsoft technologies - mainly due to Microsoft's lack of innovation, especially in the Big Data space.

A few things have changed since then. TLDR; Generally it really feels I made the right decision to diversify. Personally, I have learnt a lot and at the same time I had a lot of fun. I have built a bridge to the other side and I can easily communicate with non-Microsoft peers and translate my skills. On the technology landscape, however, there has been some major changes that makes me feel having a hybrid skillset is much more important than a complete shift to any particular platform.

I will first look at the technology landscape as it stands now and then will share my personal journey so far in adopting alternative platforms and technologies.

A point on the predictions

OK, I made some predictions in the previous post such as "In 5 years, all data problems become Big Data problems". Some felt this is completely wrong - which could be - and left some not very nice messages. At the end of the day, predictions are free (and that is why I like them) and you could do the same. I am sharing my views, take it as it is worth for you. I have a track record of making predictions, some came true and some did not. I predicted a massive financial crash in 2011 which did not happen and lead to one of the biggest bull markets ever (well my view is they pumped money into the economy and artificially made the bull market) and I lost some money. On the other hand back in 2010 I predicted in my StackOverflow profile something that I think it is called Internet Of Things now, so I guess I was lucky (by the way, I am predicting a financial crash in the next couple of months). Anyway, take it easy :)

Technology Horizon

The New Microsoft

Since I wrote the blog post, a lot has changed, especially in Microsoft. It now has a new CEO and a radically different view on the Open Source. Releasing the source of a big chunk of the .NET Framework is harbinger of a shift whose size is difficult to guess at the moment. Is it mere a gesture? I doubt it, this adoption was the result of years of internal campaign from the likes of Phil Haack and Scott Hanselman and it has finally worked its way up the hierarchy.

But adopting Open Source is not just a community service gesture, it has an important financial significance. With the rate of change in the industry, you need to keep an army of developers to constantly work and improve products at this scale. No company is big enough on its own to build what can be built by an organic and healthy ecosystem. So crowd-sourcing is an important technique to improve your product without paying for the time spent. It is probably true that the community around your product is the real IP of most cloud platforms and not so much the actual code.

Microsoft is also relinquishing its push strategy towards its Operating System and to be honest, I am not surprised at all. Many have talked about the WebOS but reality is we have already had it for the last couple of years. Your small smartphone or tablet come to life when they are connected - enabling you to do most of what you can do on the laptop/pc with the only limitation being the screen size. On the other hand, Microsoft has released the web version of the office and to be fair it is capable of doing pretty much everything you can do in the desktop versions, and sometimes it does it better. So for the majority of consumers, all you need is the WebOS. It feels that the value of a desktop operating system becomes of less and less importance when most of the applications you use daily are web-based or cloud-based.


Cloud and Azure

I have been doing a lot of Azure both at work and outside it. Apart from HDInsight, I think Azure is expanding at a phenomenal rate in both feature and reliability and this is where I feel Microsoft is closing in the Innovation Gap. It is mind-blowing to look at the list of new features that are coming out of Azure every month.

Focusing mainly on the PaaS products, I think future of Azure in terms of adoption by the smaller companies is looking more and more attractive compared to AWS which has traditionally been IaaS platform of choice. Companies like Netflix have built all their software empire on AWS but they had an army of great developers to write the tooling and integration stuff.

All-in-all I feel Azure is here to stay and might even overtake AWS in the next 5 years. What will be a decider is the innovation pace.

Non-Hadoop platforms

A recent trend that could change the balance is the proliferation of non-Hadoop approaches to Big Data which will favour Microsoft and Google. With Hadoop 2.0 trying to abstract away even more the algorithm from the resource management, I think there is an opportunity for Microsoft to jump in and plug-in a whole host of Microsoft languages in a real way - it was possible to use C# and F# before but no one really used it.

Microsoft announced the release AzureML which is the PaaS offering of Machine Learning on the Azure Platform. It is early to say but it looks like this could be used for smaller-than-big-date machine learning and analysis. This platform is basically productionising of the Machine Learning platform behind the Bing search engine.

Another sign that the Hadoop's elephant is getting old is Google's announcement to drop MapReduce: "We invented it and now we are retiring it". Basically in-memory processing looks more and more appealing due to the need for quicker feedback cycle and speeding up processes. Also it seems that there is resurgence of focus towards in-memory grid computing, perhaps as a result of Actor Frameworks popularity.

In terms of technologies, Spark and to a degree Storm are getting a lot of traction and the next few months are essential to confirm the trend. These still very much come from a JVM ecosystem but there is potential in building competitor products.

Personal progress

MacBook

This is the first thing I did after making the decision 9 months ago: I bought a MacBook. I was probably the farthest thing away from being an Apple fanboy, but well it has put its hooks in me too now. I wasn't sure if I should get a Windows laptop and run a Linux VM on it, or buy a MacBook and run Windows VM. Funny enough, and despite my presumptions, I found the second option to be cheaper. In fact I could not find an Windows UltraBook with 16 GB of RAM and that is what I needed to be able to comfortably run a VIM. So buying a 13.3" MacBook pro proved both economical (in the light of what you get back for the money) and the right choice - since you want your VM to be your secondary platform.

Initially I did not like OSX but it helped me to get better at using the command line - be it the OSX variant of Linux commands. Six months on, similar to what some of my twitter friends had said, I don't think I will ever go back to Windows.

I have used Mac for everything apart from using Visual Studio and occasional Visio - also using some Azure tools had to be on Windows. I think I now spend probably only 20% of my time in Windows, the rest in Linux (Azure VM) and OSX.

Linux, Shell scripting and command line

I felt like an ignorant to find out the wealth of command line tools at my disposal in OSX and Linux. Find, Grep, Sort, Sed, tail, head, etc just amazing stuff. I admit, for some there might be windows equivalent that I have not heard of (which I doubt) but it really makes the life so easier to automate and manage your servers. So been working on understanding services on Linux and OSX, learning about Apache and how to configure it... I am no expert by any stretch but it has been fun and learnt a lot.
And yes, I did use VIM - and yes, I did find it difficult to exit it the first time :) I am not mad about it, I just have to use it on Linux VMs I manage configs, etc but cannot see myself using it for development - at least anytime soon.

Languages

As I said the, I had decided to start with some JVM languages. Scala felt the right choice then and with knowing more about it now, even more so. It is powerful yet all the wealth of Java libraries are at your fingertip. It is widely adopted (and Clojure the second candidate not so much). Erlang probably not appropriate now and go is non-JVM. so I am happy with that decision.

Having said that, I could not learn a lot of it. Instead I had to focus on Python for a personal NLP project - well, as you know most NLP and data science tools are on Python. I had to learn to code, understand its OOP and functional side, its versioning and distribution and finally above all being able to serve REST APIs (using Flask and RESTful-Flask) for interop with my other C# code.
My view on it? Python is simple and has a nice built-in support for important data structures (list, map, tuple, etc) making it ideal for working with the data. So it is a very useful language but it is not anywhere near as elegant as Scala or even C#. So for complex stuff, I would still rather coding in C#, until I properly pick up Scala again. I am also not very comfortable with distributing non-compiled code - although that is what we normally do in JavaScript (minimising aside), perhaps another point of similarity between these two.

Apart from these, I have still been doing a ton of C#, as I had predicted in the previous blog post. I have been working on a Cloud Actor Mini-Framework called BeeHive which I am currently using myself. I still enjoy writing C# and am planning to try out Mono as well (.NET on OSX and Linux). Having said that, I feel tools and languages best to be used in their native platform and ecosystem, so I am not sure if Mono would be a viable option for me.

Conclusion

All-in-all I think by embracing the non-Microsoft world, I have made the right decision. A new world has been suddenly opened up for me, a lot of exciting things to learn and to do. I wish I had done this earlier.

Would I think I will completely abandon my previous skills? I really doubt it: The future is not mono-colour, it is a democratised hybrid one, where different skillsets will result in cross-pollinisation and producing better software. It feel having a hybrid skill is becoming more and more important, and if you are looking to position yourself better as a developer/architect, this is the path you need to take. Currently cross-platform/hybrid skills is a plus, in 5 years it will be a necessity.

Wednesday 11 June 2014

BeeHive Series - Part 3: BeeHive 0.5, RabbitMQ and more

Level [T4]

BeeHive is a friction-free library to do Reactor Cloud Actors - effortlessly. It defines abstractions for the message, queue and the actors and all you have to do is to define your actors and connect their dots using subscriptions. If it is the first time you read about BeeHive, you could have a look at previous posts but basically a BeeHive Actor (technically Processor Actor) is very simple:
[ActorDescription("TopicName-SubscriptionName")]
public class MyActor : IProcessorActor
{
  public Task<IEnumerable<Event>> ProcessAsync(Event evnt)
  {
    // impl
  }
}
All you do is to consume a message, do some work and then return typically one, sometimes zero and rarely many events back.
A few key things to note here.

Event

First of all Event, is an immutable, unique and timestamped message which documents a significant business event. It has a string body which normally is a JSON serialisation of actual message object - but it does not have to be.

So usually messages are arbitrary bytes, why here it is a string? While it was possible to use byte[], if you need to send binary blobs or you need custom serialisation, you are probably doing something wrong. Bear in mind, BeeHive is targeted at solutions that require scale, High Availability and linearisation. If you need to attach a big binary blob, just drop it in a key value store using IKeyValueStore and put the link in your message. If it is small, use Base64 encoding. Also your messages need to very simple DTOs (and by simple I do not mean small, but a class with getters and setters), if you are having problem serialising them then again, you are doing something wrong.

Queue naming

BeeHive uses a naming conventional for queues, topics and subscriptions. Basically it is in the format of TopicName-SubscriptionName. So there are a few rules with this:
  • Understandably, TopicName or SubscriptionName should not contain hyphens
  • If the value of TopicName and SubscriptionName is the same, it is a simple queue and not a publish-subscribe queue. For example, "OrderArrived-OrderArrived"
  • If you leave off the SubscriptionName then you are referring to the topic. For example "OrderArrived-".
Queue name is represented by the class QueueName. If you need to construct queue names using static methods:

var n1 = QueueName.FromSimpleQueueName("SimpleQ"); // "SimpleQ-SimpleQ"
var n2 = QueueName.FromTopicName("Topic"); // "Topic-"
var n3 = QueueName.FromTopicAndSubscriptionName("topic", "Sub"); // "Topic-Sub"

There is a QueueName property on the Event class. This property defines where to send the event message. The queue name must be the name of the topic or simple queue name.

IEventQueueOperator

This interface got some make over in this release. I have not been happy the interface as it had some inconsistencies - especially in terms of creating . Thanks to Adam Hathcock who reminded me, now this is done.

With QueueName ability of differentiating topics and simple queue, this value needs to be either name of the simple queue (in the example above "SimpleQ") or the conventional topic name (in the example above "Topic-").

So here is the interface(s) as it stands now:

public interface ISubscriptionOperator<T>
{
    Task<PollerResult<T>> NextAsync(QueueName name);
    Task AbandonAsync(T message);
    Task CommitAsync(T message);
    Task DeferAsync(T message, TimeSpan howLong);
}

public interface ITopicOperator<T>
{
    Task PushAsync(T message);
    Task PushBatchAsync(IEnumerable<T> messages);
}

public interface IQueueOperator<T> : ITopicOperator<T>, ISubscriptionOperator<T>
{
    Task CreateQueueAsync(QueueName name);
    Task DeleteQueueAsync(QueueName name);
    Task<bool> QueueExists(QueueName name);
}

public interface IEventQueueOperator : IQueueOperator<Event>
{
}
Main changes were made to IQueueOperator<T> passing the QueueName which made it simpler.

RabbitMQ Roadmap

BeeHive targets cloud frameworks. IEventQueueOperator and main data structures have been implemented for Azure. Next is AWS.

Amazon Web Services (AWS) provides Simple Queue Service (SQS) which only supports simple send-receive scenarios and not Publish-Subscribe cases. With this in mind, it is most likely that other message brokers will be used although a custom implementation of pub-sub based on Simple Notification Service (SNS) has been reported. Considering RabbitMQ is by far the most popular message broker out there (is it not?) it is sensible to pick this implementation first.

RabbitMQ client for .NET has a very simple API and working with it is very easy. However, the connection implementation has a lot to be desired. EasyNetQ has a sophisticated connection implementation that covers dead connection refreshes and catering for round-robin in case of High-Availability scenario. Using a full framework to just the connection is not really an option hence I need to implement something similar.

So for now, I am realising an alpha version without the HA and connection refresh to get community feedback. So please do ping me what you think.

Since this is a pre-release, you need to use -Pre to get it installed:

PM> Install-Package BeeHive.RabbitMQ -Pre

Tuesday 3 June 2014

Cancelling an async HTTP request Task sends TCP RESET packet

Level [T4]

This blog post did not just happen. In fact, never, if ever, something just happens. There is a story behind everything and this one is no different. Looking back, it feels like a nice find but as the story was unfolding, I was running around like a headless chicken. Here we have the luxury of the hindsight so let's take advantage of it.

TLDR; If you are a sensible HTTP client and make your HTTP requests using cancellable async Tasks by passing a CancellationToken, you could find your IP blocked by legacy bridge devices blacklisting clients sending TCP RESET packets.

So here is how it started ...

So we were supposed to go live on Monday - some Monday. Talking of live, it was not really live - it was only to internal users but considering the high profile of the project, it felt like the D-Day. All VPs knew of the release and were waiting to see a glimpse of the project. Despite the high profile, it was not properly resourced, I despite being so called architect , pretty much singled handedly did all the API and the middleware connecting the Big Data outputs with the Single Page Application.

We could not finish going live on Monday so it moved to Tuesday. Now on Tuesday morning we were all ready and I set up my machine's screen like traders with all performance monitors up on the screen looking at users. With using the cloud Azure, elasticity was the option although the number of internal users could hardly make a dent on the 3 worker roles. So we did go live, and, I could see traffic building up and all looked fine. Until ... it did not.

I saw requests queuing up and loading the page taking longer and longer. Until it was completely frozen. And we had to take the site down. And that was not good.

Server Analysis

I brought up DebugView and was lucky to see this (actual IP and site names anonymised):

[1240] w3wp.exe Error: 0 :
[1240] <html>
[1240] <h1>Access Administratively Blocked</h1>
[1240] <br>URL : 'http://www.xyz.com/services/blahblah'
[1240] <br>Client IP address : 'xyz.xx.yy.zzz'
[1240] </html>

So we are being blocked! Something is blocking us and this could be because we used an UI data endpoint as a Data API. Well I knew it is not good but as I said we had a limited time and in reality that data endpoint was meant to support our live traffic.

So after a lot of to and fro with our service delivery and some third party support, we were told that our software was recognised as malicious since it was sending way too many TCP RESET packets. Whaa?? No one ain't sending no TCP whatever packets, we are using a high level language (C#) and it is the latest HttpClient implementation. We are actually using many optimising techniques such as async calls, parallelisation, etc to make the code as efficient as possible. We also used short timeout+ retry which is Netflix's approach to improve performance.

But what is TCP RESET packets? Basically a RESET packet is one that has the RESET flag set (which is otherwise unset) and tells the server to drop the TCP connection immediately and reclaim all the resources associated with it. There is an RFC from back in 2002 that considers RESET harmful. Wikipedia's article argues that when used as designed, it is useful but forged RESET can disrupt the communication between the client and server. And Microsoft's technet blog on the topic says "RESET is actually a good thing".

And in essence, I would agree with the Microsoft (and Wikipedia's) account that sending RESET packet is what a responsible client would do. Let's imagine you are browsing a site using a really bad wifi connection. The loading of the page takes too long and you frustrated by the slow connection, cancel browsing by pressing the X button. At this point, a responsible browser should tell the server it has changed its mind and is not interested in the response. This will let the server use its resources for a client that is actually waiting for the server's response.

Now going back to the problem at hand, I am not a TCP expert by any stretch - I have always used higher level constructs and never had to go down so deep in the OSI model. But my surprise was, what is different now with my code that with a handful calls I was getting blocked while the live clients work well with no problem with significantly larger number of calls?

I had a hunch that it probably has to do with the some of the patterns I have been using on the server. And to shorten the suspense, the answer came from the analysis of TCP packets when cancelling an async HTTP Task. The live code uses the traditional synchronous calls - none of the fancy patterns I used. So let's look at some sample code that cancels the task if it takes too long:

var client = new HttpClient();
var buffer = new byte[5 * 1000 * 1000];
// you might have to use different timeout or address
var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(300)); /
try
{
    var result = client.GetAsync("http://www.google.com",
    cts.Token).Result;
    var s = result.Content.ReadAsStreamAsync().Result;

    var result1 = s.ReadAsync(buffer, 0, buffer.Length, cts.Token).Result;
    ConsoleWriteLine(ConsoleColor.Green, "Got it");
}
catch (Exception e)
{
    ConsoleWriteLine(ConsoleColor.Red, "error! " + e);
}

In this snippet, we are calling the google server and set a 300ms timeout (which you might have to modify the timeout or the address based on your connection speed, in order to see the cancellation). Here is a WireShark proof:



As you can see above a TCP RESET packet has been sent - if you have set the parameters in a way that the request does not complete before its timeout and gets cancelled. You can try this with a longer timeout or use a WebClient which is synchronous and make sure you will never ever see this RST packet.

Now the question is, should a network appliance pick on this responsible cancellation and treat it as an attack? By no means. But in my case, it did and it is very likely that it could do that with yours.

My solution came by whitelisting my IP against "TCP RESET attacks". After all, I was only trying to help the server.

Conclusion

Cancelling an HTTP async Task in the HttpClient results in sending TCP RESET which is considered malicious by some network appliances resulting in blacklisting your IP.

PS. The network appliance belonged to our infrastructure 3rd party provider whose security managed by another third party - it was not in Azure. The real solution would have been to remove such crazy rule, but anyhow, we developers don't always get what we want.


Monday 2 June 2014

BeeHive Series - Part 2 - Importing file from blob storage to ElasticSearch sample

[Level T1]

In the previous post, we introduced BeeHive and talked about an example usage where we check news feeds and send a notification if a keyword is found. In this post, we look at another example. You can find the source code in the BeeHive Github repo. Just open up BeeHive.Samples.sln file.

Processing files

Let's imagine we receive files in a particular blob location and we need to import/process them into the system. These files arrive in a particular folder structure and we need to watch the root folder. Then we need to pick them up, extract each row and send each record to be processed - in this case to be loaded onto an ElasticSearch cluster.
ElasticSearch is a horizontally-scalable and highly-available indexing and search technology. It runs on Windows, Linux and OSX, easy to setup and free to use. You can download the installer from http://www.elasticsearch.org/

NewFileArrived

So here, we design a system that watches the location and when it finds the files, it raises NewFileArrived event. This is a simple enough process yet what if we have multiple actors watching the location (very likely for a cloud scenario where the same process runs on many machines)? In this case we will receive multiple NewFileArrived events.
BeeHive provides pulsers that help you with your concurrency problems. FolderWatcherActor can subscribe to a topic that is fed by a pulser. In fact, in a BeeHive world, you could have pulsers that raise events at different intervals and raise events such as FiveMinutesPassedAnHourPassedADayPassed, etc and based on the requirement, your actors could be subscribing to any of these. Beauty ofmessage-based scheduling is that only a single instance of the actor will be receiving the message.
Raising the NewFileArrived event is not enough. When the actor wakes up again by receiving the next message and the file is there, it will send another NewFileArrived error. We can protect against this by:
1) Making processing Idempotent 2) Keep track of files received 3) Mark files by creating a status file next to them
We choose the last option so we can use the same status file further down. So after identifying the file, we create a file with the same name plus .status and write the status number, here 1.

public async Task<IEnumerable<Event>> ProcessAsync(Event evnt)
{
    var events = new List<Event>();
    var items = (await _dynamoStore.ListAsync(
        _configurationValueProvider.GetValue(Constants.SweepRootPathKey)))
        .ToArray();

    var notProcessed = items.Where(x => !x.IsVirtualFolder)
        .GroupBy(z => z.Id.Replace(Constants.StatusPostfix, ""))
        .Where(f => f.Count() == 1)
        .Select(w => w.Single());

    foreach (var blob in notProcessed)
    {
        events.Add(new Event(new NewFileArrived()
        {
            FileId = blob.Id
        }));
        await _dynamoStore.InsertAsync(new SimpleBlob()
        {
            Id = blob.Id + Constants.StatusPostfix,
            Body = new MemoryStream(BitConverter.GetBytes(1)) // status 1
        });
    }

    return events;
}

Process the file: fan-out the records

After receiving the NewFileArrived, we copy the file locally and split the file to the records and fan out the records with ImportRecordExtracted. We also send a ImportFileProcessed event.
public async Task<IEnumerable<Event>> ProcessAsync(Event evnt)
{
    var newFileArrived = evnt.GetBody<NewFileArrived>();
    var blob = await _dynamoStore.GetAsync(newFileArrived.FileId);
    var reader = new StreamReader(blob.Body);
    string line = string.Empty;
    var events = new List<Event>();
    while ((line = reader.ReadLine())!= null)
    {
        var fields = line.Split(new []{','},StringSplitOptions.RemoveEmptyEntries);
        events.Add(new Event( new ImportRecordExtracted()
        {
            Id = fields[0],
            Content = fields[2],
            IndexType = fields[1]
        }));
    }

    events.Add(new Event(new ImportFileProcessed()
    {
        FileId = newFileArrived.FileId
    }));

    return events;
}

ImportFileProcessed

The actor receiving this event will delete the file and the status file.
public async Task<IEnumerable<Event>> ProcessAsync(Event evnt)
{
    var importFileProcessed = evnt.GetBody<ImportFileProcessed>();
    var statusFile = importFileProcessed.FileId + Constants.StatusPostfix;

    await _dynamoStore.DeleteAsync(new SimpleBlob()
    {
        Id = importFileProcessed.FileId
    });
    await _dynamoStore.DeleteAsync(new SimpleBlob()
    {
        Id = statusFile
    });

    return new Event[0];
}

ImportRecordExtracted

Based on the type of the record, we "upsert" the record in the appropriate index in our ElasticSearch cluster.
public async Task> ProcessAsync(Event evnt)
{
    var importRecordExtracted = evnt.GetBody();
    var elasticSearchUrl = _configurationValueProvider.GetValue(Constants.ElasticSearchUrlKey);

    var client = new HttpClient();
    var url = string.Format("{0}/import/{1}/{2}", elasticSearchUrl,
        importRecordExtracted.IndexType,
        importRecordExtracted.Id);
    var responseMessage = await client.PutAsJsonAsync(url, importRecordExtracted);

    if (!responseMessage.IsSuccessStatusCode)
    {
        throw new ApplicationException("Indexing failed. " 
            + responseMessage.ToString());
    }

    return new[]
    {
        new Event(new NewIndexUpserted()
        {
            IndexUrl = url
        }) 
    };
}

NewIndexUpserted

While we currently do not need to know when we add or update an index in the ElasticSearch, this can later be used by other processes, so it is best to provision the event. As we said before, BeeHive events are meaningful business milestones that may or may not be used by your current system.

Here are our indexes when browsing to http://localhost:9200/import/_search

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 14,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "import",
      "_type" : "D",
      "_id" : "4",
      "_score" : 1.0, "_source" : {"Id":"4","IndexType":"D","Content":"These are controlled by min_term_freq"}
    }, {
      "_index" : "import",
      "_type" : "E",
      "_id" : "9",
      "_score" : 1.0, "_source" : {"Id":"9","IndexType":"E","Content":"There are other parameters such as min_word_length"}
    }, {
      "_index" : "import",
      "_type" : "E",
      "_id" : "11",
      "_score" : 1.0, "_source" : {"Id":"11","IndexType":"E","Content":"In order to give more weight to more interesting terms"}
    }, {
      "_index" : "import",
      "_type" : "A",
      "_id" : "2",
      "_score" : 1.0, "_source" : {"Id":"2","IndexType":"A","Content":"clauses in a bool query of interesting terms extracted from some provided text. "}
    }, {
      "_index" : "import",
      "_type" : "D",
      "_id" : "7",
      "_score" : 1.0, "_source" : {"Id":"7","IndexType":"D","Content":"controlled by percent_terms_to_match. The terms are extracted from like_text "}
    }, {
      "_index" : "import",
      "_type" : "H",
      "_id" : "14",
      "_score" : 1.0, "_source" : {"Id":"14","IndexType":"H","Content":"score times some boosting factor boost_terms."}
    }, {
      "_index" : "import",
      "_type" : "B",
      "_id" : "3",
      "_score" : 1.0, "_source" : {"Id":"3","IndexType":"B","Content":"The interesting terms are selected with respect to their tf-idf scores. "}
    }, {
      "_index" : "import",
      "_type" : "D",
      "_id" : "8",
      "_score" : 1.0, "_source" : {"Id":"8","IndexType":"D","Content":"which is analyzed by the analyzer associated with the field"}
    }, {
      "_index" : "import",
      "_type" : "E",
      "_id" : "10",
      "_score" : 1.0, "_source" : {"Id":"10","IndexType":"E","Content":"max_word_length or stop_words to control what terms should be considered as interesting. "}
    }, {
      "_index" : "import",
      "_type" : "D",
      "_id" : "5",
      "_score" : 1.0, "_source" : {"Id":"5","IndexType":"D","Content":"The number of interesting terms is controlled by max_query_terms. "}
    } ]
  }
}


Cleanup processes

In the absence of transactions, business processes have to design the processes for failure. BeeHive promotes an approach that every process is broken down to its smallest elements and each implemented in an actor.

Sometimes it is necessary to design processes that look for the highly unlikely (yet possible) event of a failure when actor has done its work but the events returned never make it back to the service bus. In the case of inserting the new index, this is not a problem since we use PUT and the process is idempotent. However, this could be a problem in case of processing file where a status file is created but NewFileArrived never makes it back to the service bus. In this case, a crash unlocker process that checks the timestamp of the status file and deletes the file if older than e.g. 1 day, is all that is needed.

Conclusion

We can use pulsers to solve the inherent concurrency problem of multiple folder watcher actors watching the same folder. The fan-out process of breaking a file down to its record and parallilising the processing is one of the key benefits of cloud actors.



Thursday 22 May 2014

BeeHive Series - Part 1 - Getting started with BeeHive

[Level T1]

I feel that BeeHive has been the most important Open Source project I have been involved so far. Not because it is my latest project, but because I feel the potential is huge.

The infoq article that came out on Monday is pretty much a theoretical braindump on the Reactor Actor Model. I had expected this to stir up so much controversy as the claims I am making are pretty big - this has not happened yet. Maybe the text does not flow well or it is just early. In any case, the idea is clear: sticking to Processor Actors and build a web of loosely connected events to fulfil a system's business requirement while maintaining fluid evolvability. However, I fear the article was perhaps too dry and long and did not fully demonstrate the potential.

Hence I have set out to start a series on BeeHive and show the idea in some tangible scenarios "Show me the codz" - with the code easily accessible from GitHub. So now let's roll on.

BeeHive

BeeHive makes it ridiculously simple (and not necessarily easy) to build decoupled actors that together achieve a business goal. By using topic-based subscription, you can easily add actors that feed on an existing event and do something extra.
Here we will build such systems will a few lines of code. You can find the full solution in the samples folder in the source. The code below is mainly is a snippet (e.g does not have the Dispose() methods for removing clutter).

Scenario 1: News Feed Keyword Notification

Let's imagine you are interested in to know all breaking news from one or several news feeds and would like to be notified when a certain keyword occurs in the news. In this case we design a series of reactive actors that achieve this while allowing for other functionality to be built on top of existing topic based queues. We use Windows Azure for this example. In order to achieve this we need to regularly (e.g. every 5 minutes), check the news feed (RSS, ATOM, etc) and examine new items arrived and look for the interested keyword(s) and then perhaps tweet, send email or SMS text.

Pulsers: Activities on regular intervals

BeeHive works on a reactive event-based model. Instead of building components that regularly do some work, we can have generic components that regularly fire an event. These events then can be subscribed to by one or more actors to fire off the processing by a chain of decoupled actors.
BeeHive Pulsers do exactly that. On regular intervals, they are woken up to fire off their events. Simplest of pulsers, are assembly attribute ones:
[assembly: SimpleAutoPulserDescription("NewsPulse", 5 * 60)]
The code above sets up a pulser that every 5 minutes sends an event of type NewsPulse with empty body.

Dissemination from a list: Fan-out pattern

Next we set up an actor to look up a list of feeds (one per each line) and send an event per each feed. We have stored this list in a blob storage which is abstracted as IKeyValueStore in BeeHive.
[ActorDescription("NewsPulse-Capture")]
public class NewsPulseActor : IProcessorActor
{
  private IKeyValueStore _keyValueStore;
  public NewsPulseActor(IKeyValueStore keyValueStore)
  {
    _keyValueStore = _keyValueStore;
  }
  public async Task<IEnumerable<Event>> ProcessAsync(Event evnt)
  {
    var events = new List<Event>();
    var blob = await _keyValueStore.GetAsync("newsFeeds.txt");
    var reader = new StreamReader(blob.Body);
    string line = string.Empty;
    while ((line = reader.ReadLine())!=null)
    {
      if (!string.IsNullOrEmpty(line))
      {
        events.Add(new Event(new NewsFeedPulsed(){Url = line}));
      }
    }
    return events;
  }
}
ActorDescription attribute used above signifies that the actor will receive its events from the Capture subscription of the NewsPulse topic. We will be setting up all topics later using a single line of code.
So we are publishing NewsFeedPulsed event for each news feed. When we construct an Event object using the typed event instance, EventType and QueueName will be set to the type of the object passed - which is the preferred approach for consistency.

Feed Capture: another Fan-out

Now we will be consuming these events in the next actor in the chain. NewsFeedPulseActor will subscribe to NewsFeedPulsed event and will use the URL to get the lastest RSS feed and look for latest news. To prevent from duplicate notifications, we need to know what was the most recent tem we checked last time. We will store this offset in a storage. For this use case, we choose ICollectionStore<T> which its Azure implementation uses Azure Table Storage.
[ActorDescription("NewsFeedPulsed-Capture")]
public class NewsFeedPulseActor : IProcessorActor
{
    private ICollectionStore<FeedChannel> _channelStore;
    public NewsFeedPulseActor(ICollectionStore<FeedChannel> channelStore)
    {
        _channelStore = channelStore;
    }
    public async Task<IEnumerable<Event>> ProcessAsync(Event evnt)
    {
      var newsFeedPulsed = evnt.GetBody<NewsFeedPulsed>();
      var client = new HttpClient();
      var stream = await client.GetStreamAsync(newsFeedPulsed.Url);
      var feedChannel = new FeedChannel(newsFeedPulsed.Url);
      var feed = SyndicationFeed.Load(XmlReader.Create(stream));
      var offset = DateTimeOffset.MinValue;
      if (await _channelStore.ExistsAsync(feedChannel.Id))
      {
        feedChannel = await _channelStore.GetAsync(feedChannel.Id);
        offset = feedChannel.LastOffset;
      }
      feedChannel.LastOffset = feed.Items.Max(x => x.PublishDate);
      await _channelStore.UpsertAsync(feedChannel);
      return feed.Items.OrderByDescending(x => x.PublishDate)
        .TakeWhile(y => offset < y.PublishDate)
        .Select(z => new Event(new NewsItemCaptured(){Item = z}));
    }
  }
}
Here we read the URL from the event and capture the RSS and then get the last offset from the strorage. We then send the captured feed items back as events for whoever is interested. At the end, we set the offset.

Keyword filtering and Notification

At this stage we need to subscribe to NewsItemCaptured and check the content for specific keywords. This is only one potential subscription out of many. For example one actor could be subscribing to the event to store these for further retrieval, another to process for trend analysis, etc.
So for the sake of simplicity, let's hardcode the keyword (in this case "Ukraine") but it could have been equally loaded from a storage or a file - as we did with the list of feeds.
[ActorDescription("NewsItemCaptured-KeywordCheck")]
public class NewsItemKeywordActor : IProcessorActor
{
    private const string Keyword = "Ukraine";
    public async Task<IEnumerable<Event>> ProcessAsync(Event evnt)
    {
        var newsItemCaptured = evnt.GetBody<NewsItemCaptured>();
        if (newsItemCaptured.Item.Title.Text.ToLower()
            .IndexOf(Keyword.ToLower()) >= 0)
        {
            return new Event[]
            {
                new Event(new NewsItemContainingKeywordIentified()
                {
                    Item = newsItemCaptured.Item,
                    Keyword = Keyword
                }),
            };
        }
        return new Event[0];
    }
}
Now we can have several actors listening for NewsItemContainingKeywordIentified and send different notifications, here we implement a simple Trace-based one:

[ActorDescription("NewsItemContainingKeywordIentified-TraceNotification")]
public class TraceNotificationActor : IProcessorActor
{
  public async Task<IEnumerable<Event>> ProcessAsync(Event evnt)
  {
    var keywordIentified = evnt.GetBody<NewsItemContainingKeywordIentified>();
    Trace.TraceInformation("Found {0} in {1}",
        keywordIentified.Keyword, keywordIentified.Item.Links[0].Uri);
    return new Event[0];
  }
}

Setting up the worker role

If you have an Azure account, you need a storage account, Azure Service Bus and a worker role (even an Extra Small instance would suffice). If not, you can use development emulators although for the Service Bus you need to use Service Bus for windows. Just bear in mind, with local emulators and Service Bus for Windows, you have to use special versions of Azure SDK - latest versions usually do not work.
We can get a list of assembly pulsers by the code below:

_pulsers = Pulsers.FromAssembly(Assembly.GetExecutingAssembly())
    .ToList()

Also we need to create an Orchestartor to set up factory actors. We need to call SetupAsync() to set up all the topics and subscriptions.
Also we need to register our classes against Dependency Injection framework.

Now we are ready!

After running the application in debug mode, here is what you see in the output window:
Found Ukraine
Obviously, we can add another actor to subscribe to this event to send email, SMS messages, you name it. Point being, it is a piece of cake. Checking the links, and they indeed are about Ukraine:


And this one:


In the next post, we will discuss another scenario.