Monday, 4 June 2018

CacheCow.Server 2.0: Using it on ASP.NET Core MVC

CacheCow 2.0 Series:

Part 1 - CacheCow 2.0 is here - supporting .NET Standard and ASP.NET Core MVC
Part 2 - CacheCow.Client 2.0: HTTP Caching for your API Calls
Part 3 - CacheCow.Server 2.0: Using it on ASP.NET Core MVC [This post]
Part 4 - CacheCow.Server for ASP.NET Web API [Coming Soon]
Epilogue: side-learnings from supporting Core [Coming]

HTTP Caching on the server

In HTTP Caching, the server has two main responsibilities:

Supplying cache directives - including Cache-Control (and ETag, Last-Modified and Vary) headers
Performing conditional validating requests

These two have been explained in full in Part 1 of this series (it might be useful to quickly review that post to refresh your memory).

In CacheCow 1.x, it was assumed that resources could be updated only through the API, i.e. all modifications had to go through the API. If modifications were made outside the API, the agent making them was responsible for invalidating the API cache: obviously this is not always possible or architecturally correct. As I explained in Part 1, this was a big assumption, and even with such an assumption in place, the interaction between various resources made cache invalidation very difficult for the API layer: a change to a single resource invalidates its collection resource and, in the case of hierarchical resources, a change to a leaf resource invalidates the resources higher up.

CacheCow 2.x solves these problems by making the data and data providers take part in cache validation through a couple of new abstractions. The following two sections are essential to understanding the full value you get from CacheCow.Server, apart from setting the Cache-Control header, which is frankly not rocket science. But if you get bored and want to see some code, feel free to skip to the Getting Started section and come back to these later on.

ASP.NET Core supports HTTP Caching, why not use that instead?

ASP.NET Core has provided several primitives for HTTP Caching to:

Generate cache directives (and respond to conditional requests)
Caching middleware to cache representations on the server (an in-process HTTP proxy)

You are, of course, free to use these primitives. At the end of the day, generating cache directives requires a bunch of if/else statements and reading from some config (for the max-age, etc). But bear in mind these points:

ASP.NET Core caching has started where CacheCow 0.x was. It suffers essentially from some of the same issues as CacheCow 0.x/1.x, where the API layer is unaware of cache invalidation when it happens outside the API layer or when there is a relationship between different resources. In fact CacheCow 1.x was capable of understanding the relationship between single and collection resources as long as you adhered to a simple naming convention, and it also provided an invalidation mechanism. AFAIK these do NOT exist in ASP.NET Core caching and this could be potentially harmful if you care about the consistency of your resources. (Consistency was explained in Part 1)
CacheCow.Server now provides the constructs to make the data source (and not the API) the authority with regard to ETag and conditional validating calls - as is normally the case. This also solves the problem of cache invalidation when data is modified but not through the API - perhaps on arrival of new batch data or when the underlying data store is exposed via different mechanisms, which is common in the industry.
To my knowledge, caching middleware does not do validation. Again if you care about consistency of your resources then this is probably not for you.

The new shiny CacheCow.Server 2.x

As explained, CacheCow.Server moves controlling of cache validation to where it belongs: your data and data providers. And in order to make that seamless and not pollute your API, these have been abstracted away to several constructs. These are explained in more detail on github so check that out if you need the next level details.

ITimedETagExtractor

CacheCow blends the versioning identity of a resource into TimedETag (an Either of Last-Modified or ETag). ITimedETagExtractor extracts TimedETag from a resource. Extraction by default uses a SHA1 hash of the JSON-serialised response payload (which has understandable overhead) but it is very easy for you to provide an alternative mechanism for extracting the TimedETag. Many database tables have a timestamp column which you can use as Last-Modified. But since HTTP dates do not have sub-second accuracy, and this might mean missing updates done in the same second, it is best if you turn that into an opaque hash-like value as ETag - e.g. this:

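// Requires: using System; using System.Linq;
public static string TurnDatetimeOffsetToETag(DateTimeOffset dateTimeOffset)
{
    // 8 bytes of UTC ticks: keeps the sub-second precision an HTTP date would lose
    var dateBytes = BitConverter.GetBytes(dateTimeOffset.UtcDateTime.Ticks);
    // 2 bytes for the offset in whole hours (the cast truncates fractional-hour offsets)
    var offsetBytes = BitConverter.GetBytes((Int16)dateTimeOffset.Offset.TotalHours);
    return Convert.ToBase64String(dateBytes.Concat(offsetBytes).ToArray());
}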

For the collection resources (e.g. orders vs. order) where there are a number of LastUpdated fields, all you need is Max(LastUpdated) and total count. You can combine these two values into a byte array buffer and serve the base64 representation directly or if you want it completely opaque, use its hash.
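As a rough illustration of the collection case, here is a minimal sketch that packs Max(LastUpdated) and the count into a buffer and returns either its base64 form or a hash of it. The class and method names are made up for this example and are not part of CacheCow, and SHA-256 is just one possible choice of hash.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, not part of CacheCow.
public static class CollectionETagSketch
{
    // 8 bytes of Max(LastUpdated) in UTC ticks followed by 4 bytes of item count.
    public static byte[] BuildBuffer(IEnumerable<DateTimeOffset> lastUpdatedValues)
    {
        var items = lastUpdatedValues.ToList();
        long maxTicks = items.Count == 0 ? 0L : items.Max(x => x.UtcTicks);
        return BitConverter.GetBytes(maxTicks)
            .Concat(BitConverter.GetBytes(items.Count))
            .ToArray();
    }

    // Serves the base64 representation of the buffer directly.
    public static string BuildCollectionETag(IEnumerable<DateTimeOffset> lastUpdatedValues)
    {
        return Convert.ToBase64String(BuildBuffer(lastUpdatedValues));
    }

    // Completely opaque variant: base64 of the SHA-256 hash of the same buffer.
    public static string BuildOpaqueCollectionETag(IEnumerable<DateTimeOffset> lastUpdatedValues)
    {
        using (var sha = SHA256.Create())
        {
            return Convert.ToBase64String(sha.ComputeHash(BuildBuffer(lastUpdatedValues)));
        }
    }
}
```

Adding an item changes the count, and updating an item moves Max(LastUpdated), so either kind of change produces a different value.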

You can implement ICacheResource interface on your view models (payloads you return from your actions) to extract the value and return TimedETag - it is a very simple interface with a single method.

ITimedETagQueryProvider

Generating TimedETag from the resources is all well and good but it means that we have to bring the resource all the way to the API layer to generate the TimedETag. This is completely acceptable if we then are serving the resource. But what if we are doing cache validation from a client asking to respond with the whole resource only if things have changed (and NotModified 304 status otherwise)? In that case, we might load the whole resource to find out it has not changed and waste computation effort - also the pressure on the backend systems stays regardless of whether resource was modified or not. Bear in mind, in this case there is still some saving on network and computing but surely we can do better than this.

The solution is to use ITimedETagQueryProvider to preemptively query your backend for the current status of the resource, i.e. its TimedETag. Usually getting that piece of information about the resource is much cheaper than returning the whole resource. For a single resource, you just need, for example, the timestamp field, and for a collection resource, Max(timestamp) and count - if you are using an RDBMS, all of this can be conveniently achieved in a single query, e.g. "SELECT COUNT(1), MAX(LastUpdated) FROM MyTable WHERE IsActive = 1".

Understanding trade-offs of various approaches

All of the above said, you do not necessarily have to implement ITimedETagExtractor and ITimedETagQueryProvider. In fact you can use CacheCow.Server out-of-the-box and it will fulfil all server-side caching duties. The point is that if you would like optimal performance, you have got to do a bit more work. The table below explains your various options and the benefits you get.

Table 1: CacheCow.Server - trade-offs and options

No need for storage anymore

CacheCow.Server 1.x needed some storage to keep the current TimedETag of the resource. Now that the TimedETag is generated or queried, there is no longer such a need. All solutions to do with EntityTagStore in CacheCow.Server have been removed from the repo.

Getting started with CacheCow.Server on ASP.NET Core MVC

Documentation in github is pretty clear, I believe, but for the sake of completeness I am bringing some of it here too. This covers the basic case with default implementations.

Essentially, all you need is a filter to decorate your actions, specifying the cache expiry duration in seconds. There are a bunch of other knobs but at this point, let's focus on the default scenario.

1. Add the package from nuget

In your package-manager console type below:

PM> install-package CacheCow.Server.Core.Mvc

2. Add CacheCow's default dependencies

public virtual void ConfigureServices(IServiceCollection services)
{
    ... // usual startup code
    services.AddHttpCachingMvc(); // add HTTP Caching for Core MVC
}

3. Decorate the action with the HttpCacheFactory filter

public class MyController : Controller
{
    [HttpGet]
    [HttpCacheFactory(300)]
    public IActionResult Get(int id)
    {
        ... // implementation
    }
}

Here we are defining the expiry to be 300 seconds (= 5 minutes). This means the client will cache the result for 5 minutes and after 5 minutes will keep asking if the resource has changed using conditional GET requests (see Part 2 for more info).

4. Check all is working

That should be all you need to be up and running. Now make a call to your API and you should see the Cache-Control header. You can use postman, fiddler or any other tool... you will basically see something like this:

Vary: Accept
ETag: "SPQT7RzH1QgBAAEAAAA="
Cache-Control: must-revalidate, max-age=300, private
x-cachecow-server: validation-applied=True;validation-matched=False;short-circuited=False;query-made=True
Date: Thu, 31 May 2018 17:35:42 GMT

As you can see, CacheCow has added Vary, ETag and Cache-Control. There is also a diagnostic header, x-cachecow-server, that explains what CacheCow has performed to generate the response.

Now you can test the conditional case by sending a GET request with the header below:

If-None-Match: "SPQT7RzH1QgBAAEAAAA="

And the server will respond with 304 if your resource has not changed.
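If you would rather check this from code than from postman or fiddler, a minimal sketch with a plain HttpClient could look like the following. The class name and the URL in the usage note are placeholders for this example, and the ETag value is the one from the sample response above.

```csharp
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Hypothetical helper for trying out a conditional GET by hand.
public static class ConditionalGetSketch
{
    // Builds a GET request carrying the validator in the If-None-Match header.
    // quotedETag must include the surrounding double quotes.
    public static HttpRequestMessage BuildConditionalGet(string url, string quotedETag)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.IfNoneMatch.Add(EntityTagHeaderValue.Parse(quotedETag));
        return request;
    }

    // Sends the request and reports whether the server answered 304 Not Modified.
    public static async Task<bool> IsNotModifiedAsync(HttpClient client, string url, string quotedETag)
    {
        using (var request = BuildConditionalGet(url, quotedETag))
        using (var response = await client.SendAsync(request))
        {
            return response.StatusCode == HttpStatusCode.NotModified;
        }
    }
}
```

Usage would be along the lines of await ConditionalGetSketch.IsNotModifiedAsync(client, "http://localhost:5000/api/car/1", "\"SPQT7RzH1QgBAAEAAAA=\""), where the URL is whatever your own action is bound to.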

More complex scenarios

Before we go into more details, it might be useful to go to CacheCow's github repo and review the ASP.NET Core MVC sample. Build and run it, play around and browse the code. This will make the discussions below closer to home as it details how to cater for various scenarios.

Table 1 (further above) is your guide in deciding which interface to implement.

Implementing ITimedETagExtractor or ICacheResource

As mentioned above, serialisation is a heavy-handed approach to generating TimedETag. While it is OK for low-to-mid level load, for high performance you would be best to either implement ICacheResource on your view models (what you return from your action) or, if you do not want your view models to depend on a caching library, implement ITimedETagExtractor to extract TimedETag from your view models.

If you implement ICacheResource, you do not have to register anything additionally but if you implement ITimedETagExtractor for your view models, you have to register them.

There are examples in the samples.

Implementing ITimedETagQueryProvider

By implementing ITimedETagQueryProvider, you protect your backend system so that cache validation can be achieved without bringing the view model all the way to the API layer to extract/generate TimedETag.

There are examples in the samples.

Dependency Injection and differentiation of ViewModels

Implementing ITimedETagQueryProvider or ITimedETagExtractor for different view models most likely involves different code. Since normally only a single implementation is registered against an interface, such an implementation would have to check the type and then apply the appropriate code, which breaks several programming principles.

You can use generic interfaces ITimedETagQueryProvider<TViewModel> and ITimedETagExtractor<TViewModel> to implement and then register. Then, in your filter, annotate the type of the view model. For example:

[HttpGet]
[HttpCacheFactory(0, ViewModelType = typeof(Car))]
public IActionResult Get(int id)
{
    var car = _repository.GetCar(id);
    return car == null
        ? (IActionResult)new NotFoundResult()
        : new ObjectResult(car);
}

This means that you have implemented ITimedETagExtractor<Car> and ITimedETagQueryProvider<Car> and registered them in your IoC.

You would be registering these in your application using extension methods in CacheCow (depending on which interfaces you have implemented):

public virtual void ConfigureServices(IServiceCollection services)
{
    ... // register stuff
    services.AddQueryProviderForViewModelMvc<TestViewModel, TestViewModelQueryProvider>();
    services.AddQueryProviderForViewModelMvc<IEnumerable<TestViewModel>, TestViewModelCollectionQueryProvider>();
}

Other options for registering implementations are: AddExtractorForViewModelMvc, AddSeparateDirectiveAndQueryProviderForViewModelMvc or AddDirectiveProviderForViewModelMvc. Some of these extension methods are essentially helpers that combine registration of multiple types.

Conclusions

CacheCow.Server now relies on the data and data providers to take part in TimedETag generation and cache validation instead of storing and maintaining TimedETag and making guesses about the cache validation. This removes the need for storage and makes CacheCow a reliable solution capable of providing caching with air-tight consistency.

ASP.NET Core's HTTP Caching features are a good start but they lack some fundamental features thus I advise you to use CacheCow.Server instead - although I cannot guarantee that my views as the creator of CacheCow could be free of bias - just try and see for yourself and pick what works for you.

Wednesday, 16 May 2018

CacheCow.Client 2.0: HTTP Caching for your API Calls

CacheCow 2.0 Series:

Part 1 - CacheCow 2.0 is here - supporting .NET Standard and ASP.NET Core MVC
Part 2 - CacheCow.Client 2.0 [This post]
Part 3 - CacheCow.Server 2.0: Using it on ASP.NET Core MVC
Part 4 - CacheCow.Server for ASP.NET Web API [Coming Soon]
Epilogue: side-learnings from supporting Core [Coming]

State of Client HTTP Caching in .NET

Before CacheCow, the only way to use HTTP caching was to use Windows/IE caching through WebRequestHandler as Darrel explains here. AFAIK, this class no longer exists in .NET Standard due to its tight coupling with Windows/IE implementations.

I set out to build a store-independent caching story in .NET around 6 years ago and named it CacheCow and after these years I am still committed to maintain that effort.

Apart from full-blown HTTP Caching, I had other ambitions in the beginning. For example, I had plans so you could limit caching per domain, etc. It became evident that this story is neither a critical feature nor possible in all storage techs. The underlying data structure requirement for cache storage is key-value, while this feature required more complex querying. I managed to implement it for some storages but it was never really used. That is why I no longer pursue this feature and it has been removed from CacheCow.Client 2.0. It is evident that, unlike browsers, virtually all HttpClient instances communicate with a handful of APIs, and storage in this day and age is hardly a problem.

CacheCow.Client Features

The features of CacheCow 2.0 are pretty much unchanged since 1.x, other than that it now supports .NET Standard 2.0+, hence you can use it in .NET Core and on platforms other than Windows (Linux/Mac).

In brief:

Supporting .NET 4.5.2+ and .NET Standard 2.0+
Store cacheable responses
Supports In-Memory and Redis storages - SQL is coming too (and easy to build your own)
Manage separate query/storage of representations according to server's Vary header
Validating GET calls to validate cache after expiry
Conditional PUT calls to modify a resource only if not changed since (can be turned off)
Exposing diagnostic x-cachecow-client header to inform of the caching result

Using CacheCow.Client is effortless and there are hardly any knobs to adjust - it hides away all the caching cruft that can get in your way of consuming an API efficiently.

CacheCow.Client has been created as a DelegatingHandler that needs to be added to the HttpClient's HTTP pipeline to intercept the calls. We will look at some typical use cases.

Basic Use Case

Let's imagine you have a service that needs to consume a cacheable resource and you are using HttpClient. Here are the steps to follow:

Add a Nuget dependency to CacheCow.Client

Use command-line or UI to add a dependency to CacheCow.Client version 2.x:

> install-package CacheCow.Client

Create an HttpClient

CacheCow.Client provides a helper method to create an HttpClient with caching enabled:

var client = ClientExtensions.CreateClient();

All this does is to create an HttpClient with CacheCow's CachingHandler added to the pipeline fronted by the HttpClientHandler.

You can pass the cache store (an implementation of ICacheStore) in an overload of this method but here we are going to use the default In-Memory store suitable for our use case.

Make two calls to the cacheable resource

Now we make a GET call to get a cacheable resource and then another call to get it again. From examining the CacheCow header we can ascertain that the second response came directly from the cache and never even hit the network.

const string CacheableResource = "https://code.jquery.com/jquery-3.3.1.slim.min.js";
var response = client.GetAsync(CacheableResource).
      ConfigureAwait(false).GetAwaiter().GetResult();
var responseFromCache = client.GetAsync(CacheableResource).
      ConfigureAwait(false).GetAwaiter().GetResult();
Console.WriteLine(response.Headers.GetCacheCowHeader().ToString()); 
// outputs "2.0.0.0;did-not-exist=true"
Console.WriteLine(responseFromCache.Headers.GetCacheCowHeader().ToString()); 
// outputs "2.0.0.0;did-not-exist=false;retrieved-from-cache=true"

Using alternative storages - Redis

If you have 10 boxes calling an API and they are using an In-Memory store, the response would have to be cached separately on each box and the origin server will be hit potentially 10 times. Also due to dispersion, usefulness of the cache is reduced and you will see lower cache hit ratio.

In high-throughput scenarios you would want to use a distributed cache such as Redis. CacheCow.Client 1.x used to support Azure Fabric Cache (discontinued by Microsoft), two versions of Memcached, SQL Server, ElasticSearch, MongoDB and even File. Starting with 2.x, new storages will be added only when they absolutely make sense. There is a plan to migrate the SQL Server storage but as for the others, there are currently no such plans. Having said that, it is very easy to implement your own and we will look into this further down in this post (I have chosen LMDB, a super fast file-based storage by Howard Chu).

For this case, we would like to use Redis storage. In case you do not have access to an instance of Redis, you can download (Mac/Linux or Windows) and run Redis locally without installation.

Add a dependency to Redis store package

After running your Redis (or perhaps creating one in the cloud), add a dependency to CacheCow.Client.RedisCacheStore:

> install-package CacheCow.Client.RedisCacheStore

Create an HttpClient with a Redis store

We use the ClientExtensions to create a client with a Redis store - here it connects to a local cache:

var client = ClientExtensions.CreateClient(new RedisStore("localhost"));

The CacheCow.Client.RedisCacheStore library uses StackExchange.Redis, the de facto Redis client library in .NET, hence it can accept a connection string according to StackExchange.Redis conventions as well as IDatabase, etc. to initialise the store.

Make two calls to a cacheable resource

Rest of the code is the same as with In-Memory scenario, making two HTTP calls to the same cacheable resource and observing the CacheCow headers - see above.

Cache Validation

Cacheable resources provide a validator so that the client can validate whether the version they have is still current. This was explained in the previous post, but essentially representation's ETag (or Last-Modified) header gets used to validate the cached resource with the server. CacheCow.Client already does this for you so you do not have to worry about it.

Another aspect of cache validation is on PUT calls, so that the resource gets modified only if it has not changed since you received it. This is essentially optimistic concurrency, which is beautifully implemented in HTTP using validators. CacheCow.Client does this by default but there is a property on CachingHandler if you need to turn it off. In case you wish to do so (or to change any other aspect of the CachingHandler), create the client without ClientExtensions:

var handler = new CachingHandler()
{
 InnerHandler = new HttpClientHandler(),
 UseConditionalPut = false
};

var c = new HttpClient(handler);
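For reference, this is roughly what a conditional PUT amounts to on the wire if you were to build it by hand without CacheCow. It is only a sketch of the HTTP mechanics (the If-Match header), not of CachingHandler's internals, and the class name is made up for this example.

```csharp
using System.Net.Http;
using System.Net.Http.Headers;

// Hypothetical helper showing a hand-built conditional PUT.
public static class ConditionalPutSketch
{
    // Builds a PUT that the server should apply only if the resource still has
    // the ETag we received earlier; otherwise the server answers 412.
    // quotedETag must include the surrounding double quotes.
    public static HttpRequestMessage BuildConditionalPut(string url, string quotedETag, HttpContent content)
    {
        var request = new HttpRequestMessage(HttpMethod.Put, url) { Content = content };
        request.Headers.IfMatch.Add(EntityTagHeaderValue.Parse(quotedETag));
        return request;
    }
}
```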

There are a bunch of other knobs provided for some edge cases so you can modify the default behaviour, but they are pretty self-explanatory and not worth going into in much detail. Just browse the public properties of CachingHandler; GitHub or StackOverflow is the best place to discuss if you have a question.

Supporting other storages - implementing ICacheStore for LMDB

CacheCow separates the storage from the HTTP caching functionality hence it is possible to plug-in your own storage with a few lines of code.

ICacheStore is a simple interface with 4 async methods:

public interface ICacheStore : IDisposable
{
    Task<HttpResponseMessage> GetValueAsync(CacheKey key);
    Task AddOrUpdateAsync(CacheKey key, HttpResponseMessage response);
    Task<bool> TryRemoveAsync(CacheKey key);
    Task ClearAsync();

}

LMDB is a lightning-fast database (as the name implies) that has a support in .NET, thanks to Cory Kaylor for his OSS project Lightning.NET. The project needs some more love and care fixing some of the build issues and updating to the latest frameworks but it is a great work.

This scenario is useful especially if you need a local persistent store.

The implementation is pretty straightforward and we use the Put, Get, Delete and TruncateDatabase methods of LightningTransaction to implement AddOrUpdateAsync, GetValueAsync, TryRemoveAsync and ClearAsync functionality. For Dispose, we just need to dispose the lightning environment.

Here is a pretty typical implementation:

using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using CacheCow.Client;
using CacheCow.Client.Headers;
using CacheCow.Common;
using LightningDB;

namespace CacheCow.Client.Lightning
{
    public class LightningStore : ICacheStore
    {
        private readonly LightningEnvironment _environment;
        private readonly string _databaseName;
        private readonly MessageContentHttpMessageSerializer _serializer = new MessageContentHttpMessageSerializer();
        
        public LightningStore(string path, string databaseName = "CacheCowClient")
        {
            _environment = new LightningEnvironment(path);
            _environment.MaxDatabases = 1;
            _environment.Open();
            _databaseName = databaseName;
        }

        public async Task AddOrUpdateAsync(CacheKey key, HttpResponseMessage response)
        {
            var ms = new MemoryStream();
            await _serializer.SerializeAsync(response, ms);
            using (var tx = _environment.BeginTransaction())
            using (var db = tx.OpenDatabase(_databaseName, 
                new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                tx.Put(db, key.Hash, ms.ToArray());
                tx.Commit();
            }
        }

        public Task ClearAsync()
        {
            using (var tx = _environment.BeginTransaction())
            using (var db = tx.OpenDatabase(_databaseName,
                new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                tx.TruncateDatabase(db);
                tx.Commit();
            }

            return Task.CompletedTask;
        }

        public void Dispose()
        {
            _environment.Dispose();
        }

        public async Task<HttpResponseMessage> GetValueAsync(CacheKey key)
        {
            using (var tx = _environment.BeginTransaction())
            using (var db = tx.OpenDatabase(_databaseName,
                new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                var data = tx.Get(db, key.Hash);
                if (data == null || data.Length == 0)
                    return null;
                var ms = new MemoryStream(data);
                return await _serializer.DeserializeToResponseAsync(ms);
            }
        }

        public Task<bool> TryRemoveAsync(CacheKey key)
        {
            using (var tx = _environment.BeginTransaction())
            using (var db = tx.OpenDatabase(_databaseName,
                new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                tx.Delete(db, key.Hash);
                tx.Commit();
            }

            return Task.FromResult(true);
        }                
    }
}

Conclusion

CacheCow.Client is simple and straightforward to get started with. It supports In-Memory and Redis storages, and a storage of your choice can be plugged in with a handful of lines of code - here we demonstrated that for LMDB. It is capable of carrying out GET and PUT validation, making your client more efficient and your data more consistent.

In the next post, we will look into Server scenarios in ASP.NET Core MVC.

Sunday, 13 May 2018

CacheCow 2.0 is here - now supporting .NET Standard and ASP.NET Core MVC

CacheCow 2.0 Series:

Part 1 - CacheCow 2.0 is here - supporting .NET Standard and ASP.NET Core MVC [This post]
Part 2 - CacheCow.Client 2.0: HTTP Caching for your API calls
Part 3 - CacheCow.Server 2.0: Using it on ASP.NET Core MVC
Part 4 - CacheCow.Server for ASP.NET Web API [Coming Soon]
Epilogue: side-learnings from supporting Core [Coming]

So, no CacheCore in the end!

Yeah. I did announce last year that the new updated CacheCow will live under the name CacheCore. The more I worked on it, the more it became evident that only a tiny amount of CacheCow will ever be Core-related. And frankly trends come and go, while HTTP Caching is pretty much unchanged for the last 20 years.

So the name CacheCow lives on, although in the end what matters for a library is whether it can solve any of your problems. I hope it does and carries on doing so. Now you can use CacheCow.Client with .NET 4.5.2+ and .NET Standard 2.0+. CacheCow.Server also supports both Web API and ASP.NET Core MVC - and possibly Nancy soon!

CacheCow 2.0 has lots of documentation and the project now has 3 sample projects covering both client and server sides in the same project.

CacheCow.Server has changed radically

The design for the server side of CacheCow 0.x and 1.x was based on the assumption that your API is a pure RESTful API and the data only changes through calling its endpoints, so the API layer gets to see all changes to its underlying resources. The more I explored over the years, the more this turned out to be a pretty big assumption, realistic only in the REST La La Land - a big learning for me. And even if the assumption holds, the relationship between resources turned server-side cache directive management into a mammoth task. For example, in the familiar scenario of customer-product-orders, if an order changes, the cache for the collection of orders is now invalidated - hence the API needs to understand which resource is a collection of which. What is more, a change in customer could change the order data (depending on implementation of course, but just take it for the sake of argument). So it meant that the API now has to know a lot more: single vs collection resources, relationship between resources... it was a slippery slope to a very bad place.

With that assumption removed, the responsibility now lies with the back-end stores which provide data for the resources - they will be queried by a couple of constructs added to CacheCow.Server. If you opt to implement that part for your API, then you have a super-efficient API. If not, there are some defaults there to do the work for you - although sub-optimal. All of this will be explained in the CacheCow.Server posts, but the point is CacheCow.Server is now a clean abstraction for HTTP Caching, as clean as I could make it. Judge for yourself.

What is HTTP Caching?

Caching is a very familiar notion in programming and pretty much every developer uses it on a regular basis. This familiarity has a downside to it, since HTTP Caching is more complex and in many ways different from the routine caching in code - hence it is very common to see misunderstandings even amongst senior developers. If you ask an average developer this question: "In HTTP Caching, where does the cached data get stored?" you are probably more likely to hear the wrong answer "server" than the correct answer "client". In fact, many developers are looking to improve their server-side code's performance by turning on caching on the server, while if the callers ignore the caching directives it will not result in any benefit.

This reminds me of a blog post I wrote 6 years ago where I used HTTP Caching as an example of mixed-concern (as opposed to server-concern or client-concern) where "For HTTP caching to work, client and server need to work in tandem". This is a key difference from the usual caching scenarios seen every day. What makes HTTP Caching even more complex is the concurrency primitives, built in starting with HTTP 1.1 - we will look into those below.

I know HTTP Caching is hardly new and has been explained many times before. But considering the number of times I have seen it completely misunderstood, I think it deserves 5-10 minutes of your time - even if only as a refresher.

Resources vs. Representations

REST advocates exposing services through a uniform API (where HTTP is one such implementation) allowing resources to be created, modified and queried using the API. A resource is addressed by its location identifier or URL (e.g. /api/car/123). When a client requests a resource, only a representation of the resource is sent back. This means that the client receives only one representation out of many possible representations. This also means that when the client caches the representation, the cached copy is only valid if the representation requested matches the one cached. And finally, a client might cache different representations of the same resource. But what does all of this mean?

HTTP GET - The server serving a representation of the resource. The server also sends cache directives.

A resource could be represented differently in terms of format, encoding, language and other presentation concerns. HTTP provides semantic for the client to express its preferences in such concerns with headers such as Accept, Accept-Language and Accept-Encoding. There could be other headers that can result in alternative representations. The server is responsible for returning the definitive list of such headers in the Vary header.
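To make the role of Vary a bit more concrete, here is a minimal sketch of how a client-side cache might derive a separate key per representation from the URL plus the request headers named in Vary. This illustrates the idea only and is not how CacheCow's CacheKey is computed; all names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of keying cached representations by Vary.
public static class VaryKeySketch
{
    // requestHeaders is expected to use a case-insensitive comparer,
    // since HTTP header names are case-insensitive.
    public static string BuildCacheKey(
        string url,
        IEnumerable<string> varyHeaderNames,
        IDictionary<string, string> requestHeaders)
    {
        var parts = varyHeaderNames
            .Select(name => name.Trim().ToLowerInvariant())
            .OrderBy(name => name, StringComparer.Ordinal)
            .Select(name =>
            {
                string value;
                // A header the client did not send contributes an empty value.
                return name + "=" + (requestHeaders.TryGetValue(name, out value) ? value : "");
            });
        return url + "|" + string.Join("|", parts);
    }
}
```

With Vary: Accept, a request for application/json and one for application/xml to the same URL end up under different keys, so the two representations are cached side by side.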

Cache Directives

The server is responsible for returning cache directives along with the representation. The Cache-Control header is the de facto cache directive, defining whether the representation can be cached, for how long, whether by the end client or also by the HTTP intermediaries/proxies, etc. HTTP 1.0 had the simple Expires header, which only defined the absolute expiry time of the representation.
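As a small illustration of reading these directives on the client, the sketch below parses a Cache-Control value with CacheControlHeaderValue from System.Net.Http and decides whether a cached copy of a given age can still be used without validation. The freshness rule here is deliberately simplified (it only looks at max-age, no-store and no-cache), and the class name is made up for this example.

```csharp
using System;
using System.Net.Http.Headers;

// Hypothetical, simplified freshness check based on Cache-Control.
public static class CacheControlSketch
{
    public static bool IsFresh(string cacheControlHeader, TimeSpan age)
    {
        var cacheControl = CacheControlHeaderValue.Parse(cacheControlHeader);

        // no-store: must not be cached; no-cache: must be validated before every use.
        if (cacheControl.NoStore || cacheControl.NoCache)
            return false;

        // Fresh only while the age of the cached copy is below max-age.
        return cacheControl.MaxAge.HasValue && age < cacheControl.MaxAge.Value;
    }
}
```

For a response carrying "must-revalidate, max-age=300, private", a copy that is 299 seconds old is still fresh, while one that is 301 seconds old has to be validated with the server first.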

You could also think of other cache-related headers as cache directives (although purely speaking they are not) such as ETag, Last-Modified and Vary.

Resource Version Identifiers (Validators)

HTTP 1.1 defines ETag as an opaque identifier which defines the version of the resource. ETag (or EntityTag) can be strong or weak. Normally a strong ETag identifies version of the representation while a weak ETag is only at the resource level.

Last-Modified header was the main validator in HTTP 1.0 but since it is based on a date with up-to-a-second precision, it is not suitable for achieving high consistency since a resource could change multiple times in a second.
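The precision problem is easy to see in code. In this minimal sketch (class name made up for this example), two updates half a second apart produce the same HTTP date, so a client validating with If-Modified-Since could not tell them apart.

```csharp
using System;
using System.Globalization;

// Hypothetical illustration of the one-second precision of HTTP dates.
public static class HttpDateSketch
{
    // Formats a timestamp as an HTTP date (RFC 1123 pattern), dropping sub-second precision.
    public static string ToHttpDate(DateTimeOffset value)
    {
        return value.ToString("R", CultureInfo.InvariantCulture);
    }

    // True when two distinct timestamps collapse into the same Last-Modified value.
    public static bool CollideAsLastModified(DateTimeOffset first, DateTimeOffset second)
    {
        return first != second && ToHttpDate(first) == ToHttpDate(second);
    }
}
```

For example, 17:35:42.100 and 17:35:42.600 on 31 May 2018 (UTC) both become "Thu, 31 May 2018 17:35:42 GMT".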

CacheCow supports both validators (ETag and Last-Modified) and combines these two notions in the construct TimedETag.

Validating (conditional) HTTP Calls

A GET call can ask the server for the resource on the condition that the resource has been modified with respect to its validator. In this case, the client sends ETag(s) in the If-None-Match header or the Last-Modified date in the If-Modified-Since header. If the validator still matches and no change was made, the server returns status 304 (Not Modified); otherwise the resource is sent back.

For a PUT (and DELETE) call, the client sends validators in If-Match or If-Unmodified-Since. The server performs the action if validation matches; otherwise status 412 (Precondition Failed) is sent back.
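The server-side decision for both kinds of validating call can be sketched as below. This is a simplified illustration and not how CacheCow implements it: ETags are compared as plain strings, the distinction between strong and weak comparison is ignored, and date-based validators are left out. The ConditionalRequests class and its method names are assumptions.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ConditionalRequests
{
    // GET with If-None-Match: 304 if the client already holds the current version, else 200.
    public static int EvaluateGet(string currentETag, IEnumerable<string> ifNoneMatch)
    {
        bool clientIsUpToDate = ifNoneMatch.Any(tag => tag == "*" || tag == currentETag);
        return clientIsUpToDate ? 304 : 200;
    }

    // PUT/DELETE with If-Match: perform the action (200 here) only if the
    // client's version is still current, else 412.
    public static int EvaluateModification(string currentETag, IEnumerable<string> ifMatch)
    {
        bool preconditionHolds = ifMatch.Any(tag => tag == "*" || tag == currentETag);
        return preconditionHolds ? 200 : 412;
    }
}
```

Note the symmetry: a matching ETag short-circuits a GET (nothing to send) but is exactly what allows a PUT or DELETE to proceed.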

Consistency

The client normally keeps cached representations beyond their expiry. After the expiry it resorts to validating calls and, if they succeed, it can carry on using the cached representations.

In fact the server can return representations with immediate expiry, forcing the client to validate every time before using the cached resource. This scenario can be called High-Consistency caching since it ensures the client always uses the most recent version.
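From the client's point of view, the choice between using the cache directly and validating first comes down to comparing the age of the cached representation with its lifetime. The sketch below assumes the lifetime comes from max-age; the CacheDecision enum and Freshness class are hypothetical names and do not reflect CacheCow.Client internals.

```csharp
using System;

public enum CacheDecision { UseCached, ValidateFirst }

public static class Freshness
{
    // Decides whether a cached representation can be used as-is
    // or must be validated with the server first.
    public static CacheDecision Decide(DateTimeOffset storedAt, TimeSpan maxAge, DateTimeOffset now)
    {
        TimeSpan age = now - storedAt;

        // With maxAge == TimeSpan.Zero the age is never below the lifetime,
        // so every use triggers a validating call (High-Consistency caching).
        return age < maxAge ? CacheDecision.UseCached : CacheDecision.ValidateFirst;
    }
}
```

With a lifetime of zero the client never serves from cache without asking the server, yet it still saves bandwidth whenever the server answers 304.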

Is HTTP Caching suitable for my scenario?

Consider using HTTP Caching if:

Both your client and server are cache-aware. The client either is a browser which is the ultimate HTTP machine well capable of handling cache directives or a client that understands caching such as HttpClient + CacheCow.Client.
You need a High-Consistency caching and you cannot afford clients to use outdated data
Saving on network bandwidth is important

HTTP Caching is unsuitable for you if:

Your client does not understand/implement HTTP caching
The server is unable to provide cache directives

In the next post, we will look into CacheCow.Client.

Monday, 19 March 2018

Business and Log Events, Azure EventHub and Psyfon

TLDR; If you need to send a large number of events to Azure EventHub from a .NET process or passthru API, consider using psyfon.

Over the last two decades, many businesses have transformed themselves and modeled their processes and operations as software (bespoke or customising off-the-shelf products). These systems would turn business processes and transactions into data that can be stored, queried or exchanged - ROI for such data is very high and the challenges of building/evolving such systems have been widely known. These systems typically generate business events.

Businesses have been turning their attention to the next goal: capturing (and analysing in near real-time) the information that is commonly not considered valuable data, such as minute user interactions with sites/apps down to the level of mouse movements and scrolls, sensor outputs in vehicles or factories, CCTV streams from municipal cameras to predict/forecast traffic, shopper interactions/behaviour in supermarkets to gain insight/provide recommendations, etc. These systems generate what I - for better or worse - call log events, which I have explained in the past here, but it would be useful to re-cap their differences with business events in the table below.

While log events could have been historically stored and then analysed in batch mode, there is a growing need to make some sense of the data in real-time in addition to in-depth analysis in offline mode. That is essentially stream processing.

Stream processing is hard. Building resilient processes that can reliably process tons of data in parallel while handling back-pressure, point failures and peaks of activity - all with a few seconds or even sub-second latency - is not trivial. There are such systems already available such as Apache Flink, Kafka Streams or Spark Streaming. These systems typically work on top of an Event Store such as Kafka or Azure EventHubs.

Azure EventHub has been built for publishing and consuming events at high-scale. The design is not dissimilar to that of Kafka: a replicated/Highly-Available log per arbitrary (but constant) number of partitions where ordering can be guaranteed only at the partition level. You can read from the beginning of the log or from any point in the stream but remembering where you last read events from (checkpointing) is completely left to the consumers.

Typically only a single consumer is meant to read from a partition, hence having more partitions is important for improving scalability. Publishing is much more relaxed: a high number of producers can send events to EventHub.

How does EventHub assign events to partitions? You can optionally send a Partition Key which gets hashed and used for assigning to partitions. To make sure you get the best out of your system, the Partition Key needs to be evenly distributed. If you are sending device events, you would most likely use the DeviceId. For customer events, Customer ID is a natural choice. This will ensure all events for a device or customer are ordered according to the time they arrive at the EventHub.

Usually there are data pipelines that receive and funnel the incoming data (usually through a passthru API) to these stores, but the key point is that these data pipelines exhibit the same challenges as stream processing. While initially exposing EventHub directly to the outside world was advocated by Microsoft, you would most likely want to hide your EventHub behind a passthru API that does authentication and optimises delivery of the events to the EventHub by batching. This layer is also useful to handle back-pressure by buffering events so you can deal with spikes gracefully. One thing you cannot do here at the API is to keep the caller waiting for the event to be successfully committed to the EventHub, for a few reasons. First of all, EventHub can sometimes have latency in the order of 100-150ms. While this is completely acceptable for most purposes (other than High-Frequency Trading!), keeping clients waiting means more power consumption for publishers, many of which are phones and other low-power devices sending many events per hour. Another reason is that EventHub works best if you send events in batches, which means waiting until your buffer is full and then committing the batch of events.

Batching is already supported built-in with the EventHub:

var batch = new EventDataBatch("myPartitionKey");
batch.TryAdd(eventData); // keep adding until method returns false 
await client.SendAsync(batch);

But did you notice something? All events within a batch must have the same partition key. While it is understandable that Microsoft made this decision for performance reasons - since all such events will be sent to the same partition, otherwise the batch has to wait for all partitions involved to respond successfully - it essentially renders batching remarkably less useful. As said earlier, the Partition Key must have widely diverse values such as Customer ID or Device ID. There is no guarantee that an event arriving from a customer at the API is followed by enough events from the same customer in a reasonable amount of time to fill the batch and make batching worthwhile - let alone those events arriving at exactly the same web-head.

The solution is essentially to send the events directly to partitions. But the hashing takes place at the EventHub, so how could we know which Partition Key gets allocated to which partition? This implementation is opaque and it is not possible to reproduce it outside EventHub. That is why we have to hash the Partition Keys ourselves and send batches directly to the partitions. All we need is a hashing algorithm capable of uniformly hashing Partition Keys across partitions. It turns out that most hashing algorithms including MD5 can easily achieve this, although some might be cryptographically broken. MD5 is a very quick and efficient algorithm, hence it is a good fit.
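A minimal sketch of this idea is shown below: hash the Partition Key with MD5 and map the result onto the number of partitions. It uses only System.Security.Cryptography from the base class library. The PartitionHasher class, its method name and the choice of taking the first four bytes of the hash are assumptions for illustration; Psyfon's actual implementation may differ.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class PartitionHasher
{
    // Maps a partition key to a partition index in the range [0, partitionCount).
    public static int GetPartitionIndex(string partitionKey, int partitionCount)
    {
        if (partitionKey == null) throw new ArgumentNullException(nameof(partitionKey));
        if (partitionCount <= 0) throw new ArgumentOutOfRangeException(nameof(partitionCount));

        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(partitionKey));

            // Take the first 4 bytes of the 16-byte digest as an unsigned integer
            // and reduce it modulo the number of partitions.
            uint value = BitConverter.ToUInt32(hash, 0);
            return (int)(value % (uint)partitionCount);
        }
    }
}
```

The same key always maps to the same partition, so per-device or per-customer ordering is preserved, while events with different keys that land on the same partition can share a batch.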

Psyfon

Now, all of what I have said so far - batching, buffering and hashing - has been implemented in an Open Source project called Psyfon. Using this library, which supports both .NET Standard 2.0 and .NET Framework 4.5.2, all you have to do is create a single instance of BufferingEventDispatcher per process, start it and add events to it:

var singletonDispatcher = new BufferingEventDispatcher("<connection string>");
singletonDispatcher.Start();

// and somewhere else in the code where events generated
var ed = new EventData(mySerialisedEventAsByteArray);
singletonDispatcher.Add(ed);

You can set a maximum byte size (according to the size of your events) and a maximum number of seconds before committing the batches; whichever limit is reached first triggers the batch being committed to the partition. I have tested it under high scale and a single process had no issue sending 5000 events per second (EPS) to EventHub. I will be publishing the results of a more extended test soon.
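The "whichever limit is reached first" rule can be sketched as a small per-partition buffer. This is not Psyfon's code: the PartitionBuffer class, its members and the idea of passing the current time in explicitly (to keep the sketch deterministic) are all assumptions for illustration, and it is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

public sealed class PartitionBuffer
{
    private readonly int _maxBytes;
    private readonly TimeSpan _maxAge;
    private readonly List<byte[]> _events = new List<byte[]>();
    private int _bytes;
    private DateTimeOffset _firstAddedAt;

    public PartitionBuffer(int maxBytes, TimeSpan maxAge)
    {
        _maxBytes = maxBytes;
        _maxAge = maxAge;
    }

    public int Count => _events.Count;

    // Adds an event payload; the clock starts when the first event enters an empty buffer.
    public void Add(byte[] payload, DateTimeOffset now)
    {
        if (_events.Count == 0) _firstAddedAt = now;
        _events.Add(payload);
        _bytes += payload.Length;
    }

    // True when either limit is hit: total bytes, or age of the oldest buffered event.
    public bool ShouldCommit(DateTimeOffset now) =>
        _events.Count > 0 && (_bytes >= _maxBytes || now - _firstAddedAt >= _maxAge);

    // Hands over the buffered events as one batch and resets the buffer.
    public IReadOnlyList<byte[]> Drain()
    {
        byte[][] batch = _events.ToArray();
        _events.Clear();
        _bytes = 0;
        return batch;
    }
}
```

A dispatcher would keep one such buffer per partition, check ShouldCommit on a timer and after each Add, and send the drained batch to the corresponding partition.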

While my use case was a passthru API, this can be equally used for dispatching monitoring and instrumentation events to the EventHub. PerfIt, another Open Source library, will benefit from this very soon - watch this space.