Wednesday 16 May 2018

CacheCow.Client 2.0: HTTP Caching for your API Calls

State of Client HTTP Caching in .NET

Before CacheCow, the only way to use HTTP caching was to use Windows/IE caching through WebRequestHandler as Darrel explains here. AFAIK, this class no longer exists in .NET Standard due to its tight coupling with Windows/IE implementations.

I set out to build a store-independent caching story in .NET around 6 years ago and named it CacheCow. After all these years, I am still committed to maintaining that effort.

Apart from full-blown HTTP Caching, I had other ambitions in the beginning. For example, I planned to let you limit caching per domain. It became evident that this feature is neither critical nor possible in all storage techs: the underlying data structure required for cache storage is key-value, while this feature required more complex querying. I managed to implement it for some storages, but it was never really used. That is why I no longer pursue this feature, and it has been removed from CacheCow.Client 2.0. Unlike browsers, virtually all HttpClient instances communicate with a handful of APIs, and storage in this day and age is hardly a problem.

CacheCow.Client Features

The features of CacheCow 2.0 are pretty much unchanged since 1.x, other than that it now supports .NET Standard 2.0+, hence you can use it in .NET Core and on platforms other than Windows (Linux/Mac).

In brief:
  • Supports .NET 4.5.2+ and .NET Standard 2.0+
  • Stores cacheable responses
  • Supports In-Memory and Redis storages - SQL is coming too (and it is easy to build your own)
  • Manages separate query/storage of representations according to the server's Vary header
  • Makes validating GET calls to validate the cache after expiry
  • Makes conditional PUT calls to modify a resource only if it has not changed since (can be turned off)
  • Exposes a diagnostic x-cachecow-client header to inform you of the caching result
Using CacheCow.Client is effortless and there are hardly any knobs to adjust - it hides away all the caching cruft that can get in your way of consuming an API efficiently.

CacheCow.Client has been created as a DelegatingHandler that needs to be added to the HttpClient's HTTP pipeline to intercept the calls. We will look at some typical use cases.
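
To make the DelegatingHandler idea concrete, here is a minimal sketch of a handler sitting in an HttpClient pipeline. It is not CacheCow code: HeaderStampingHandler, StubServerHandler and the x-sketch-handler header are hypothetical names made up for illustration, and the stub stands in for the network so the sketch runs offline.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler for illustration only; it is not part of CacheCow.
public class HeaderStampingHandler : DelegatingHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // A caching handler could consult its store here and return early.
        var response = await base.SendAsync(request, cancellationToken)
            .ConfigureAwait(false);

        // On the way back, the handler can inspect or annotate the response.
        response.Headers.Add("x-sketch-handler", "seen");
        return response;
    }
}

// Hypothetical stand-in for the network: always answers 200 OK.
public class StubServerHandler : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("hello")
        };
        return Task.FromResult(response);
    }
}

public static class PipelineDemo
{
    public static async Task<string> RunAsync()
    {
        // The delegating handler wraps the inner handler, so every call passes through it.
        var handler = new HeaderStampingHandler { InnerHandler = new StubServerHandler() };
        using (var client = new HttpClient(handler))
        {
            var response = await client.GetAsync("http://example.invalid/resource");
            return string.Join(",", response.Headers.GetValues("x-sketch-handler"));
        }
    }

    public static void Main()
    {
        Console.WriteLine(RunAsync().GetAwaiter().GetResult()); // prints "seen"
    }
}
```

Every request and response flows through the outer handler, which is what lets a caching handler answer from its store without the caller changing how it uses HttpClient.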

Basic Use Case

Let's imagine you have a service that needs to consume a cacheable resource and you are using HttpClient. Here are the steps to follow:

Add a Nuget dependency to CacheCow.Client

Use command-line or UI to add a dependency to CacheCow.Client version 2.x:
> install-package CacheCow.Client

Create an HttpClient 

CacheCow.Client provides a helper method to create an HttpClient with caching enabled:
var client = ClientExtensions.CreateClient();
All this does is create an HttpClient with CacheCow's CachingHandler added to the pipeline in front of the HttpClientHandler.

You can pass the cache store (an implementation of ICacheStore) in an overload of this method but here we are going to use the default In-Memory store suitable for our use case.

Make two calls to the cacheable resource

Now we make a GET call to get a cacheable resource and then another call to get it again. By examining the CacheCow header we can ascertain that the second response came directly from the cache and never even hit the network.
const string CacheableResource = "https://code.jquery.com/jquery-3.3.1.slim.min.js";
var response = client.GetAsync(CacheableResource).
      ConfigureAwait(false).GetAwaiter().GetResult();
var responseFromCache = client.GetAsync(CacheableResource).
      ConfigureAwait(false).GetAwaiter().GetResult();
Console.WriteLine(response.Headers.GetCacheCowHeader().ToString()); 
// outputs "2.0.0.0;did-not-exist=true"
Console.WriteLine(responseFromCache.Headers.GetCacheCowHeader().ToString()); 
// outputs "2.0.0.0;did-not-exist=false;retrieved-from-cache=true"
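
If you want to act on the diagnostic header in a test or a log filter, the string form shown above is easy to pick apart. The following is only a sketch that parses the printed text; ParseFlags is a hypothetical helper, not a CacheCow API, and it assumes the "version;name=value;..." shape seen in the two outputs above.

```csharp
using System;
using System.Collections.Generic;

public static class CacheCowHeaderSketch
{
    // Hypothetical helper: splits "2.0.0.0;name=value;..." into name/value flags.
    // The first segment is assumed to be the version and is skipped.
    public static Dictionary<string, string> ParseFlags(string headerValue)
    {
        var flags = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        var parts = headerValue.Split(';');
        for (var i = 1; i < parts.Length; i++)
        {
            var pair = parts[i].Split(new[] { '=' }, 2);
            if (pair.Length == 2)
                flags[pair[0].Trim()] = pair[1].Trim();
        }
        return flags;
    }

    public static void Main()
    {
        var flags = ParseFlags("2.0.0.0;did-not-exist=false;retrieved-from-cache=true");
        Console.WriteLine(flags["retrieved-from-cache"]); // prints "true"
    }
}
```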

Using alternative storages - Redis

If you have 10 boxes calling an API and they are using an In-Memory store, the response would have to be cached separately on each box and the origin server would potentially be hit 10 times. Also, due to this dispersion, the usefulness of the cache is reduced and you will see a lower cache hit ratio.

In high-throughput scenarios you would want to use a distributed cache such as Redis. CacheCow.Client 1.x used to support Azure Fabric Cache (discontinued by Microsoft), two versions of Memcached, SQL Server, ElasticSearch, MongoDB and even File. Starting with 2.x, new storages will be added only when they absolutely make sense. There is a plan to migrate the SQL Server storage, but as for the others, there are currently no such plans. Having said that, it is very easy to implement your own and we will look into this further down in this post (I have chosen LMDB, a super fast file-based storage by Howard Chu).

For this case, we would like to use Redis storage. In case you do not have access to an instance of Redis, you can download (Mac/Linux or Windows) and run Redis locally without installation.

Add a dependency to Redis store package

After running your Redis (or perhaps creating one in the cloud), add a dependency to CacheCow.Client.RedisCacheStore:
> install-package CacheCow.Client.RedisCacheStore

Create an HttpClient with a Redis store

We use the ClientExtensions to create a client with a Redis store - here it connects to a local cache:

var client = ClientExtensions.CreateClient(new RedisStore("localhost")); 

The CacheCow.Client.RedisCacheStore library uses StackExchange.Redis, the de facto Redis client library in .NET, hence it can accept a connection string according to StackExchange.Redis conventions, as well as an IDatabase, etc., to initialise the store.

Make two calls to a cacheable resource

The rest of the code is the same as in the In-Memory scenario: making two HTTP calls to the same cacheable resource and observing the CacheCow headers - see above.

Cache Validation

Cacheable resources provide a validator so that the client can validate whether the version it has is still current. This was explained in the previous post, but essentially the representation's ETag (or Last-Modified) header gets used to validate the cached resource with the server. CacheCow.Client already does this for you so you do not have to worry about it.

Another aspect of cache validation is on PUT calls, so that the resource gets modified only if it has not changed since you received it. This is essentially optimistic concurrency, which is beautifully implemented in HTTP using validators. CacheCow.Client does this by default but there is a property on CachingHandler if you need to turn it off. If you wish to do so (or to change any other aspect of the CachingHandler), create the client without ClientExtensions:
var handler = new CachingHandler()
{
 InnerHandler = new HttpClientHandler(),
 UseConditionalPut = false
};

var c = new HttpClient(handler);
There are a bunch of other knobs provided for edge cases so you can modify the default behaviour, but they are pretty self-explanatory and not worth going into in much detail. Just browse the public properties of CachingHandler; GitHub or StackOverflow is the best place to discuss if you have a question.

Supporting other storages - implementing ICacheStore for LMDB

CacheCow separates the storage from the HTTP caching functionality, hence it is possible to plug in your own storage with a few lines of code.

ICacheStore is a simple interface with 4 async methods:
public interface ICacheStore : IDisposable
{
    Task<HttpResponseMessage> GetValueAsync(CacheKey key);
    Task AddOrUpdateAsync(CacheKey key, HttpResponseMessage response);
    Task<bool> TryRemoveAsync(CacheKey key);
    Task ClearAsync();

}
LMDB is a lightning-fast database (as the name implies) that is supported in .NET thanks to Cory Kaylor's OSS project Lightning.NET. The project needs some more love and care, fixing some of the build issues and updating to the latest frameworks, but it is great work.

This scenario is useful especially if you need a local persistent store.

The implementation is pretty straightforward: we use the Put, Get, Delete and TruncateDatabase methods of LightningTransaction to implement the AddOrUpdateAsync, GetValueAsync, TryRemoveAsync and ClearAsync functionality. For Dispose, we just need to dispose the lightning environment.

Here is a pretty typical implementation:
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using CacheCow.Client;
using CacheCow.Client.Headers;
using CacheCow.Common;
using LightningDB;

namespace CacheCow.Client.Lightning
{
    public class LightningStore : ICacheStore
    {
        private readonly LightningEnvironment _environment;
        private readonly string _databaseName;
        private readonly MessageContentHttpMessageSerializer _serializer = new MessageContentHttpMessageSerializer();
        
        public LightningStore(string path, string databaseName = "CacheCowClient")
        {
            _environment = new LightningEnvironment(path);
            _environment.MaxDatabases = 1;
            _environment.Open();
            _databaseName = databaseName;
        }

        public async Task AddOrUpdateAsync(CacheKey key, HttpResponseMessage response)
        {
            var ms = new MemoryStream();
            await _serializer.SerializeAsync(response, ms);
            using (var tx = _environment.BeginTransaction())
            using (var db = tx.OpenDatabase(_databaseName, 
                new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                tx.Put(db, key.Hash, ms.ToArray());
                tx.Commit();
            }
        }

        public Task ClearAsync()
        {
            using (var tx = _environment.BeginTransaction())
            using (var db = tx.OpenDatabase(_databaseName,
                new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                tx.TruncateDatabase(db);
                tx.Commit();
            }

            return Task.CompletedTask;
        }

        public void Dispose()
        {
            _environment.Dispose();
        }

        public async Task<HttpResponseMessage> GetValueAsync(CacheKey key)
        {
            using (var tx = _environment.BeginTransaction())
            using (var db = tx.OpenDatabase(_databaseName,
                new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                var data = tx.Get(db, key.Hash);
                if (data == null || data.Length == 0)
                    return null;
                var ms = new MemoryStream(data);
                return await _serializer.DeserializeToResponseAsync(ms);
            }
        }

        public Task<bool> TryRemoveAsync(CacheKey key)
        {
            using (var tx = _environment.BeginTransaction())
            using (var db = tx.OpenDatabase(_databaseName,
                new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                tx.Delete(db, key.Hash);
                tx.Commit();
            }

            return Task.FromResult(true);
        }                
    }
}

Conclusion

CacheCow.Client is simple and straightforward to get started with. It supports In-Memory and Redis storages, and a storage of your choice can be plugged in with a handful of lines of code - here we demonstrated that for LMDB. It is capable of carrying out GET and PUT validation, making your client more efficient and your data more consistent.

In the next post, we will look into Server scenarios in ASP.NET Core MVC.

Sunday 13 May 2018

CacheCow 2.0 is here - now supporting .NET Standard and ASP.NET Core MVC


So, no CacheCore in the end!

Yeah. I did announce last year that the new updated CacheCow would live under the name CacheCore. The more I worked on it, the more it became evident that only a tiny amount of CacheCow will ever be Core-related. And frankly trends come and go, while HTTP Caching has been pretty much unchanged for the last 20 years.

So the name CacheCow lives on, although in the end what matters for a library is whether it can solve any of your problems. I hope it does and carries on doing so. Now you can use CacheCow.Client with .NET 4.5.2+ and .NET Standard 2.0+. CacheCow.Server also supports both Web API and ASP.NET Core MVC - and possibly Nancy soon!

CacheCow 2.0 has lots of documentation and the project now has 3 sample projects covering both client and server sides in the same project.

CacheCow.Server has changed radically

The design of the server side of CacheCow 0.x and 1.x was based on the assumption that your API is a pure RESTful API and that the data changes only through calls to its endpoints, so the API layer gets to see all changes to its underlying resources. The more I explored over the years, the more this turned out to be a pretty big assumption, realistic only in REST La La Land - a big learning for me. And even when the assumption holds, the relationships between resources turned server-side cache directive management into a mammoth task. For example, in the familiar scenario of customer-product-orders, if an order changes, the cache for the collection of orders is now invalidated - hence the API needs to understand which resource is a collection of which. What is more, a change in a customer could change the order data (depending on the implementation of course, but just take it for the sake of argument). So the API now has to know a lot more: single vs collection resources, relationships between resources... it was a slippery slope to a very bad place.

With that assumption removed, the responsibility now lies with the back-end stores which provide data for the resources - they will be queried by a couple of constructs added to CacheCow.Server. If you opt to implement that part for your API, then you have a super-efficient API. If not, there are some defaults to do the work for you - although sub-optimal. All of this will be explained in the CacheCow.Server posts, but the point is that CacheCow.Server is now a clean abstraction for HTTP Caching, as clean as I could make it. Judge for yourself.

What is HTTP Caching?

Caching is a very familiar notion in programming and pretty much every developer uses it on a regular basis. This familiarity has a downside, since HTTP Caching is more complex and in many ways different to the routine caching in code - hence it is very common to see misunderstandings even amongst senior developers. If you ask an average developer this question: "In HTTP Caching, where does the cache data get stored?" you are probably more likely to hear the wrong answer "server" than the correct answer "client". In fact, many developers look to improve their server-side code's performance by turning on caching on the server, while if the callers ignore the caching directives it will not result in any benefit.

This reminds me of a blog post I wrote 6 years ago where I used HTTP Caching as an example of a mixed-concern (as opposed to server-concern or client-concern) where "For HTTP caching to work, client and server need to work in tandem". This is a key difference from the usual caching scenarios seen every day. What makes HTTP Caching even more complex is the concurrency primitives, built in starting with HTTP 1.1 - we will look into those below.

I know HTTP Caching is hardly new and has been explained many times before. But considering the number of times I have seen it completely misunderstood, I think it deserves 5-10 minutes of your time - even if only as a refresher.


Resources vs. Representations

REST advocates exposing services through a uniform API (where HTTP is one such implementation) allowing resources to be created, modified and queried using the API. A resource is addressed by its location identifier or URL (e.g. /api/car/123). When a client requests a resource, only a representation of the resource is sent back. This means that the client receives only one representation out of many possible representations. This also means that when the client caches the representation, the cached copy is only valid if the representation requested matches the one cached. And finally, a client might cache different representations of the same resource. But what does all of this mean?

HTTP GET - The server serves a representation of the resource. The server also sends cache directives.

A resource could be represented differently in terms of format, encoding, language and other presentation concerns. HTTP provides semantics for the client to express its preferences in such concerns with headers such as Accept, Accept-Language and Accept-Encoding. There could be other headers that can result in alternative representations. The server is responsible for returning the definitive list of such headers in the Vary header.
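
One way to picture how Vary keeps representations apart on the client is to fold the request headers it names into the cache key. The sketch below assumes exactly that; BuildKey and its key format are hypothetical and are not CacheCow's actual key scheme. The request headers are passed in a case-insensitive dictionary, since HTTP header names are case-insensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VaryKeySketch
{
    // Hypothetical key scheme: URL plus the values of the request headers
    // listed in the server's Vary header, in a stable (sorted) order.
    // requestHeaders is expected to use a case-insensitive comparer.
    public static string BuildKey(
        string url,
        IEnumerable<string> varyHeaders,
        IDictionary<string, string> requestHeaders)
    {
        var parts = varyHeaders
            .Select(h => h.Trim().ToLowerInvariant())
            .OrderBy(h => h, StringComparer.Ordinal)
            .Select(h => h + "=" + (requestHeaders.TryGetValue(h, out var v) ? v : ""));
        return url + "|" + string.Join("&", parts);
    }

    public static void Main()
    {
        var wantsJson = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["Accept"] = "application/json"
        };
        var wantsXml = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["Accept"] = "application/xml"
        };
        var vary = new[] { "Accept" };

        // Same URL, different Accept header: two separate cache entries.
        Console.WriteLine(BuildKey("/api/car/123", vary, wantsJson)); // prints "/api/car/123|accept=application/json"
        Console.WriteLine(BuildKey("/api/car/123", vary, wantsXml));  // prints "/api/car/123|accept=application/xml"
    }
}
```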

Cache Directives

The server is responsible for returning cache directives along with the representation. The Cache-Control header is the de facto cache directive, defining whether the representation can be cached, for how long, whether by the end client or also by HTTP intermediaries/proxies, etc. HTTP 1.0 had the simple Expires header, which only defined the absolute expiry time of the representation.

You could also think of other cache-related headers as cache directives (although purely speaking they are not) such as ETag, Last-Modified and Vary.
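
For a feel of what these directives look like from code, here is a small sketch using the CacheControlHeaderValue type from System.Net.Http. The header value itself is just an example I made up.

```csharp
using System;
using System.Net.Http.Headers;

public static class CacheDirectiveSketch
{
    public static void Main()
    {
        // Example directive: cacheable by anyone, fresh for 60 seconds,
        // and must be validated with the server once stale.
        var cacheControl = CacheControlHeaderValue.Parse("public, max-age=60, must-revalidate");

        Console.WriteLine(cacheControl.Public);         // prints "True"
        Console.WriteLine(cacheControl.MaxAge);         // prints "00:01:00"
        Console.WriteLine(cacheControl.MustRevalidate); // prints "True"
        Console.WriteLine(cacheControl.NoStore);        // prints "False"
    }
}
```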

Resource Version Identifiers (Validators)

HTTP 1.1 defines ETag as an opaque identifier which defines the version of the resource. ETag (or EntityTag) can be strong or weak. Normally a strong ETag identifies the version of the representation while a weak ETag is only at the resource level.

Last-Modified header was the main validator in HTTP 1.0 but since it is based on a date with up-to-a-second precision, it is not suitable for achieving high consistency since a resource could change multiple times in a second.

CacheCow supports both validators (ETag and Last-Modified) and combines these two notions in the construct TimedETag.

Validating (conditional) HTTP Calls

A GET call can ask the server for the resource on the condition that the resource has been modified with respect to its validator. In this case, the client sends ETag(s) in the If-None-Match header or the Last-Modified date in the If-Modified-Since header. If validation matches and no change was made, the server returns status 304 (Not Modified); otherwise the resource is sent back.

For a PUT (and DELETE) call, the client sends validators in If-Match or If-Unmodified-Since. The server performs the action if validation matches; otherwise status 412 (Precondition Failed) is sent back.
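
The exchange described above can be sketched with plain HttpClient and no caching library at all. FakeOriginHandler is a hypothetical in-process stand-in for a server holding a single resource with a strong ETag; it compares tags by simple equality, which is a simplification of the comparison rules in the HTTP specification.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an origin server with one versioned resource.
public class FakeOriginHandler : HttpMessageHandler
{
    private string _body = "v1";
    private int _version = 1;

    private EntityTagHeaderValue CurrentETag =>
        new EntityTagHeaderValue("\"" + _version + "\"");

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (request.Method == HttpMethod.Get)
        {
            // Validating GET: 304 if the client's ETag is still current.
            if (request.Headers.IfNoneMatch.Any(tag => tag.Equals(CurrentETag)))
                return new HttpResponseMessage(HttpStatusCode.NotModified);

            var ok = new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = new StringContent(_body)
            };
            ok.Headers.ETag = CurrentETag;
            return ok;
        }

        if (request.Method == HttpMethod.Put)
        {
            // Conditional PUT: 412 if the client's ETag is out of date.
            if (request.Headers.IfMatch.Count > 0 &&
                !request.Headers.IfMatch.Any(tag => tag.Equals(CurrentETag)))
                return new HttpResponseMessage(HttpStatusCode.PreconditionFailed);

            _body = await request.Content.ReadAsStringAsync().ConfigureAwait(false);
            _version++;
            var updated = new HttpResponseMessage(HttpStatusCode.NoContent);
            updated.Headers.ETag = CurrentETag;
            return updated;
        }

        return new HttpResponseMessage(HttpStatusCode.MethodNotAllowed);
    }
}

public static class ConditionalCallsDemo
{
    public static async Task<HttpStatusCode[]> RunAsync()
    {
        const string url = "http://example.invalid/api/car/123";
        using (var client = new HttpClient(new FakeOriginHandler()))
        {
            var first = await client.GetAsync(url);              // 200 with an ETag
            var etag = first.Headers.ETag;

            var validate = new HttpRequestMessage(HttpMethod.Get, url);
            validate.Headers.IfNoneMatch.Add(etag);
            var second = await client.SendAsync(validate);       // 304, nothing changed

            var put = new HttpRequestMessage(HttpMethod.Put, url)
            {
                Content = new StringContent("v2")
            };
            put.Headers.IfMatch.Add(etag);
            var third = await client.SendAsync(put);             // 204, the ETag was current

            var stalePut = new HttpRequestMessage(HttpMethod.Put, url)
            {
                Content = new StringContent("v3")
            };
            stalePut.Headers.IfMatch.Add(etag);
            var fourth = await client.SendAsync(stalePut);       // 412, the ETag is now stale

            return new[] { first.StatusCode, second.StatusCode, third.StatusCode, fourth.StatusCode };
        }
    }

    public static void Main()
    {
        // prints "OK, NotModified, NoContent, PreconditionFailed"
        Console.WriteLine(string.Join(", ", RunAsync().GetAwaiter().GetResult()));
    }
}
```

The last call is the optimistic concurrency case: the second PUT carries the ETag from before the first PUT, so the server refuses it instead of silently overwriting the newer version.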

Consistency

The client normally keeps cached representations beyond their expiry. After the expiry it resorts to validating calls, and if they succeed it can carry on using the representations.

In fact the server can return representations with immediate expiry, forcing the client to validate every time before using the cached resource. This scenario can be called High-Consistency caching since it ensures the client always uses the most recent version.
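
The decision a client makes before using a cached representation can be sketched as a simple freshness check. This is a deliberately simplified sketch: CanServeWithoutValidation is a hypothetical helper that only looks at max-age, no-cache and no-store, and ignores Age, Expires, s-maxage and heuristic freshness.

```csharp
using System;
using System.Net.Http.Headers;

public static class FreshnessSketch
{
    // Hypothetical helper: true when the cached copy is still fresh and can be
    // used as-is; false when the client should make a validating call first.
    public static bool CanServeWithoutValidation(
        CacheControlHeaderValue cacheControl, DateTimeOffset storedAt, DateTimeOffset now)
    {
        if (cacheControl == null || cacheControl.NoStore || cacheControl.NoCache ||
            !cacheControl.MaxAge.HasValue)
            return false;

        return now - storedAt < cacheControl.MaxAge.Value;
    }

    public static void Main()
    {
        var storedAt = new DateTimeOffset(2018, 5, 13, 12, 0, 0, TimeSpan.Zero);
        var tenSecondsLater = storedAt.AddSeconds(10);

        var relaxed = CacheControlHeaderValue.Parse("max-age=60");
        var highConsistency = CacheControlHeaderValue.Parse("max-age=0, must-revalidate");

        // Fresh for a minute: the cached copy is used without a network call.
        Console.WriteLine(CanServeWithoutValidation(relaxed, storedAt, tenSecondsLater));         // prints "True"

        // Immediate expiry: every use triggers a validating call.
        Console.WriteLine(CanServeWithoutValidation(highConsistency, storedAt, tenSecondsLater)); // prints "False"
    }
}
```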

Is HTTP Caching suitable for my scenario?

Consider using HTTP Caching if:
  • Both your client and server are cache-aware. The client either is a browser which is the ultimate HTTP machine well capable of handling cache directives or a client that understands caching such as HttpClient + CacheCow.Client.
  • You need High-Consistency caching and cannot afford clients using outdated data
  • Saving on network bandwidth is important

HTTP Caching is unsuitable for you if:
  • Your client does not understand/implement HTTP caching
  • The server is unable to provide cache directives


In the next post, we will look into CacheCow.Client.