Sunday, 10 February 2013

CacheCow project - Current status and future roadmap

For those of you who might not know, CacheCow is an open source project that I started around 8 months ago to address the server-side and client-side caching in the new Microsoft HTTP stack - also known as ASP.NET Web API. I have been fortunate to get regular help, feedback and contributions from geek friends  (Tugberk, Alex Zeitler, Sayed Hashimi) and I am pleased to announce that version 0.3 is out now. As part of this release, cache storage implementation for Memcached has been released.

There are currently 9 packages available - have a look here.

CacheCow.Server

This is the server side component of CacheCow which works as a DelegatingHandler and looks at the request/response and implements server-side caching according to HTTP spec. This includes inspecting headers, generating ETag and adding headers, responding to conditional GET and PUT, etc. The idea is that you should only have to declare your caching strategy against the handler and it should take care of the rest.

Storing ETag and other important headers such as LatsModified (which I collectively call Cache Metadata) against the resources requires a persistent or in-memory storage. CacheCow.Server by default comes with an In-Memory implementation of the storage. However, these other storages are currently available:
  1. MongoDb
  2. RavenDB
  3. SQL Server

CacheCow.Client

As part of ASP.NET Web API, a new shiny HttpClient has been developed which uses the same HTTP pipeline model as the server. While you might be consuming Web API using JavaScript on the browser, I see that more often than not, this will be consumed by HttpClient on the server as more and more APIs move to Web API from WCF and classic (ASMX) web services.

As such, I believe the client side story of the caching is an important - and somehow neglected - one. So you will find a client-side CachingHandler in CacheCow.Client. Now on the client instead of storing Cache Metadata, we need to store actual responses. There are even more storage implementations available on the client:
  1. Redis
  2. File-based storage
  3. SQL Server
  4. Memcached

Future roadmap

There is a lot of work there to be done. [If you would like to contribute, please let me know]

Basically the plan is to implement below in the next 4-6 months:
  1. Turning all storage API to asynchronous. So IEntityTageStore and ICacheStore will turn completely async.
  2. Implementing total and per-domain storage quota for the CacheCow.Client. Part of this work has already been done but need to be consolidated.
  3. Implement all storages for both client and server:
    • Server-side:
      • Redis
      • Memcached
    • Client-side
      • RavenDb
      • MongoDB
  4. Implement resource organisation so declaring cache plan for a resource is easy (currently is not)
So watch this space. If you would like to contribute, please message me in twitter.

Saturday, 2 February 2013

Performance series: GUIDs vs. IDENTITY INT in RDBMS

[Level C2] We have been discussing with a few colleagues about whether we should adopt GUIDs as our primary keys. At the same time, we had a meeting with our DBAs to discuss other scenarios and the topic came up.

When the systems become more complex and you have more layers between your user interface and database, not knowing the ID of the aggregate root before storing can pose unnecessary challenges. On the other hand, the cost of choosing GUID as the primary key of the database is usually unknown. That is why I set out to find out the exact cost of choosing GUID instead of IDENTITY INT as the primary key of the table.

Pros and cons

This is not a new topic. There are quite a few resources that discuss this topic and list the pros and cons. Jeff Atwood's blog post from 6 years ago is a short and sweet one explaining why StackOveflow decided to move the IDs (at least some of them) to GUID. It also contains many references to other related posts that discuss the topic - all of which are good reads especially this one.

Why GUID?

For me, the main reason to choose a GUID is to know the ID of the entity even before persisting. If you do not have this requirement, you might still want to use GUID and avoid IDENTITY INT especially if you are using sharding or master-master replication. But for me, the first one is vital. Why?

If you use CQRS and commands to store your entities, you would naturally implement commands as an asynchronous operation. Your UI would have to find out about the result of the operation by polling or subscribing to the event published after processing the command. If you do not know the ID of your entity before persistence, you would end up using a different ID as a reference.

Problem with GUIDs

Apart from GUIDs being unreadable, there are performance implications for using GUIDs. The main ones are:

  1. GUID is a 16-byte type while INT is a 4-byte type. With storage nowadays very cheap, this normally is not a problem from the storage point of view. The problem is that reads (and writes) will be multiplied by 4x.
  2. The most serious problem is when you use the GUID ID as the clustered index of the table. In this case, with every INSERT you would be changing the layout of the data potentially having to move many data pages. This is unlike IDENTITY INT clustered indexes where data is stored sequentially providing the best performance.

Benchmarks

This old article provides a benchmark of the cost of the writes. Considering very little has been changed with the basic functionality of the SQL Server, this benchmark is still relevant - if not accurate. According to the results, cost of writing GUID primary keys in a database containing 1,000,000 records is 10x the cost of writing INT primary keys. This cost goes up exponentially when the table contains more rows.

One of the solutions presented in the article is to have a semi-sequential GUID generated in the database. In fact since SQL Server 2005, there is an option to do that using NEWSEQUENTIALID(). This will change the overhead to a mere linear 10% on both reads and writes.

The problem with these solutions is that the GUID is still generated in database so does not solve the problem of knowing the ID before storing the entity.

My solution and benchmark

As discussed, the problem generally is to do with the non-sequential nature of the GUIDs. But who said we should use the primary key as the clustered index?!

Basically my solution is to:

  1. Add an IDENTITY INT column (I called it order Id - nothing to do with customer orders) and set it to be the clustered index. If you do not want to use an additional column, you could use a DATETIME column to store a timestamp which is usually very useful
  2. Add a UNIQUEIDENTIFIER column (GUID data type in SQL Server) as the main ID. Set that to be the primary key.
  3. Add a non-clustered index for the GUID ID

This will keep the data storage sequential while benefiting from system generated IDs instead of database generated IDs. There are two drawbacks: 1) storing an additional column 2) having an additional non-clustered index which takes some space and makes reads and writes slightly slower. Bear in mind, this is  recommended only if you keep the table only for transactional usage and to store and retrieve by key and not for reporting. In most cases, this is what you would normally do especially if you are implementing CQS or CQRS.

* Using an IDENTITY INT as the clustered index

As can be seen, using this technique, performance of the GUID primary key is close to the performance of the INT primary key (roughly 10% overhead).

Conclusion

Performance of a GUID primary key is acceptable and adds mere a 10% overhead, if we use another INT or DATETIME timestamp as the clustered index.