When the systems become more complex and you have more layers between your user interface and database, not knowing the ID of the aggregate root before storing can pose unnecessary challenges. On the other hand, the cost of choosing GUID as the primary key of the database is usually unknown. That is why I set out to find out the exact cost of choosing GUID instead of IDENTITY INT as the primary key of the table.
Pros and cons
This is not a new topic. There are quite a few resources that discuss this topic and list the pros and cons. Jeff Atwood's blog post from 6 years ago is a short and sweet one explaining why StackOveflow decided to move the IDs (at least some of them) to GUID. It also contains many references to other related posts that discuss the topic - all of which are good reads especially this one.
For me, the main reason to choose a GUID is to know the ID of the entity even before persisting. If you do not have this requirement, you might still want to use GUID and avoid IDENTITY INT especially if you are using sharding or master-master replication. But for me, the first one is vital. Why?
If you use CQRS and commands to store your entities, you would naturally implement commands as an asynchronous operation. Your UI would have to find out about the result of the operation by polling or subscribing to the event published after processing the command. If you do not know the ID of your entity before persistence, you would end up using a different ID as a reference.
Problem with GUIDsApart from GUIDs being unreadable, there are performance implications for using GUIDs. The main ones are:
- GUID is a 16-byte type while INT is a 4-byte type. With storage nowadays very cheap, this normally is not a problem from the storage point of view. The problem is that reads (and writes) will be multiplied by 4x.
- The most serious problem is when you use the GUID ID as the clustered index of the table. In this case, with every INSERT you would be changing the layout of the data potentially having to move many data pages. This is unlike IDENTITY INT clustered indexes where data is stored sequentially providing the best performance.
BenchmarksThis old article provides a benchmark of the cost of the writes. Considering very little has been changed with the basic functionality of the SQL Server, this benchmark is still relevant - if not accurate. According to the results, cost of writing GUID primary keys in a database containing 1,000,000 records is 10x the cost of writing INT primary keys. This cost goes up exponentially when the table contains more rows.
One of the solutions presented in the article is to have a semi-sequential GUID generated in the database. In fact since SQL Server 2005, there is an option to do that using NEWSEQUENTIALID(). This will change the overhead to a mere linear 10% on both reads and writes.
The problem with these solutions is that the GUID is still generated in database so does not solve the problem of knowing the ID before storing the entity.
My solution and benchmarkAs discussed, the problem generally is to do with the non-sequential nature of the GUIDs. But who said we should use the primary key as the clustered index?!
Basically my solution is to:
- Add an IDENTITY INT column (I called it order Id - nothing to do with customer orders) and set it to be the clustered index. If you do not want to use an additional column, you could use a DATETIME column to store a timestamp which is usually very useful
- Add a UNIQUEIDENTIFIER column (GUID data type in SQL Server) as the main ID. Set that to be the primary key.
- Add a non-clustered index for the GUID ID
This will keep the data storage sequential while benefiting from system generated IDs instead of database generated IDs. There are two drawbacks: 1) storing an additional column 2) having an additional non-clustered index which takes some space and makes reads and writes slightly slower. Bear in mind, this is recommended only if you keep the table only for transactional usage and to store and retrieve by key and not for reporting. In most cases, this is what you would normally do especially if you are implementing CQS or CQRS.
|* Using an IDENTITY INT as the clustered index|
As can be seen, using this technique, performance of the GUID primary key is close to the performance of the INT primary key (roughly 10% overhead).