Thursday 21 March 2013

Performance series - Memory leak investigation Part 1

[Level T2] If you have worked in the industry for a few years, you have almost certainly encountered a memory leak. Whether in your own code or someone else's, it is an annoying problem to diagnose and fix. The scale of the problem and the effort needed to fix it grow rapidly with the scale and complexity of the application and its deployment.

Many companies use application benchmarking and performance testing to find these problems early on. But in my experience, problems found early through performance testing are not necessarily easier to solve. The lengthy cycle of gathering metrics, analysing them, forming one or more hypotheses, and then fixing and retesting can be very time consuming and can jeopardise delivery and deadlines.

In these cases, focusing on what is important and ignoring what is not saves the day. A common problem in troubleshooting performance bugs is information overload and the red herrings that confuse the picture. There is a myriad of performance counters, and you could spend hours or days explaining anomalies in them that have nothing to do with the problem you are trying to solve. Having worked as a medical doctor before, I have seen this in medicine many, many times. The human body, a very big application that is infinitely more complex than any software, demonstrates similar behaviour, and I have learnt to focus on identifying and fixing the problem at hand rather than explaining each and every oddity.

As such, I am starting this series with an overview of the performance counters and tools to be used. There are many posts and even books on this very topic. I am not claiming that this series is a comprehensive and all-you-need guide. I am actually not claiming anything. I am simply sharing my experience and hope it will make your troubleshooting journey less painful and help you find the culprit sooner.

A few notes on the platform

This is exclusively a Windows and mainly a .NET series, although parts of it could help with non-.NET applications. The focus is mainly on ASP.NET web applications, where the impact of even a small memory leak can be very big, but the same guidelines apply equally to desktop applications.

What is a memory leak?

A memory leak is usually defined as memory that is allocated by the application and then becomes inaccessible, so it can never be deallocated. But I do not like this definition. For example, this code does not qualify as a memory leak under it, yet it will lead to an out-of-memory condition:

var list = new List<byte[]>();
for (int i = 0; i < 1000000; i++)
   list.Add(new byte[500 * 1024]); // 500 KB

As can be seen, the data is still accessible to the application, but the application never releases the allocated memory and keeps piling up allocations on the heap.

For me, a memory leak is a constant or stepwise allocation of memory without deallocation, which usually leads to an OutOfMemoryException. I say usually because small leaks can be tolerated: many desktop applications are closed at the end of the day and many websites recycle their app pools daily. However, they are still leaks and you have to fix them when you get the time, since a change in conditions can turn a leak you could tolerate into one that brings down your site or application.
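As a rough illustration of this definition, the sketch below takes a series of memory readings and flags it when memory keeps climbing and is never given back. The class name, the `tolerance` and `minGrowth` parameters and their default values are assumptions made for this example, not established thresholds.

```csharp
using System;
using System.Collections.Generic;

public static class LeakHeuristic
{
    // Returns true when the samples keep climbing: no sample falls more than
    // `tolerance` (a fraction) below the highest value seen so far, and the
    // last sample is at least `minGrowth` (a fraction) above the first one.
    public static bool LooksLikeLeak(IList<long> samples,
                                     double tolerance = 0.05,
                                     double minGrowth = 0.5)
    {
        if (samples.Count < 3) return false; // too few points to judge a trend

        long peak = samples[0];
        foreach (long sample in samples)
        {
            if (sample < peak * (1.0 - tolerance)) return false; // memory was given back
            if (sample > peak) peak = sample;
        }

        // Require meaningful overall growth so that a flat line is not reported.
        return samples[samples.Count - 1] >= samples[0] * (1.0 + minGrowth);
    }
}
```

A stepwise series such as 100, 100, 150, 150, 200, 200 is flagged, while a sawtooth such as 100, 150, 200, 100, 150, 200 is not, because the drop back to 100 shows that memory is being released.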

Process memory/time in 4 different scenarios

In the top right diagram, we see constant allocation, but it is matched by periodic deallocation of memory, so it is not a leak. This scenario is common in ASP.NET caching (HttpRuntime.Cache), where 50% of the cache is purged when the memory used reaches a certain threshold.

In the bottom right diagram, memory grows until the size of the application's process reaches a threshold and then levels off: from that point either allocation stops or deallocation keeps pace with it. This pattern can be seen with SQL Server, which uses as much memory as it can until it reaches a threshold.
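The allocate-then-purge pattern of the caching scenario can be imitated with a toy cache. This is a hypothetical sketch and not how HttpRuntime.Cache is implemented: the class name, the oldest-first eviction and the choice to trim back to half of the limit are all assumptions made to reproduce the sawtooth shape.

```csharp
using System;
using System.Collections.Generic;

// Toy cache (hypothetical, not the ASP.NET implementation): once the stored
// bytes exceed a limit, the oldest entries are dropped until usage is at or
// below half of that limit.
public sealed class TrimmingCache
{
    private readonly long _limitBytes;
    private readonly Queue<string> _insertionOrder = new Queue<string>();
    private readonly Dictionary<string, byte[]> _items = new Dictionary<string, byte[]>();

    public TrimmingCache(long limitBytes) { _limitBytes = limitBytes; }

    public long UsedBytes { get; private set; }

    public void Add(string key, byte[] value)
    {
        if (_items.ContainsKey(key)) return; // keep the sketch simple: no overwrites
        _items.Add(key, value);
        _insertionOrder.Enqueue(key);
        UsedBytes += value.Length;

        if (UsedBytes > _limitBytes) Trim();
    }

    private void Trim()
    {
        while (UsedBytes > _limitBytes / 2 && _insertionOrder.Count > 0)
        {
            string oldest = _insertionOrder.Dequeue();
            UsedBytes -= _items[oldest].Length;
            _items.Remove(oldest);
        }
    }
}
```

Plotting `UsedBytes` while adding entries in a loop gives a sawtooth: usage climbs to the limit, falls to half, and climbs again, which is constant allocation without being a leak.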

How do I establish a memory leak?

All you have to do is ascertain that your application meets the criteria described above. Just observe the memory usage over time under load. The leak may only happen under a certain condition or when a particular piece of functionality is used, so make sure your load exercises those paths.

What tool do you need to do that? Even Task Manager can be good enough to start with.
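If you would rather record the numbers than watch them, the same observation can be scripted with `System.Diagnostics.Process`. This is only a minimal sketch: the `MemorySampler` class and its method names are made up for this example, and it reads `PrivateMemorySize64`, which I am assuming is close enough to what you would watch in Task Manager for a first look.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class MemorySampler
{
    // Reads the private memory size of a process `count` times,
    // pausing for `interval` between reads.
    public static long[] Sample(Process process, int count, TimeSpan interval)
    {
        var samples = new long[count];
        for (int i = 0; i < count; i++)
        {
            process.Refresh(); // discard cached values before reading
            samples[i] = process.PrivateMemorySize64;
            if (i < count - 1) Thread.Sleep(interval);
        }
        return samples;
    }

    // Prints one reading per line so the trend can be eyeballed or pasted into a chart.
    public static void PrintSamples(Process process, int count, TimeSpan interval)
    {
        foreach (long bytes in Sample(process, count, interval))
            Console.WriteLine("{0:N0} bytes", bytes);
    }
}
```

For example, `MemorySampler.PrintSamples(Process.GetCurrentProcess(), 10, TimeSpan.FromSeconds(5))` samples the current process, while `Process.GetProcessesByName("w3wp")` is one way to get hold of an ASP.NET worker process instead.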

Tools to use 

Normally you would start with Task Manager. A quick look at Task Manager, eyeballing the memory size of the process while the app is running under load, is a simple but effective start.

The mainstay of memory leak analysis is Windows Performance Monitor, aka Perfmon. It is very important for establishing a benchmark, monitoring changes across releases and identifying problems. In this post I will look into some essential performance counters.
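The counters that Perfmon charts can also be read from code through the `System.Diagnostics.PerformanceCounter` class. The sketch below is Windows-only and assumes a .NET Framework process (on newer .NET the class lives in a separate package, and the .NET CLR Memory category is not published for those processes). The `PerfmonReader` class is hypothetical, and it assumes the same instance name is valid in both counter categories.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class PerfmonReader
{
    // Perfmon names instances of the same executable "w3wp", "w3wp#1", "w3wp#2", ...
    public static string InstanceNameFor(string processName, int index)
    {
        return index == 0 ? processName : processName + "#" + index;
    }

    // Prints both counters `count` times, pausing for `interval` after each read.
    public static void Watch(string instanceName, int count, TimeSpan interval)
    {
        using (var privateBytes = new PerformanceCounter(
                   "Process", "Private Bytes", instanceName, true))
        using (var heapBytes = new PerformanceCounter(
                   ".NET CLR Memory", "# Bytes in all Heaps", instanceName, true))
        {
            for (int i = 0; i < count; i++)
            {
                Console.WriteLine("{0:T}  private bytes={1:N0}  bytes in all heaps={2:N0}",
                    DateTime.Now, privateBytes.NextValue(), heapBytes.NextValue());
                Thread.Sleep(interval);
            }
        }
    }
}
```

A call such as `PerfmonReader.Watch(PerfmonReader.InstanceNameFor("w3wp", 0), 60, TimeSpan.FromSeconds(5))` would log five minutes of readings for the first worker process. For anything longer, a Perfmon data collector set is the more practical choice.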

If you are investigating a serious or live memory leak, you need a .NET memory profiler. At the time of writing, three well-known profilers are:

  1. RedGate's ANTS Profiler
  2. JetBrains' dotTrace
  3. SciTech's Memory Profiler

Each tool has its own pros and cons. I have experience with SciTech's and I must say I am really impressed by it.

The most advanced tool is WinDbg, which is really useful but has a steep learning curve. I will look at using it for memory leak analysis in future posts.

Initial analysis of the memory leak

Let's look at a simple memory leak (you can find the snippets here on GitHub):

var list = new List<byte[]>();
for (int i = 0; i < 1000; i++)
    list.Add(new byte[1 * 1000 * 1000]);

The Process\Private Bytes performance counter is a popular measure of total memory consumption. Let's look at this counter for this app:

So how do we find out whether it is a managed or an unmanaged leak? We use the .NET CLR Memory\# Bytes in all Heaps counter:

As can be seen above, while private bytes increase constantly, heap allocation happens in steps. This stepwise increase of the heap is consistent with a managed memory leak. In contrast, let's look at this unmanaged leak code:

var list = new List<IntPtr>();
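for (int i = 0; i < 400; i++)
    // Body missing in the source; an unmanaged allocation like this is assumed.
    list.Add(Marshal.AllocHGlobal(1000 * 1000)); // System.Runtime.InteropServices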

And here the bytes in all heaps stay flat:

So this case is an unmanaged memory leak.
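The reasoning used for both cases can be written down as a small comparison of how much each counter grew over a run. This is a sketch under stated assumptions rather than a rule: the `LeakClassifier` name, the `minGrowthBytes` parameter and the "heap explains at least half of the growth" cut-off are invented for this example.

```csharp
using System;

public enum LeakKind { None, Managed, Unmanaged }

public static class LeakClassifier
{
    // Compares the growth of Process\Private Bytes with the growth of
    // .NET CLR Memory\# Bytes in all Heaps between the start and end of a run.
    public static LeakKind Classify(long privateStart, long privateEnd,
                                    long heapStart, long heapEnd,
                                    long minGrowthBytes)
    {
        long privateGrowth = privateEnd - privateStart;
        long heapGrowth = heapEnd - heapStart;

        if (privateGrowth < minGrowthBytes) return LeakKind.None; // process is not growing

        // Assumed cut-off: the managed heap explains at least half of the growth.
        return heapGrowth * 2 >= privateGrowth ? LeakKind.Managed : LeakKind.Unmanaged;
    }
}
```

With private bytes growing by about 1 GB and the heaps growing by nearly as much, the result is `Managed`. With private bytes growing by 400 MB while the heaps stay flat, it is `Unmanaged`.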

In the next post I will look at a few more performance counters and GC issues.
