JustMock Lite is now open source

A few weeks ago something important happened. I am very happy to say that JustMock Lite is now open source. You can find it on GitHub.

In this post I would like to share a few bits of history. JustMock was my first project at Telerik. It was created by a team of two: the managed API was designed and implemented by my friend and colleague Mehfuz, and I implemented the unmanaged CLR profiler. The project was done in a very short time. I spent 6 weeks implementing all the functionality for so-called elevated mocking, which includes mocking static methods, non-virtual methods and sealed types. After a few iterations, JustMock was released early in April 2010.

I remember my very first day at Telerik. I had a meeting with Hristo Kosev and together we set the project goals. It turned out JustMock was just an appetizer for JustTrace. Back then we did not have much experience with the CLR unmanaged profiling API, and Hristo wanted to extend the Telerik product family with a performance and memory profiling tool. So, the plan was to start with JustMock and gain know-how before we built JustTrace. Step by step we extended the team, and the JustMock/JustTrace team was created. Here is the door sign that the team used to have.

jmjt

Later the team changed its name to MATTeam (mocking and tracing team).

Looking back, I think we built two really good products. As far as I know, at the time of writing this post JustMock is still the only tool that can mock most of the types in the mscorlib.dll assembly. JustTrace also has its merits. It was the first .NET profiler with support for profiling managed Windows Store apps. I left MATTeam a year ago and I hope I can soon tell you what I am working on. Stay tuned.

Native code profiling with JustTrace

The latest JustTrace version (Q1 2014) has some neat features. It is now possible to profile unmanaged applications with JustTrace. In this post I am going to show you how easy it is to profile native applications.

For the sake of simplicity I am going to profile the notepad.exe editor, as it is available on every Windows machine. First, we need to set up the symbol path folder so that JustTrace can correctly decode the native call stacks. This folder is where all the required *.pdb files should be located.

jtsettings

In most scenarios, we want to profile the code we wrote from within Visual Studio. If your build generates *.pdb files then it is not required to set up the symbols folder. However, in order to analyze the call stacks collected from notepad.exe we must download the debug symbols from the Microsoft Symbol Server. The easiest way to obtain the debug symbol files is to use symchk.exe, which comes with the Microsoft Debugging Tools for Windows. Here is how we can download the notepad.pdb file.

symchk.exe c:\Windows\System32\notepad.exe /s SRV*c:\symbols*http://msdl.microsoft.com/download/symbols

[Note that in order to decode full call stacks you may need to download *.pdb files for other dynamic libraries, such as user32.dll and kernelbase.dll. With symchk.exe you can download debug symbol files for more than one module at once. For more details you can check the Using SymChk page.]

Now we are ready to profile the notepad.exe editor. Navigate to the New Profiling Session->Native Executable menu, enter the path to notepad.exe and click the Run button. Once notepad.exe has started, open some large file and use the timeline UI control to select the time interval of interest.

jtnative

In closing, I would say that JustTrace has become a versatile profiling tool which is no longer constrained to the .NET world. There are plenty of unmanaged applications written in C or C++, and JustTrace can help improve their performance. You should give it a try.

Profiling Data Visualization

Every .NET performance profiling tool offers some form of data visualization. Usually, the profiling data is shown in a hierarchical representation such as calling context tree (CCT) or calling context ring chart (CCRC). In this post I would like to provide a short description of the most commonly used profiling data visualizations.

In general, CCT is well understood. Software developers find CCT easy to work with as it represents the program workflow. For example, if method A() calls method B() which in turn calls method C() then the CCT will represent this program workflow as follows:

A() -> B() -> C()

CCT data also contains the time spent inside each method (not shown here for the sake of simplicity), and many .NET profilers use a CCT or a CCRC to visualize the collected data.
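To make the structure concrete, here is a minimal sketch of what a CCT node might look like. The type and field names are hypothetical; a real profiler stores more data per node (self time, hit counts, thread information and so on).

using System.Collections.Generic;

// Hypothetical CCT node: one method occurrence in a specific call path.
class CctNode
{
    public string Method;                                   // method name
    public double InclusiveTime;                            // time spent in Method, callees included
    public List<CctNode> Children = new List<CctNode>();    // methods called from this path

    public CctNode(string method) { Method = method; }
}

class CctExample
{
    static CctNode BuildSampleTree()
    {
        // The A() -> B() -> C() workflow above becomes a three-node chain.
        var a = new CctNode("A");
        var b = new CctNode("B");
        var c = new CctNode("C");
        a.Children.Add(b);
        b.Children.Add(c);
        return a;
    }
}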

While the CCT is a useful and easy-to-understand data visualization, it has its limitations. We often build big applications with complex program workflows, and for such applications navigating the CCT becomes harder. Often the CCT size becomes overwhelming and the developers cannot grasp the data. To make a big CCT understandable, profiling tools offer some form of aggregation. The most common aggregation is the so-called hot spot tree (HST). Sometimes it is called a caller context tree, but for the purpose of this post we will use the former name. Here is the HST for our previous example:

C() <- B() <- A()

We said that the HST is a form of aggregation, but we didn’t explain what we aggregate and how. The HST aggregates CCT nodes by summing the time spent inside a method for each unique call path. Let’s make it more concrete with a simple example. Suppose we have an application with the following program workflow (CCT):

A() -> B() -> C(/* 4s */)
           |
           |--> D() -> C(/* 6s */)

The time spent in method C() is 4 seconds when it is called from B(), and 6 seconds when it is called from D(). So, the total time spent in method C() is 10 seconds. We can build the HST for method C() by aggregating the time for each unique call path.

C(/* 10s */) <- B(/* 4s */) <- A(/* 4s */)
              |
              |-- D(/* 6s */) <- B(/* 6s */) <- A(/* 6s */)

The HST shows how the time spent in method C() is distributed across the unique call paths. If we build an HST for every method, it becomes obvious why the HST is such a useful data visualization. Instead of focusing on the whole CCT, which may contain millions of nodes, we can focus on, say, the top 10 most expensive HSTs, as they show us the 10 most time-consuming methods. In fact, I find HSTs so useful that I would argue that showing the CCT is not needed at all when solving difficult performance issues.
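To illustrate the aggregation itself, here is a minimal sketch that computes the per-method totals behind an HST from the example above. The flat (path, time) input and the types are my own simplification for the sake of the example, not how a real profiler stores its data.

using System;
using System.Linq;

class HotSpotSketch
{
    static void Main()
    {
        // The example CCT: 4 s are spent in C on the path A -> B -> C,
        // and 6 s are spent in C on the path A -> B -> D -> C.
        var cctPaths = new[]
        {
            (Path: new[] { "A", "B", "C" },      Time: 4.0),
            (Path: new[] { "A", "B", "D", "C" }, Time: 6.0),
        };

        // Group by the leaf method and sum the time over all unique call paths.
        var hotSpots = cctPaths
            .GroupBy(p => p.Path.Last())
            .Select(g => new
            {
                Method = g.Key,
                TotalTime = g.Sum(p => p.Time),
                Callers = g.Select(p => string.Join(" <- ", p.Path.Reverse()) + $" ({p.Time} s)")
            })
            .OrderByDescending(h => h.TotalTime);

        foreach (var h in hotSpots)
        {
            Console.WriteLine($"{h.Method}: {h.TotalTime} s");   // C: 10 s
            foreach (var caller in h.Callers)
                Console.WriteLine("  " + caller);                // C <- B <- A (4 s), C <- D <- B <- A (6 s)
        }
    }
}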

I would like to elaborate on the last sentence, as it is related to the DIKW pyramid. While the CCT is a useful profiling data visualization, it is mostly about data. Data is just numbers/symbols; it cannot answer the “who”, “what”, “where” and “when” questions. Processing the CCT into HSTs transforms data into information, and HSTs can answer where the time is spent inside an application. I am not going to address all the theoretical details here, but I would like to dig further into some details about performance profiling.

We saw why HSTs are useful, but sometimes we want to know more. For example, is our application CPU or I/O bound? Or maybe we are interested in the application dynamics (e.g. when it is CPU bound and when it is I/O bound). Component interaction is also an important question for many of us. The vendors of profiling tools recognize these needs and try to build better products. For example, Microsoft provides Tier Interaction Profiling, Telerik JustTrace provides Namespace grouping, JetBrains dotTrace provides Subsystems, SpeedTrace offers Layer Breakdown and so on. While all these visualizations are useful, sometimes a simple diagram works even better.
Layers
The point is that there is no silver bullet. A single profiling data visualization cannot answer every question. I think ideas like a Profiling Query Language (PQL) have a lot of potential. It doesn’t matter whether it will be PQL or LINQ over some well-established domain model (e.g. LINQ to Profiling Data); the language is only a detail. The important thing is that the collected data should be queryable. Once the data is queryable, the developer can write the proper queries, and of course each profiling tool can ship with a set of predefined common queries. I hope we will see PQL in action very soon 😉
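To make the idea concrete, here is what such a query could look like. Everything in this sketch – the CallRecord and ProfilingSession types and their properties – is imaginary; no shipping profiler exposes this model, it only illustrates the feel of queryable profiling data.

using System;
using System.Collections.Generic;
using System.Linq;

// Imaginary, illustration-only model of queryable profiling data.
class CallRecord
{
    public string Module;
    public string Method;
    public double SelfTime;   // seconds spent in the method itself
}

class ProfilingSession
{
    public List<CallRecord> Calls = new List<CallRecord>();
}

class PqlSketch
{
    static void Main()
    {
        var session = new ProfilingSession();   // in reality this would be filled by a profiler

        // "PQL" expressed as plain LINQ: the top 10 methods of MyApp.dll by total self time.
        var topMethods = (from call in session.Calls
                          where call.Module == "MyApp.dll"
                          group call by call.Method into g
                          orderby g.Sum(c => c.SelfTime) descending
                          select new { Method = g.Key, SelfTime = g.Sum(c => c.SelfTime) })
                         .Take(10);

        foreach (var m in topMethods)
            Console.WriteLine($"{m.Method}: {m.SelfTime} s");
    }
}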

 

Analyzing crash dumps with ClrMD

Microsoft recently released the first beta version of the Microsoft.Diagnostics.Runtime component. Lee Culver describes ClrMD as follows:

ClrMD is a set of advanced APIs for programmatically inspecting a crash dump of a .NET program much in the same way as the SOS Debugging Extensions (SOS). It allows you to write automated crash analysis for your applications and automate many common debugger tasks.

Lee Culver also showed a nice LINQ query over heap objects; a rough sketch of such a heap query follows the list below. This example reminds me of the SharpDevelop approach to the Profiler Query Language (PQL). Here is a short description of PQL:

  •  PQL v1.1 (SharpDevelop 4.x)
    •  Extend PQL capabilities (Views and Categories)
    •  PQL code completion and editing improvements
    •  PQL performance improvements (optimized LINQ implementation LINQ-to-Profiler)
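For context, the heap query Lee Culver demonstrated looks roughly like the sketch below. I am writing it from memory against the ClrMD beta, so treat the exact member names (LoadCrashDump, TryGetDacLocation, GetHeap, EnumerateObjects, GetObjectType, GetSize) as assumptions that may differ between ClrMD versions.

using System;
using System.Linq;
using Microsoft.Diagnostics.Runtime;

class HeapStatsSketch
{
    static void Main(string[] args)
    {
        // args[0] is the path to a crash dump of a .NET process.
        // Note: written from memory against the beta API; member names may differ in later versions.
        using (DataTarget target = DataTarget.LoadCrashDump(args[0]))
        {
            string dac = target.ClrVersions[0].TryGetDacLocation();
            ClrRuntime runtime = target.CreateRuntime(dac);
            ClrHeap heap = runtime.GetHeap();

            // Heap statistics: object count and total size per type, largest first.
            var stats = from address in heap.EnumerateObjects()
                        let type = heap.GetObjectType(address)
                        where type != null
                        group address by type into g
                        let size = g.Sum(a => (long)g.Key.GetSize(a))
                        orderby size descending
                        select new { g.Key.Name, Count = g.Count(), Size = size };

            foreach (var item in stats.Take(20))
                Console.WriteLine("{0,12:n0} {1,8:n0}  {2}", item.Size, item.Count, item.Name);
        }
    }
}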

I am very keen on the PQL concept, and the ClrMD component makes it possible. I decided to get a high-level overview of the ClrMD implementation, so I loaded the assembly in JustDecompile.

clrmd1

I was not surprised to see that Microsoft reused a lot of code from PerfView, which is built around the IXCLRDataProcess interface. Actually, the latest PerfView version (1.4.1.0) already comes with the ClrMD component instead of the ClrMemDiag assembly. Something else caught my eye though: the Redhawk namespace. And this is a very nice surprise indeed.

For those who are not familiar with Redhawk I would recommend the following Channel 9 page. As far as I know, there is nothing official about Redhawk yet and the ClrMD component might be the first clue. What is coming next? Only time will tell.

JustTrace Q1 2013 SP1 is out

For the last two weeks I was busy with the latest release of JustTrace. We had some hard times with the JustTrace Q1 2013 release, but in the end we managed to fix a critical bug and shipped SP1 one week later. There are a lot of new features in this release, but I am going to focus on one thing only:

Visual Studio integration is out-of-process

If you use Visual Studio 2012 then you have probably noticed a new process, XDesProc.exe, on your machine.
outofproc1
outofproc2

This is the new XAML UI Designer. Let’s use the Spy++ tool on the XAML designer in Visual Studio 2012 and see what’s interesting.
outofproc3
outofproc4

If you convert 0x1554 to a decimal number you will get 5460, and that’s the XDesProc.exe process id shown on the screenshot above. So, what we see in Visual Studio is actually a window that belongs to the XDesProc.exe process. Moving the XAML UI Designer out of Visual Studio has a lot of benefits – better isolation, improved performance and better resource utilization. The same approach is used by Internet Explorer, Firefox and Chrome. You can read the full story about XDesProc.exe here.

So, why is it important to use out-of-process integration with Visual Studio? The short answer is more memory space. At present Visual Studio is a 32-bit process, which means it can use at most 4GB of address space (in most cases 2GB). This is a serious limitation for most profiling tools, including JustTrace. For a long time we wanted to move JustTrace out of Visual Studio, and finally we did it. At the same time, JustTrace was built around the idea of providing seamless integration with Visual Studio. I am happy to say that the new JustTrace gives you the best of both worlds. Try it and give us your feedback.

 

Profiler types and their overhead

It is a common opinion that profiling tools are slow. Every time I stumble upon this statement I ask for the definition of slow. Most of the time I get the answer that a profiler is slow when it adds more than 100% overhead.

At present there are many commercial profilers that are fast (according to the definition above). So, why don’t people use profiling tools then?

I think the confusion comes from the fact that there are different profiler types, and some of them are fast while others are slow. Let’s see what these profiler types are. It is common to classify profiling tools into two major categories:

  • memory profilers
  • performance profilers

Memory profilers are used when one wants to solve memory-related issues like memory leaks, high memory consumption and so on. Performance profilers are used when one wants to solve performance-related issues like high CPU usage or concurrency problems. These categories are not set in stone though. For example, too much memory allocation/consumption can cause performance issues.

Let’s see why some performance profilers are fast while others are slow. All profilers, and performance profilers in particular, can be classified into yet another two categories:

  • event-based profilers (also called tracing profilers)
  • statistical profilers (also called sampling profilers)

Event-based profilers collect data on events from a well-defined event set. Such an event set may contain events for function enter/leave, object allocation, thrown exceptions and so on. Statistical profilers usually collect data/samples at regular intervals (e.g. take a sample every 5 milliseconds).
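As a rough illustration (these are my own made-up types, not the CLR Profiling API), the data the two profiler types collect could be modeled like this:

// Illustration only: a made-up event model for the two profiler types.
enum ProfilerEventKind
{
    EnterFunction,      // recorded by an event-based/tracing profiler on every call
    LeaveFunction,
    ObjectAllocated,
    ExceptionThrown,
    StackSample         // recorded by a statistical/sampling profiler every N milliseconds
}

struct ProfilerEvent
{
    public ProfilerEventKind Kind;
    public long TimestampTicks;
    public int ThreadId;
}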

At first, it is not obvious whether event-based/tracing profilers are faster or slower than statistical/sampling ones. So, let’s first have a look at the current OOP platforms. For the sake of simplicity we will look at the current .NET platform.

Each .NET application makes use of the .NET Base Class Library (BCL). Because of current OOP design principles, most frameworks/libraries have a small set of public interfaces and a fair amount of private, encapsulated APIs. Your application can call only a small number of public BCL interfaces, while they in turn can call much richer private APIs. So, you see only the “tip of the iceberg”. It is a common scenario for a single call to a public BCL interface to result in a few dozen private interface calls.

Let’s take an application that runs for 10 seconds and examine the following two scenarios.

Scenario 1

The application makes heavy use of “chatty” interface calls. It is easy to make 1000 method calls per second, so over 10 seconds an event-based/tracing performance profiler has to process 20,000 events (10,000 enter-function events + 10,000 leave-function events). A statistical/sampling performance profiler (assuming that it collects data every 5 ms) has to process only 2,000 samples. So, it is relatively safe to conclude that the tracing profiler will be slower than the sampling one, and this is the behavior we most often see.
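These numbers are easy to reproduce with a back-of-the-envelope calculation; the run length, call rate and sampling interval below are simply the values assumed in the scenario.

const int runSeconds = 10;
const int callsPerSecond = 1000;
const int samplingIntervalMs = 5;

int tracingEvents  = runSeconds * callsPerSecond * 2;        // enter + leave events
int samplingEvents = runSeconds * 1000 / samplingIntervalMs; // timer samples

System.Console.WriteLine($"tracing:  {tracingEvents:n0} events");  // 20,000
System.Console.WriteLine($"sampling: {samplingEvents:n0} events"); //  2,000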

Scenario 2

Suppose your application is computation bound and performs a lot of loops and simple math operations. It is even possible that your “main” method calls only a single BCL method (e.g. Console.WriteLine). In this case the event-based/tracing performance profiler has to process only a few events, while the statistical/sampling performance profiler has to process 2,000 samples again. So, in this scenario it is safe to say that the tracing profiler will be faster than the sampling one.

In reality, statistical/sampling profilers have constant 2-10% overhead. Event-based/tracing profilers often have 300-1000% overhead.

Tracing or Sampling Profiler

The rule of thumb is that you should start with a sampling profiler. If you cannot solve the performance issue, then you should go for a tracing profiler. Tracing profilers usually collect much more data, which helps to get a better understanding of the performance issue.

[Note: If you are not interested in the theoretical explanation you can skip the following two paragraphs.]

If you’ve read the last sentence carefully, you’ve seen that I’ve implied that the more data the profiler collects, the easier it is to solve the performance problem. Well, that’s not entirely true. You don’t really need data. As Richard Hamming said, “The purpose of computing is insight, not numbers”. So, we don’t need data but rather “insight”. How do we define “insight” then? Well, the answer comes from the relatively young fields of information management and knowledge management. We define data, information, knowledge and wisdom as follows:

  • data: numbers/symbols
  • information: useful data that helps to answer “who”, “what”, “where” and “when” questions; information is usually processed data
  • knowledge: further processed information; it helps to answer “how” questions
  • wisdom: processed and understood knowledge; it helps to answer “why” questions

So, it seems that we are looking for “information”. Here algorithmic information theory comes to help. This theory is a mixture of Claude Shannon’s information theory and Alan Turing’s theory of computation. Andrey Kolmogorov and, more recently, Gregory Chaitin defined quantitative measures of information. Though they followed different approaches, an important consequence of their work is that the output of any computation cannot contain more information than its input in the first place.

Conclusion

Drawing a parallel back to profiling, we now understand why we sometimes have to use event-based/tracing profilers. As always, everything comes at a price. Don’t assume that profiling tools are slow. Use them and make your software better.

CLR Profilers and Windows Store apps

Last month Microsoft published a white paper about profiling Windows Store apps. The paper is very detailed and provides rich information on how to build a CLR profiler for Windows Store apps. I was very curious to read it, because at the time we released JustTrace Q3 2012 there was no documentation. After all, I was curious to know whether JustTrace is compliant with the guidelines Microsoft provided. It turns out it is. Almost.

At the time of writing, the JustTrace profiler uses a few Win32 functions that are not officially supported for Windows Store apps. The only reason for this is the support for Windows XP. A typical example is CreateEvent, which is not supported for Windows Store apps but has been available since Windows XP. Instead, one should use CreateEventEx, which is available since Windows Vista.

One option is to drop the support for Windows XP. I am a bit reluctant though. At the very least, such a decision should be carefully thought out and must be supported by actual data about our customers using Windows XP. Another option is to accept the burden of developing and maintaining two source code bases – one for Windows XP and another for Windows Vista and higher. Whatever decision we make, it will be thoroughly thought out.

Let’s have a look at the paper. There is one very interesting detail about memory profiling.

The garbage collector and managed heap are not fundamentally different in a Windows Store app as compared to a desktop app.  However, there are some subtle differences that profiler authors need to be aware of.

It continues even more interesting.

When doing memory profiling, your Profiler DLL typically creates a separate thread from which to call ForceGC. This is nothing new.  But what might be surprising is that the act of doing a garbage collection inside a Windows Store app may transform your thread into a managed thread (for example, a Profiling API ThreadID will be created for that thread)

Very subtle indeed. For a detailed explanation, you can read the paper. Fortunately JustTrace is not affected by this change.

In conclusion, I think the paper is very good. It is mandatory reading for anyone interested in building a CLR profiler for Windows Store apps. I would encourage you to look at the CLR Profiler implementation as well.

Profiling Tools and Standardization

Imagine you have the following job. You have to deal with different performance and memory issues in .NET applications. You often get questions from your clients like “Why is my application slow and/or consuming so much memory?” along with trace/dump files produced by profiling tools from different software vendors. Yeah, you guessed it right – your job is a tough one. In order to open the trace/dump files you must have installed the whole variety of profiling tools that your clients use. Sometimes you must have different versions of a particular profiling tool installed, a scenario that is rarely supported by the software vendors. Add on top of this the price and the different license conditions of each profiling tool and you will get an idea why your job is so hard.

I wish I could sing “Those were the days, my friend”, but I don’t think we have improved our profiling tools much today. The variety of trace/dump file formats is not justified. We need standardization.

Though I am a C++/C# developer, I have a good idea of what is going on in the Java world. There is no such variety of trace/dump file formats there. If you are investigating memory issues you will probably have to deal with IBM’s portable heap dump (PHD) file format or Sun’s HPROF. There is a good reason for this: the file format is provided by the JVM. The same approach is used in Mono. While this approach is far from perfect, it has a very important impact on the software vendors: it forces them to build their tools with standardization in mind.

Let me give you a concrete example. I converted the memory dump file format of the .NET profiler I work on to be compatible with the HPROF file format, and then I used a popular Java profiler to open it. As you might guess, the profiler successfully analyzed the converted data. There were some caveats during the conversion process, but it is a nice demonstration that with the proper level of abstraction we can build profiling tools for .NET and Java at the same time. If we can do this, then why don’t we have standardization for trace/dump files for .NET?

In closing, I think all software vendors of .NET profiling tools will benefit from such standardization. The competition will be stronger which will lead to better products on the market. The end-users will benefit as well.

Why do we need profiling tools?

Every project is defined by its requirements. The requirements can be functional and non-functional. Most of the time the developers are focused on the functional requirements only. This is how it is now and it probably won’t change much in the near future. We may say that developers are obsessed with the functional requirements. As a matter of fact, a few decades ago software engineers thought that the IDE of the future would look something like this:

This is quite different from today’s Visual Studio or Eclipse. The reason is not that it is technically impossible. On the contrary, it is technically possible, and this was one of the reasons for the great enthusiasm of software engineers back then. The reason this didn’t happen is simple: people are not that good at writing specifications. Today no one expects to build large software from a single, huge, monolithic specification. Instead we practice iterative processes, each time implementing a small part of the specification, while the specifications evolve during the project, and so on.

While we still struggle with the tools for verifying functional requirements, we have made big progress with the tools for verifying non-functional requirements. Usually we call such tools profilers. Today we have a lot of useful profiling tools that analyze software for performance and memory issues. Using a profiler as a part of the development process can be a big cost saver. Profilers liberate software engineers from the boring task of watching for performance issues all the time. Instead, the developers can stay focused on the implementation and verification of the functional requirements.

Take for example the following code:

FileStream fs = …
using (var reader = new StreamReader(fs, ...))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
       string line = reader.ReadLine();

       // process the line
    }
}

This is a straightforward and simple piece of code. It has an explicit intention – process a text file one line at a time. The performance issue here is that the Length property calls the native function GetFileSize(…), and this is an expensive operation. So, if you are going to read a file with 1,000,000 lines, then GetFileSize(…) will be called 1,000,000 times.
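One simple way to keep the same intention without the repeated Length calls is to rely on ReadLine() returning null at the end of the stream. A minimal sketch (path is a placeholder for whatever file you are reading):

using (var reader = new StreamReader(path))
{
    string line;
    while ((line = reader.ReadLine()) != null)   // no Position/Length checks, no GetFileSize calls
    {
        // process the line
    }
}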

Let’s have a look at another piece of code. This time the following code has quite different runtime behavior.

string[] lines = …
int i = 0;
while (i < lines.Length)
{
   ...
}

In both examples the pattern is the same, and this is exactly what we want: we want to use well-known and predictable patterns. The difference is that reading the Length of an array is a cheap operation, while reading the Length of a FileStream ends up in an expensive native call.

Take a look at the following 2-minute video to see how easy it is to spot and fix such issues (you can find the sample solution at the end of the post).

After all, this is why we want to use such tools – they work for us. It is much easier to fix performance and memory issues in the implementation/construction phase rather than in the test/validation phase of the project iteration.

Being proactive and using profiling tools throughout the whole application lifecycle will help you build better products and have happy customers.

WindowsFormsApplication1.zip