Software Performance Engineering

Every time we build software we have functional requirements. These requirements may not be well defined, but they are good enough to start prototyping, and we refine them as we go. After all, the functional requirements describe what we have to build. Sometimes we have additional requirements that describe how our software system should operate, rather than how it should behave. We call these non-functional requirements. For example, if we build a web site, a non-functional requirement could define the maximum page response time. In this post I am not going to write about the SPE approach in particular, but rather about the more general practice of software performance engineering.


Any time a new technology emerges, people want to know how fast it is. We seem obsessed with this question. Once we understand the basic scenarios where a new technology is applicable, we start asking questions about its performance. Unfortunately, talking about performance is not easy because of all the misunderstanding around it.

Not quick. Efficient.

Let’s set up some context. With any technology we are trying to solve a particular problem, not every problem. In a sense, every technology is bound to its time: today we generally solve different problems than we did 10 years ago. This is especially obvious in the IT industry because it changes so fast. Thus, when we talk about a given technology, it is important to understand where it comes from and what problems it tries to solve. Obviously, no technology can solve all current problems, and there is no best technology for everything and everybody.

We should understand another important aspect as well. Usually, when a new technology emerges, it keeps some compatibility with an older one. We don’t want people to relearn how to do trivial things. Often this compatibility has a big impact on performance.

Having set up the context, it should be clear that interpreting performance results is also time-bound. What we considered fast 5 years ago may not be fast any longer. Now, let’s try to describe informally what performance is and how we try to build performant software. Performance is a general term describing various system aspects such as operation completion time, resource utilization and throughput. While performance engineering is a quantitative discipline, it does not by itself define any criterion for what good or bad performance is. For the sake of this post, this definition is good enough to understand why performance is important in, say, algorithmic trading. In general, there is a clear connection between performance and cost.

IT Industry-Education Gap

Performance issues are rarely captured by a single value (e.g. CPU utilization), and thus they are difficult to understand. These problems become even more difficult in distributed and parallel systems. Despite the fact that performance is an important and difficult problem, most universities and other educational institutions fail to prepare their students to avoid and efficiently solve performance issues. The IT industry has recognized this fact, and companies like Pluralsight, Udacity and Coursera offer additional training on the topic.

In the rare cases where students are taught how to localize and solve performance issues, they use outdated textbooks from the ’80s. Current education cannot produce well-trained candidates, mostly because the teachers themselves have outdated knowledge. On the other hand, many (online) education companies offer highly specialized performance courses in, say, web development, C++, Java or .NET, which cannot help students understand performance issues in depth.

Sometimes academia tries to help the IT industry by providing tools like cache-oblivious algorithms or queueing network (QN) models, but abstracting away the real hardware often produces suboptimal solutions.

Engaging students in real-life projects can prepare them much better, whether in an open-source project or a collaboration with the industry. At present, students simply don’t get the chance to work on a big project and thus miss the opportunity to learn. Not surprisingly, the best resources on solving performance issues are the various blogs and case studies from big companies like Google, Microsoft, Intel and Twitter.

Performance Engineering

Often software engineers have to rewrite code or change the system architecture because of performance problems. To mitigate such expensive changes, many software engineers employ various tools and practices. Usually, these practices can be formalized as an iterative process which is part of the development process itself. A common simplified overview of such an iterative process might be as follows:

  • identify critical use cases
  • select a use case (by priority)
  • set performance goals
  • build/adjust performance model
  • implement the model
  • gather performance data (exercise the system)
  • report results

Different performance frameworks and approaches emphasize different stages and/or modelling techniques. What they all have in common, however, is that the process is iterative. We set performance goals and perform quantitative measurements, and we repeat the process until we meet the goals.

Most performance engineering practices rely heavily on tools and automation. Usually, they are part of various CI and testbed builds. This definitely streamlines the process and helps the software engineers. Still, there is a big caveat. Building a good performance model is not an easy task: you must have a very good understanding of the hardware and the whole system setup. The usual way to overcome this problem is to provide small, composable model templates for common tasks so the developer can build larger, more complex models.

In closing, I would say that there isn’t a silver bullet when it comes to solving performance issues. What makes them so hard is that they require a lot of knowledge and expertise in both software and hardware. There is plenty of room for improvement in both education and industry.

Efficient IO in Android

What could be simpler than a file copy? Well, it turns out I had underestimated this seemingly easy task.

Here is the scenario. During the very first startup of a NativeScript for Android application, the runtime extracts all JavaScript asset files to the internal device storage. The source code is quite simple and was based on this example.

static final int BUFSIZE = 100000;

private static void copyStreams(InputStream is, FileOutputStream fos) {
    BufferedOutputStream os = null;
    try {
        byte[] data = new byte[BUFSIZE];
        int count;
        os = new BufferedOutputStream(fos, BUFSIZE);
        while ((count =, 0, BUFSIZE)) != -1) {
            os.write(data, 0, count);
        }
    } catch (IOException e) {
        Log.e(LOGTAG, "Exception while copying: " + e);
    } finally {
        try {
            if (os != null) {
                os.close();
            }
        } catch (IOException e2) {
            Log.e(LOGTAG, "Exception while closing the stream: " + e2);
        }
    }
}
It is important to note that in our code the BUFSIZE constant has the value 100000, while in the original example the value is 5192. While this code works as expected, it turns out to be quite slow.

In our scenario we extract around 200 files, and on an LG Nexus 5 device this takes around 5.75 seconds. This is a lot of time. It turned out that most of this time was spent inside the garbage collector.

D/dalvikvm(8611): GC_FOR_ALLOC freed 265K, 2% free 17131K/17436K, paused 8ms, total 8ms
D/dalvikvm(8611): GC_FOR_ALLOC freed 398K, 4% free 16930K/17636K, paused 11ms, total 11ms
D/dalvikvm(8611): GC_FOR_ALLOC freed 197K, 4% free 16930K/17636K, paused 7ms, total 7ms
... around 650 more lines

The first thing I tried was to make the data variable a class member.

static final int BUFSIZE = 100000;

static final byte data[] = new byte[BUFSIZE];

private static void copyStreams(InputStream is, FileOutputStream fos) {
   // remove 'data' local variable

I thought this would solve the GC problem, but when I ran the application I was greeted with the following familiar log messages.

D/dalvikvm(8408): GC_FOR_ALLOC freed 248K, 2% free 17212K/17496K, paused 7ms, total 8ms
D/dalvikvm(8408): GC_FOR_ALLOC freed 417K, 4% free 17029K/17696K, paused 8ms, total 8ms
D/dalvikvm(8408): GC_FOR_ALLOC freed 199K, 4% free 17029K/17696K, paused 7ms, total 7ms
... around 330 more lines

This time it took around 2.25 seconds to extract the files, and the GC kicked in 330 times instead of 660. The GC ran half as often as in the previous version, which was better, but still too much.

The next thing I tried was setting BUFSIZE to 4096 instead of 100000.

static final int BUFSIZE = 4096;

This time it took around 0.85 seconds to extract the assets, and the GC kicked in 8 times.

D/dalvikvm(8218): GC_FOR_ALLOC freed 323K, 3% free 17137K/17496K, paused 8ms, total 8ms
D/dalvikvm(8218): GC_FOR_ALLOC freed 673K, 5% free 16947K/17684K, paused 8ms, total 9ms
D/dalvikvm(8218): GC_FOR_ALLOC freed 512K, 5% free 16947K/17684K, paused 8ms, total 9ms
... just 5 more lines

This was a nice improvement, but I thought it should be faster still. Puzzled by the relatively high level of GC activity, I decided to read the online documentation.

A specialized OutputStream for writing content to an (internal) byte array. As bytes are written to this stream, the byte array may be expanded to hold more bytes.

I should have read this before I started. It was a good lesson for me.

Once I knew what happens inside BufferedOutputStream, I decided simply not to use it. I now call the write method of FileOutputStream directly and voilà: the time to extract the assets is around 0.65 seconds, and the GC kicks in 4 times at most.
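With the buffering layer removed, the copy loop reduces to something like the following. This is a minimal sketch; the exception handling and Log calls from the original snippet are omitted, and the StreamCopy class name is mine:

```java

public class StreamCopy {
    static final int BUFSIZE = 4096;

    // Write each chunk straight to the target stream. For a plain file copy,
    // FileOutputStream issues one write per call anyway, so BufferedOutputStream's
    // internal byte array only adds allocations (and GC pressure) on top.
    public static void copyStreams(InputStream is, OutputStream os) throws IOException {
        byte[] data = new byte[BUFSIZE];
        int count;
        while ((count =, 0, BUFSIZE)) != -1) {
            os.write(data, 0, count);
        }
    }
}
```

Because the method only relies on the OutputStream contract, you can pass the FileOutputStream to it directly.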

Out of curiosity, I decided to try bypassing the GC entirely by using the libzip C library. It took less than 0.2 seconds to extract the assets. Another option is to use the AAssetManager class from the NDK, but I haven’t tried it yet. Anyway, it seems that I/O processing is one of those areas where unmanaged code outperforms Java.

Native code profiling with JustTrace

The latest JustTrace version (Q1 2014) has some neat features: it is now possible to profile unmanaged applications. In this post I am going to show you how easy it is to profile native applications with JustTrace.

For the sake of simplicity I am going to profile the notepad.exe editor, as it is available on every Windows machine. First, we need to set up the symbol path folder so that JustTrace can correctly decode the native call stacks. This folder is where all required *.pdb files should be placed.


In most scenarios, we profile code we wrote ourselves from within Visual Studio. If your build generates *.pdb files, there is no need to set up the symbols folder. However, in order to analyze the call stacks collected from notepad.exe, we must download the debug symbols from the Microsoft Symbol Server. The easiest way to obtain the debug symbol files is to use symchk.exe, which comes with the Microsoft Debugging Tools for Windows. Here is how we can download the notepad.pdb file.

symchk.exe c:\Windows\System32\notepad.exe /s SRV*c:\symbols*

[Note that in order to decode full call stacks, you may need to download *.pdb files for other dynamic libraries as well, such as user32.dll and kernelbase.dll. With symchk.exe you can download debug symbol files for more than one module at once. For more details, check the Using SymChk page.]

Now we are ready to profile the notepad.exe editor. Navigate to the New Profiling Session->Native Executable menu, enter the path to notepad.exe and click the Run button. Once notepad.exe has started, open some large file and use the timeline UI control to select the time interval of interest.


In closing, I would say that JustTrace has become a versatile profiling tool that is no longer constrained to the .NET world. There are plenty of unmanaged applications written in C or C++, and JustTrace can help improve their performance. You should give it a try.

Notes on Asynchronous I/O in .NET

Yesterday I worked on a pet project and needed to read some large files in an asynchronous manner. The last time I had to solve a similar problem was back in the .NET 2.0 days, so I was familiar with the FileStream constructors that take a bool isAsync parameter and with the BeginRead/EndRead methods. This time, however, I decided to use the newer Task-based API.

After working for a while I noticed a lot of repetition, and my code was becoming quite verbose. I googled for an asynchronous I/O library and picked a popular one. Indeed, the library hid the unwanted verbosity and the code became nice and tidy. After I finished the feature I was working on, I decided to run some performance tests. Oops, the performance was not good, and the bottleneck seemed to be in the file I/O. I started JustDecompile and quickly found out that the library was using the FileStream.ReadAsync method. So far, so good.

Without much thinking, I ran my app under WinDbg and set a breakpoint at the kernel32!ReadFile function. Once the breakpoint was hit, I examined the stack:

0:007> ddp esp
0577f074  720fcf8b c6d04d8b
0577f078  000001fc
0577f07c  03e85328 05040302
0577f080  00100000
0577f084  0577f0f8 00000000
0577f088  00000000

Hmm, a few things were wrong here. The breakpoint was hit on thread #7, and the OVERLAPPED argument was NULL. It seemed that ReadAsync was executing on a new thread and that the read operation was synchronous. After some poking with JustDecompile I found the reason: the FileStream object was created via the FileStream(string path, FileMode mode) constructor, which sets useAsync to false.
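To make the difference between the two constructors concrete, here is a small sketch. The ReadBoth helper, its name, and the temp-file handling are mine, purely for illustration:

```csharp
using System;
using System.IO;

class FileStreamAsyncDemo
    // Reads from the file through both constructor variants. The two-argument
    // constructor leaves useAsync false, so ReadAsync degrades to a blocking
    // ReadFile on a worker thread; useAsync: true opens the handle for
    // overlapped I/O, so ReadAsync can issue a true asynchronous ReadFile
    // (on Windows). Returns the total bytes read, mainly to have something
    // to check.
    public static int ReadBoth(string path)
        var buffer = new byte[4096];
        int total = 0;

        // Synchronous under the hood, despite the ReadAsync call.
        using (var fs = new FileStream(path, FileMode.Open))
            total += fs.ReadAsync(buffer, 0, buffer.Length).Result;

        // Truly asynchronous read on Windows.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096, useAsync: true))
            total += fs.ReadAsync(buffer, 0, buffer.Length).Result;

        return total;
```

Both calls look identical at the ReadAsync call site; only the constructor choice decides which execution path you get.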

I created a small isolated project to test ReadAsync further, this time using a constructor that explicitly sets useAsync to true. I set the breakpoint and examined the stack:

0:000> ddp esp
00ffed54  726c0e24 c6d44d8b
00ffed58  000001f4
00ffed5c  03da5328 84838281
00ffed60  00100000
00ffed64  00000000
00ffed68  02e01e34 00000000
00ffed6c  e1648b9e

This time the read operation is started on the main thread and an OVERLAPPED argument is passed to the ReadFile function.

0:000> dd 02e01e34 
02e01e34  00000000 00000000 04c912f4 00000000
02e01e44  00000000 00000000 72158e40 02da30fc
02e01e54  02da318c 00000000 00000000 00000000
0:000> ? 04c912f4 
Evaluate expression: 80286452 = 04c912f4

A double check with Sysinternals’ Process Monitor confirmed it.


I emailed the author of the library, and he was kind enough to respond immediately. At first he pointed me to an MSDN page that demonstrates “correct” FileStream usage, but after a short discussion he acknowledged the unexpected behavior.


I don’t think this is a correct pattern and I quickly found at least two other MSDN resources that use explicit useAsync argument for the FileStream constructor:

In closing, I would say that simply using the ReadAsync API doesn’t guarantee that the actual read operation will be executed asynchronously. You should be careful which FileStream constructor you use; otherwise you could end up with a new thread that executes the I/O operation synchronously.

Why do we need profiling tools?

Every project is defined by its requirements, which can be functional and non-functional. Most of the time developers focus on the functional requirements only. This is how it is now, and it probably won’t change much in the near future; we might even say that developers are obsessed with functional requirements. As a matter of fact, a few decades ago software engineers thought that the IDE of the future would look something like this:

This is quite different from today’s Visual Studio or Eclipse. The reason is not that it is technically impossible. On the contrary, it is technically possible, which was one of the reasons for the great enthusiasm of software engineers back then. The reason it didn’t happen is simple: people are not that good at writing specifications. Today no one expects to build large software from a single, huge, monolithic specification. Instead, we practice iterative processes, each time implementing a small part of the specification, and the specification evolves during the project.

While we still struggle with tools for verifying functional requirements, we have made big progress with tools for verifying non-functional requirements. We usually call such tools profilers. Today we have a lot of useful profiling tools that analyze software for performance and memory issues. Using a profiler as part of the development process can be a big cost-saver. Profilers liberate software engineers from the boring task of watching for performance issues all the time; instead, developers can stay focused on the implementation and verification of the functional requirements.

Take for example the following code:

FileStream fs = …
using (var reader = new StreamReader(fs, ...))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        string line = reader.ReadLine();

        // process the line
    }
}
This is a straightforward and simple piece of code with an explicit intention: process a text file one line at a time. The performance issue is that the Length property calls the native GetFileSize(…) function on every iteration, and this is an expensive operation. So if you read a file with 1,000,000 lines, GetFileSize(…) will be called 1,000,000 times.
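A cheap way around this is to let ReadLine itself signal the end of the stream, so Length is never consulted. A minimal sketch; the CountLines helper and its name are mine, not part of the original sample:

```csharp
using System;
using System.IO;

class LineProcessing
    // ReadLine returns null at end of stream, so there is no need to compare
    // Position against Length (and thus no repeated native GetFileSize calls).
    public static int CountLines(TextReader reader)
        int count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
            count++; // process the line here

        return count;
```

Because the loop only needs the TextReader contract, the same code works for a StreamReader over a FileStream or a StringReader in a test.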

Let’s have a look at another piece of code. Although it looks similar, it has quite different runtime behavior.

string[] lines = …
int i = 0;
while (i < lines.Length)
{
    string line = lines[i++];

    // process the line
}
In both examples the pattern is the same, and this is exactly what we want: well-known, predictable patterns. The difference is that an array’s Length is a cheap property, while FileStream’s Length is not.

Take a look at the following 2 minutes video to see how easy it is to spot and fix such issues (you can find the sample solution at the end of the post).

After all, this is why we use such tools: they work for us. It is much easier to fix performance and memory issues in the implementation/construction phase than in the test/validation phase of a project iteration.

Being proactive and using profiling tools throughout the whole application lifecycle (ALM) will help you build better products and have happy customers.