Writing Efficient Software
With the advent of large physical memories and large disks, the trend
continues to be to care little about the memory/space requirements of
programs and their data, and to "just code". While this approach is
easily seen in commercial software products, the problems of program
bloat and inefficiencies also crop up in many smaller projects.
Simulators built today to handle 1 million requests per day without
difficulty suddenly break down when 10 million requests per day
becomes the norm. And while allocating more disk or RAM to the
program may give it a new life for a short time, these types of
solutions can only go so far.
In this document we look at how and why programmers should be
concerned with making good use of both in-core and on-disk memory,
with a small emphasis on web-related simulators. While the
programming techniques are expressed in C and C++, the basic
principles should also apply to other languages like Perl.
Background
It is easy to get caught up in the "heat of the moment" when designing
a new piece of code. With a reasonably sized set of input data, and
shiny new hardware, it's easy to say things like:
We'll use doubly linked lists, and have a
dynamic binary tree, and the data will be read in from a flat ASCII
file. We've only got 3 weeks of data to handle, and that is only 10
million entries. The machine has 128MB of RAM, and we have lots of
disk.
Unfortunately, this sort of thinking leads directly into problems. Is
the doubly linked list necessary? Will the dynamic binary tree remain
close to balanced? Will it make good use of the space? Will the
traces ever get heavier, to the point that the simulator will no
longer have enough RAM to run efficiently? Is an ASCII format the
best format for this trace? When the simulator needs to handle 10
weeks of data, and 100 million entries, will it still cope?
These are the sorts of questions which should be asked when designing
programs, and these are the issues that are discussed in this document.
The prompting to write this document came from realizing that many of
the tricks, tips, and techniques used in our web server and web proxy
simulators to improve their performance were not documented anywhere.
These simulators are trace-driven, and the size of the trace files
(and the functionality of the simulator) only seems to grow over time.
Thus, much of the information here is based on the things we had to do
to make our simulators more efficient, and capable of handling large
amounts of trace data (e.g. 1GB compressed) as quickly as possible.
Since the raw data files (web server or web proxy traces) were
typically too large (multiple GB), we also needed to find efficient
ways of converting that data down into something that was more
manageable in size. So while at one level we analysed web server and
web proxy performance as part of our research, at another level we
were also analysing the performance of our simulator and associated
tools (albeit less rigorously), in an attempt to make everything
faster and better.
Choice of languages
Most of the tips and tricks presented here are applicable to C and
C++, but when writing simulators it may be useful to consider Perl.
Perl, which is typically interpreted, is a very powerful scripting
language. What takes only a few lines in Perl can require pages of C
or C++ code - in particular, text manipulation is very easy to handle
in Perl. Thus, Perl makes a very good pre-processor for trace files,
and can be used, for example, to strip out unwanted or un-needed
information from ASCII traces. A huge advantage of Perl is the fact
that it is interpreted - this allows easy modification and execution
of the scripts without tedious re-compiles.
Object-oriented languages, if used properly, can have large impacts on
the scalability and re-usability of simulator code. Cache objects,
trace objects, and even simulation objects make the job of handling
multiple types of caches and multiple types of trace input much
easier. With a little bit of inheritance, the simulator only needs to
be concerned about querying a cache as to its contents, rather than
caring about whether the cache is LRU, LFU, or whatever. From the
input file perspective, each input format is hidden inside of an
object, again such that the simulator only needs to ask for the 'next
trace record', which can be done in a trace-independent fashion.
Out of Memory!
Nearly all modern operating systems have a concept of virtual memory,
allowing programs to use more RAM (at least virtually) than is
available in the system. In many cases, however, per-user resource
limits may be well under the maximum limits supported by the operating
system. So while one should always advocate making programs run
efficiently, it may still be necessary to look at what your 'limits'
are. On many systems, the command
limit
will display something that looks like:
cputime unlimited
filesize unlimited
datasize 131072 kbytes
stacksize 2048 kbytes
coredumpsize unlimited
memoryuse 58664 kbytes
memorylocked 19556 kbytes
maxproc 1044
openfiles 64
Now you might think this should be enough - after all, it says that
programs and data can occupy 128MB and that the program stack can be 2MB.
However for large simulations, and on large-memory machines, such a
restriction may be unwarranted. This 'problem' can be solved with:
limit datasize unlimited
limit stacksize unlimited
which now gives:
cputime unlimited
filesize unlimited
datasize 1048576 kbytes
stacksize 32768 kbytes
coredumpsize unlimited
memoryuse 58668 kbytes
memorylocked 19556 kbytes
maxproc 1044
openfiles 64
As is seen here, the program now has 1GB of virtual memory to work in,
and a stack size of 32MB. On a machine with 256MB of real RAM, this
means the simulator can now use the other 128MB that would have been
otherwise unavailable.
Note that increasing the stack should not really be necessary unless
your program has a lot of recursion going on. Additional discussion
about function calls, and local variables, both of which affect the
stack memory usage, is deferred until later.
Note as well that some administrators will set the cputime value above
to be some fixed value - e.g. 2 hours. What this does is restrict the
amount of CPU time that any given program gets to that fixed value.
Once that time has elapsed, the system terminates the program. While
this can be very useful for taking care of the problem of runaways
(programs that eat up CPU and memory resources, and do not do anything
useful, but whose owners may not know the programs are running), it
can be problematic for a simulation that may need overnight to run.
So a:
limit cputime unlimited
may be in order before running those time-consuming simulations.
Breaking the paging barrier
While virtual memory allows us to write programs which execute with
memory sizes much larger than physical RAM, there is a speed price
which is paid every time part of the program or data set is "paged" in
or out. Any good book on Operating Systems will discuss paging in
more detail, but here we refer to paging as simply moving a chunk of
memory to/from some sort of "secondary storage" (be it a local swap
disk, a network-mounted swap device, etc.) For the sake of this
discussion, the less paging that happens with your program, the
better. Programs which do not page will typically run noticeably
faster than those which do. If your program is paging when it runs,
then there may be a number of solutions:
- Buy more memory
- Kick other programs/people off of the machine
- Be more efficient
In many cases, the latter is the only real option, and the former two
"options" only serve to delay the problem (and/or irritate colleagues!).
Data allocation: Static vs. Dynamic
There are some interesting trade-offs between allocating all the
required data structures at once at the beginning, vs. allocating each
one on-the-fly. Allocating all at the beginning can be useful if you
know in advance how many you will need, and if the allocation does
not start your program paging from the very start! If the allocation at
the start does cause your program to page, and/or a large number of
the structures will not be used for most of the duration of the run,
then perhaps a different allocation scheme is in order.
Dynamic allocation done 'one-at-a-time' will help ensure that your
program never allocates more memory than it intends on using, but at
the expense of continually calling malloc/new. It is important to
note here that calling malloc/new frequently is a very
expensive operation. Although the efficiency of this operation has
improved in recent years, it is still not nearly as efficient as
pre-allocation of a large number of structures.
In a slightly different vein, if an algorithm looks like:
    while (!done) {
        if ((space = malloc(sizeof(struct foo))) == NULL) { ... }
        initialize(space);
        ...
        free(space);
    }
you will likely experience a *huge* speed increase by changing that
to:
    if ((space = malloc(sizeof(struct foo))) == NULL) { ... }
    while (!done) {
        initialize(space);
        ...
    }
    free(space);
A nice tradeoff to the "do everything at the beginning" vs. "one at a
time" is to allocate space in "banks". Each "bank" is an array of
some number of structures. Each time a new structure is needed, a
pointer is returned to the "next" element in the array. Once the
array is used up, a new array is allocated, and the process continues.
If memory is constantly being allocated and freed, the bank algorithm
can be used even more effectively. When a structure is freed, a
pointer to that structure can simply be put at the head of a
linked-list (called the "free list"). When a new structure is
requested, the allocation routine first looks to see if there are any
freed structures already on the free list. If there are, one is
returned from the free list. If not, a new one is returned from the
array.
By tuning the size of the banks (i.e. the number of structures
allocated to each array), it is possible to require only 3 or 4 bank
creations during the execution of a program, while at the same time
having very fast allocation/freeing of the structures (O(1) in both
cases, ignoring the overhead for allocating a new bank). By
allocating banks at a time, efficiencies are achieved by only calling
malloc/new as necessary, space is easily (and quickly)
allocated/freed, and yet the space requirements are only slightly
larger than would be required at any given point (i.e. it allocates at
most a bank worth of space more than would be required at that point
by an on-the-fly allocation scheme).
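The bank-plus-free-list scheme can be sketched roughly as follows. The names (foo_alloc, foo_free) and the bank size are our inventions, and a real version would also remember the banks themselves so they can be released at the end of the run:

```c
#include <stdlib.h>

#define BANK_SIZE 1024               /* tunable: structures per bank */

struct foo {
    int key;
    int value;
    struct foo *next;                /* reused to chain the free list */
};

static struct foo *bank      = NULL; /* current bank (array of foo) */
static int         bank_used = 0;    /* slots handed out from it    */
static struct foo *free_list = NULL; /* previously freed structures */

struct foo *foo_alloc(void)
{
    struct foo *p;

    if (free_list != NULL) {         /* O(1): reuse a freed structure */
        p = free_list;
        free_list = p->next;
        return p;
    }
    if (bank == NULL || bank_used == BANK_SIZE) {  /* need a new bank */
        bank = malloc(BANK_SIZE * sizeof(struct foo));
        if (bank == NULL)
            return NULL;
        bank_used = 0;
    }
    return &bank[bank_used++];       /* O(1): next slot in the array */
}

void foo_free(struct foo *p)         /* O(1): push onto the free list */
{
    p->next = free_list;
    free_list = p;
}
```

Note that reusing the structure's existing 'next' field to chain the free list means the free list costs no extra space at all.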
If the overhead of maintaining the "bank" structure seems too great,
it may be useful to simply maintain the free list, building it from
previously malloced structures that have been freed. This has the
benefit of not pre-allocating nearly as many structures, yet allowing
fast access to space for which a malloc has already been performed.
Also note that repeated mallocing and freeing can sometimes cause the
memory to become fragmented. While the memory allocators may try to
prevent this, programs which use malloc and free as infrequently as
possible will typically experience the least amount of memory
fragmentation. So in instances where you have allocated 2MB for a
data structure, and you know it'll be needed in a little while, it may
be useful to not free that space, and to then simply re-use it the
next time the function is called.
Where is your data stored: Memory Heap vs. Stack
Although the need for and use of global and local variables are well
understood, where space for a variable is created may be less well
understood. Global variables have space allocated for them on the
general memory heap. Local variables, that is, those variables
declared within a function, have their space allocated on the run-time
stack. This means, for example, that if you declare a local variable
that is an array of 2000 structures, then that entire array will be
stored on the stack. Now while a single use like this might be
feasible, recursively calling this function is going to require using
a large amount of stack space. Since the stack size is typically
small (e.g. 2MB), it is easy to run out of stack with a heavily
recursive program. A better approach may be to malloc/new the space
upon each invocation of the function. Malloc/new'ed space is allocated
on the heap. Note that local variables within 'main()' are subject to
these same restrictions. Thus if you find your program crashing, and
it has a lot of recursion, check your stack limit, and check the
amount of data you are storing in local variables.
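As an illustration (the structure and sizes here are invented), the following contrasts a large local array, which lives on the stack, with a heap allocation of the same space:

```c
#include <stdlib.h>

struct rec { int data[500]; };      /* 2000 bytes per structure */

/* Risky: 2000 structures (~4MB) land on the stack for EACH call, so
 * even one recursive call can blow a 2MB stack limit. */
void process_on_stack(int depth)
{
    struct rec table[2000];
    table[0].data[0] = depth;       /* ... use table ... */
    if (depth > 0)
        process_on_stack(depth - 1);
}

/* Safer: the same space comes from the heap; only a pointer and
 * 'depth' occupy the stack for each call.  Returns 0 on success. */
int process_on_heap(int depth)
{
    struct rec *table = malloc(2000 * sizeof(struct rec));
    if (table == NULL)
        return -1;
    table[0].data[0] = depth;       /* ... use table ... */
    free(table);
    return depth > 0 ? process_on_heap(depth - 1) : 0;
}
```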
Parameter Passing
A common problem related to the use of the run-time stack is the way
that data is passed from one function to another. For example, it is
*much* less expensive (both in time and space) to pass pointers to
large data structures, rather than passing the entire data structure
by value. Passing the structures by value requires a lot more stack
space, which, as already mentioned, is typically in short supply.
If the structures being passed down are small (e.g. only a single
integer or a couple of integers), then passing down by value probably
makes sense. But anything larger than that should be typically passed
down via a pointer (or other reference).
Memory Leaks
If your program is using large amounts of memory, you may want to
verify that you do not have a memory leak. A simple way to do this (in
C) is to simply record (via a couple of global variables) the number
of mallocs performed, and the size of data allocated, and to compare
that at the end of the run with the number of free's performed, and
the amount of data freed. If the instrumentation is done correctly,
and you can account for all of the memory used, then chances are good
that you do not have any memory problems. If, however, you seem to
have a much higher number of mallocs than you have free's, it may be
worth looking into why that is the case.
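One possible way to do this instrumentation in C is with a pair of wrapper functions (the names here are ours; the program's malloc/free calls would be replaced with these). The size is stashed just before the returned block so the free wrapper can find it; a production version would also respect maximum alignment:

```c
#include <stdio.h>
#include <stdlib.h>

static long n_mallocs = 0, n_frees = 0;
static long bytes_outstanding = 0;

void *counted_malloc(size_t size)
{
    /* store the size just before the block handed back to the caller */
    size_t *p = malloc(size + sizeof(size_t));
    if (p == NULL)
        return NULL;
    *p = size;
    n_mallocs++;
    bytes_outstanding += (long)size;
    return p + 1;
}

void counted_free(void *ptr)
{
    size_t *p = (size_t *)ptr - 1;   /* recover the stashed size */
    n_frees++;
    bytes_outstanding -= (long)*p;
    free(p);
}

void report_leaks(void)              /* call at the end of the run */
{
    printf("mallocs=%ld frees=%ld bytes outstanding=%ld\n",
           n_mallocs, n_frees, bytes_outstanding);
}
```

If the final report shows bytes outstanding near zero, chances are good there is no leak; a large imbalance points at allocations that were never freed.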
Trees: Size vs. Balance
The average binary tree is a well known and well studied data
structure. Known for its O(log n) insertion and retrieval times, the
tree can be quite adept at handling data. There are a number of
issues to keep in mind, however, when designing how the tree will be
constructed.
Dynamic vs. Static Trees
If the size of the tree is known in advance, then it may be useful to
pre-allocate the entire tree in a linear array, and do away with
pointers entirely! There are a number of large gains here:
- A savings of two pointers per node
- A guarantee of O(log n) time for insertions, deletions, and lookups
- A guarantee of balance
- Straightforward algorithms
If the size of the tree is not known in advance, it still may be
possible to use an array. Once the array fills up, a new (larger)
array is created (could be 2x the former size, or some fixed size
larger). The old array contents are then copied into the new array,
and the old array is removed, and the program then uses the new
array. (Note that if you have pointers to any of the elements in the
old array, this method requires that those pointers be updated
appropriately!!! This may not be a trivial undertaking!)
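A minimal sketch of the grow-by-doubling approach (names are ours). Note that by storing array indices in the nodes instead of pointers, the pointer-update problem mentioned above disappears entirely, since indices remain valid after the array moves:

```c
#include <stdlib.h>

struct node {
    int key;
    int left, right;    /* indices into the array, not pointers: they
                           stay valid when the array is reallocated */
};

static struct node *nodes = NULL;
static int capacity = 0, used = 0;

/* Returns the index of a fresh node, growing the array (2x) as
 * needed, or -1 on allocation failure. */
int node_new(int key)
{
    if (used == capacity) {
        int newcap = capacity ? capacity * 2 : 16;
        struct node *tmp = realloc(nodes, newcap * sizeof(struct node));
        if (tmp == NULL)
            return -1;
        nodes = tmp;            /* old contents were copied for us */
        capacity = newcap;
    }
    nodes[used].key = key;
    nodes[used].left = nodes[used].right = -1;
    return used++;
}
```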
Otherwise a fully dynamic tree may be required. The dynamic tree
typically has the advantage of only being as large as is necessary to
hold all of the current data, but has large disadvantages if the tree
is unbalanced (not to mention the difficult balancing algorithms that
may need to be used if balance becomes a factor).
Singly Linked Lists
While traversing a singly linked list can take O(n) time to find a
particular item, singly linked lists can be a very efficient way of
dealing with queues or stack structures. If you can always delete
from the front, and know that insertions will always be at the
front or at the back of the list, then both insertions and deletions
can be performed in O(1) time! Insertions at the tail may require
keeping a pointer to the tail (so you do not have to traverse through
the list to find the end!), but that is a small price to pay for O(1)
inserts.
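A minimal singly linked queue along those lines, with a tail pointer for O(1) inserts at the back (the names are our own):

```c
#include <stdlib.h>

struct item {
    int value;
    struct item *next;
};

static struct item *head = NULL, *tail = NULL;

/* O(1) insert at the back, thanks to the tail pointer */
int enqueue(int value)
{
    struct item *p = malloc(sizeof(struct item));
    if (p == NULL)
        return -1;
    p->value = value;
    p->next = NULL;
    if (tail)
        tail->next = p;
    else
        head = p;               /* queue was empty */
    tail = p;
    return 0;
}

/* O(1) delete from the front; returns -1 if the queue is empty */
int dequeue(int *value)
{
    struct item *p = head;
    if (p == NULL)
        return -1;
    *value = p->value;
    head = p->next;
    if (head == NULL)
        tail = NULL;
    free(p);
    return 0;
}
```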
Do not be tempted to make a singly linked list doubly linked just for
the sake of having a doubly linked list. If you are forced to
traverse the list from beginning to end to search for an item to
remove, and think that having reverse pointers will make removing that
item easier, you may be adding more overhead than you really need. If
you are having to traverse the list searching anyways, then a "chasing
pointer" (i.e. pointer to the previous item) may be just as effective.
Doubly Linked Lists
Doubly linked lists can be very efficient if there is a need to
traverse both directions in a list. The most useful case is when you
are handed a pointer to the item, and need to find the items on either
side (e.g. for a deletion). There are cases, however, where a doubly
linked list has been used where the "reverse pointer" was never used!
This is an ideal opportunity to save a pointer's worth of space in
each structure on the list: make the list singly linked.
Heaps
Never underestimate the power of the heap for holding data that must
be accessed in a sorted fashion. Heaps are especially useful for
event-based simulations, where an event heap can be used to store the
set of pending events. If done as an array, insertions are guaranteed
to be O(log n), queries for the next event are O(1), and deletions are
O(log n).
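An array-based min-heap keyed on event time might look like the following sketch (fixed-size for brevity; a production version would grow the array as needed, and event_pop assumes the heap is non-empty):

```c
#define MAX_EVENTS 1024

static double heap[MAX_EVENTS + 1];   /* heap[1..n]; index 0 unused */
static int n_events = 0;

int event_insert(double time)         /* O(log n) */
{
    int i;
    if (n_events == MAX_EVENTS)
        return -1;
    i = ++n_events;
    while (i > 1 && heap[i / 2] > time) {   /* sift up */
        heap[i] = heap[i / 2];
        i /= 2;
    }
    heap[i] = time;
    return 0;
}

double event_next(void)    /* O(1): smallest time is at the root */
{
    return heap[1];
}

double event_pop(void)     /* O(log n): remove and return next event */
{
    double min = heap[1], last = heap[n_events--];
    int i = 1, child;
    while ((child = 2 * i) <= n_events) {   /* sift down */
        if (child < n_events && heap[child + 1] < heap[child])
            child++;                        /* pick smaller child */
        if (last <= heap[child])
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = last;
    return min;
}
```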
Hashing
It's well known that a hash-based search can often do much better than
a linear or binary search, but it's important to stress this fact
again. If hashing turns an O(n log n) algorithm into an O(n + e) one
(where 'e' is a small constant), that improvement by a factor of
log n can really add up: for n = 10,000,000 elements, log_2 n is
roughly 23.
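A minimal chained hash table for integer keys, as a sketch; the table size and the multiplicative hash are our own choices, and any reasonable mixing function would do:

```c
#include <stdlib.h>

#define TABLE_SIZE 65536    /* power of two, so '&' can replace '%' */

struct entry {
    unsigned key;
    int value;
    struct entry *next;     /* chain for colliding keys */
};

static struct entry *table[TABLE_SIZE];

/* Knuth-style multiplicative hash for integer keys */
static unsigned hash(unsigned key)
{
    return (key * 2654435761u) & (TABLE_SIZE - 1);
}

int ht_insert(unsigned key, int value)
{
    unsigned h = hash(key);
    struct entry *e = malloc(sizeof(struct entry));
    if (e == NULL)
        return -1;
    e->key = key;
    e->value = value;
    e->next = table[h];     /* push onto the chain */
    table[h] = e;
    return 0;
}

/* Average O(1) lookup, vs. O(log n) per binary search */
struct entry *ht_lookup(unsigned key)
{
    struct entry *e;
    for (e = table[hash(key)]; e != NULL; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}
```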
Endian Issues
The two most common endian formats are Big Endian and Little Endian.
The only comment made here is to ensure that where possible, the
traces you use should be of the same endianness as the CPU
architecture so that your program does not have to do on-the-fly
endian conversions as each trace record is read.
To point, or not to point, that is the question
Pointers are used to refer to some other data object or collection of
data objects. There are times, however, when as natural as it might
seem to use a pointer, the use of a pointer is actually prohibitively
expensive. Consider an array of n pointers, and a set of n data
elements, each pointer pointing to a single element. If the size of
the pointer is 4 bytes (8 bytes on a 64-bit machine) and the size of
the data element is 100 bytes, then the use of the pointer is probably
warranted. If, however, the data element is only 4 bytes, or even 8
bytes, then perhaps the data structure would be better done as a
single array of the data elements. The problem here is that for every
4 bytes of data, there are 4 bytes of pointer overhead - that's a 100%
overhead in the amount of space required to hold the data structure,
just due to the pointers alone. The big question here is of what
benefit are those pointers (yes, there may be instances where this is
what you need), and why should you continue to use them? If the data
structure pointed to is 8 bytes, then there is a 50% overhead for
pointers. Bad, but not the end of the world if it makes the program
simpler in some other respect. Anything larger than this likely means
that pointers are fine - but not always! If you have a 20 byte
structure, and you know there are only going to be at most 2000 of
them needed in the program, allocate all of them in a single array!
There is no need to put pointers in the array, and then dereference
things all of the time. Rather than having 2000 4-byte pointers in an
array, and 2000 20-byte data structures in the general storage space,
just allocate the 2000 20-byte data structures in the array and be
done with it. In this case, there is a 17% saving in space on this
data structure, just by eliminating the pointers. (Think of how much
real memory that is when the number of elements is 10 million, rather
than 2000. Hint: It's close to 40MB of space, which if that's enough
to mean that your program is no longer paging, will be a huge speed
increase as well).
If pointers are necessary, it may be possible to still save a little
space. For example, rather than having a structure that looks like:
    struct foo {
        int key;
        int value;
        struct foo *next;
    };
it may be possible to use something like:
    struct foo {
        int key1;
        int value1;
        int key2;
        int value2;
        int key3;
        int value3;
        int key4;
        int value4;
        struct foo *next;
    };
In this case, the cost may be the need to look at all the 'keys', but
the savings is in not having to maintain 3 additional pointers.
Again, this is a time-space tradeoff, and whether or not to use this
technique will depend on the application.
Grouping data
In looking at the following two structures, it would appear that both
contain the same data, and take up the same space:
    struct foo1 {
        char a;
        int b;
        char c;
        int d;
        char e;
        char f;
    };
and
    struct foo2 {
        int b;
        int d;
        char a;
        char c;
        char e;
        char f;
    };
Due to the way that data gets aligned by the compiler, however, the
above two structures are not the same. On a 32-bit int system,
foo1 will be 20 bytes, while foo2 will be a mere 12 bytes. In
general, you want to keep the variables arranged such that for a given
set of 32 bits (or 64 bits), that as many variables as possible are
packed into that space. In foo1, 'a' requires a full 32 bits, because
32 bits is the natural word size of the processor, and both 'a' and
'b' cannot be put into the same 32-bit space ('b' is already
32-bits!). In foo2, however, 'a', 'c', 'e', and 'f' can all be packed
into the same 32-bit location since a) they are all next to each
other, and b) each is only 8-bits (the compiler knows how to pack
multiple smaller variables into a single 32-bit word). Similar behaviour
can be obtained from other data structures which are smaller than the
natural word size of the processor.
What's in the data
As hinted to, it is important to study the data structures in use
(especially those that get used a lot!) to make sure that they
are efficient. For example: why use a 32-bit value when an 8-bit or
16-bit structure would suffice? If the data range is from 0 to 255,
and there are 10 million entries, using a 32-bit value to represent
the data results in 30MB of wasted space! Examine true/false flags in
the structures. If you have:
    char flag1;
    char flag2;
    char flag3;
    char flag4;
perhaps you can get away with:
    char flags;    /* bit 0 is flag1, bit 1 is flag2, etc. */
Note that while this is still wasting 4 bits of the char, it is better
than wasting 28 of 32 bits! Again, with large data sets, a simple
change like the above can save 30MB on 10 million entries.
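One common way to pack the four flags into a single char is with bit masks; the macro and function names below are our inventions:

```c
/* each flag occupies one bit of the packed 'flags' byte */
#define FLAG1 0x01
#define FLAG2 0x02
#define FLAG3 0x04
#define FLAG4 0x08

struct record {
    char flags;             /* replaces four separate char fields */
};

static int flag2_is_set(const struct record *r)
{
    return (r->flags & FLAG2) != 0;
}

static void set_flag2(struct record *r)
{
    r->flags |= FLAG2;      /* turn the bit on */
}

static void clear_flag2(struct record *r)
{
    r->flags &= ~FLAG2;     /* turn the bit off, leave others alone */
}
```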
File formats and File IO
A simple ASCII flat file is a wonderful format - it is typically
relatively straight forward, well understood, and easy to read with a
variety of standard tools. 'grep', 'awk', 'perl', and other
utilities can be used to pick out appropriate bits of data.
Unfortunately, the simplicity of the format has its drawbacks:
- The data file is typically larger than a binary format would be -
e.g. up to 4 ascii values may be represented as a single byte in a
binary trace.
- Larger trace files typically mean more time required to read in
the trace - especially with functions like fscanf().
By converting an ASCII trace into a binary trace, a smaller trace file
can be achieved, and that smaller file can be read into a simulator
much more quickly.
The performance improvement that we experienced when going from using
fscanf() to fread(), and from an ASCII trace file to a binary trace
file was nearly an order of magnitude. We went from 3 hours down to
18 minutes to read the trace - at which point computation, not IO,
became the bottleneck.
It should be noted that many architectures may require padding data
structures to 32-bit boundaries (i.e. rounded to the nearest 32-bit
quantity). Thus it may be useful to take:
    struct trace_item {
        int field1;
        int field2;
        char field3;
        char field4;
    };
and add padding (in this case, two bytes):
    struct trace_item {
        int field1;
        int field2;
        char field3;
        char field4;
        char pad1;
        char pad2;
    };
This will ensure that the structure is nicely aligned, and will avoid
problems when writing to and reading from a trace file.
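Reading such a padded binary trace with fread() might look like the following sketch; reading a large batch of records per fread() call would be faster still:

```c
#include <stdio.h>

struct trace_item {
    int field1;
    int field2;
    char field3;
    char field4;
    char pad1;      /* explicit padding to a 32-bit boundary */
    char pad2;
};

/* Read fixed-size binary records one at a time; returns the number
 * of records read, or -1 if the file could not be opened. */
long read_trace(const char *filename)
{
    struct trace_item item;
    long count = 0;
    FILE *fp = fopen(filename, "rb");
    if (fp == NULL)
        return -1;
    while (fread(&item, sizeof(item), 1, fp) == 1) {
        /* ... hand 'item' to the simulator here ... */
        count++;
    }
    fclose(fp);
    return count;
}
```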
Input sources
Where possible, read data from the local disk, rather than from a
networked disk. For modern systems, data can be accessed much more
quickly from the local disk, and others sharing your network segment
will not get upset. Most modern machines with 10Mbps Ethernet can
easily saturate such a network with simple NFS transfers. This is not
a good way to gain friends.
Compressed Input
If disk bandwidth seems to be a bottleneck, your input data may
benefit from being compressed. Compression of data on-disk has two
benefits:
- Smaller disk usage.
- Less data to read from disk, which means the total data can be
read in faster if IO is the bottleneck
If IO really is the bottleneck, then you will be reading less data
from the disk, and the unused CPU cycles can be used to uncompress the
data on-the-fly before giving your program the input it expects.
Reading data from a compressed file is relatively easy. After
determining that the file is in fact compressed (via filename
extension, or a flag, or whatever), your program can do a simple
'popen()' instead of a 'fopen()':
    if (compressed) {
        sprintf(pipe, "gzip -dc %s", filename);
        if ((infile = popen(pipe, "r")) == NULL) {
            cerr << "Unable to open pipe for input.\n";
            exit(1);
        }
    } else {
        if ((infile = fopen(filename, "r")) == NULL) {
            cerr << "Unable to open " << filename << " for input.\n";
            exit(1);
        }
    }
After this code is run, the file handle 'infile' will look and behave
like a normal sequential file - it can be read as normal by fread(),
and your program will not need to even know that the file is
compressed! (Just remember to close a popen()'ed handle with pclose()
rather than fclose().)
For a 1GB uncompressed trace that compresses to 256MB, the time
savings to just read the trace from disk can be phenomenal!
Program Status Outputs
If your program prints out volumes of information regarding its
current status, you may be running much slower than you need to be.
To test this, run your program with its output redirected to nowhere:
./myprogram >& /dev/null
and time how long it takes. Compare that time with how long it
usually takes. The difference is the time required to do all of that
IO! By removing the extra printf's or cout's, the running time of a
program can be dramatically improved, especially if the amount of data
being output is quite large. Some techniques for avoiding printing
volumes of data include:
- Print out status indicators less frequently. If you print out an
"I'm here" message every 1000 trace records, perhaps only print it out
every 10000 trace records
- Save the output until the end. Hold off on printing stuff out
until the program completes. Depending on how much data you really
need to get out of the program, and/or the type of data, this approach
may not be feasible. But if you are not going to use that running
average, why bother printing it?
- If you really must collect a lot of data, consider writing it to a
(compressed) file, instead of just to the standard output.
- If the standard output is the place you want the data to go, then
consider simply re-directing the output to a file. Chances are that
the file IO will be much faster than your terminal IO.
Natural data structures
While this document has talked about using the most efficient size for
a data structure (e.g. use only a single character if your data values
are in the range 0-255), there are cases where the most time
efficient data structure may not be the most space efficient
structure. Many CPUs have a "natural word size" that is their most
efficient data size. To use any other data size requires either
special handling of the data, or may require other special operations
in order for the data to be used properly. Consider, for example, the
partial code fragment:
    signed char x;
    for (x = -40; x < 100; x++)
On a CPU with a natural word size of 32 bits, it may actually be more
expensive to use an 8-bit char instead of a 32-bit int. There are a
number of reasons for this:
- When transferring x to/from memory, does the CPU have to do
conversions from one size to another? E.g. are there special bus
alignment requirements that require shifting these 8 bits in some
non-natural way?
- How does the CPU deal with the sign bit of the value -40? The
most significant bit is usually the sign bit, but with an 8-bit char
value, how much extra work will the CPU have to do to make sure that
it can detect whether the char is negative or positive?
Simple counters, therefore, should likely be done in the natural word
size of the CPU. If the variable is put into a CPU register, there is
no saving in storage, and there may be an execution penalty.
Incremental Methods and Partial Computations
The basic premise of incremental algorithms is to take a data
structure in a known state, apply some small change to it, and have
the data structure in another well known state, without having to
re-do all the computations that were necessary to get it into the old
state. For example, to maintain a running average of values being
read in, rather than maintaining each value in an array, and summing
(and counting) the array every 10 seconds, simply maintain a current
sum and a current count that is updated each time a new element is
added to the array. While this sounds obvious and trivial, the same
principles can be applied to things like computing standard deviations,
or keeping track of averages of items that are being added and
removed. The trick with incremental algorithms is to realize that the
differences between the two different "states" of the data structures
are very small, and that previous computations do not always need to
be re-done.
The basic idea with partial computations is to do as little work as
you have to at the current stage - especially if the work that you do
now will simply be re-done, overwritten, or ignored at a later stage.
In our web-server simulator, each request being served is given its
portion of the bandwidth for the given time that has elapsed. The
requests are sorted by size, in ascending order. If the per-request
bandwidth is larger than a request's remaining bytes to be served, then
that request finishes in the given time interval. What we also know,
however, is that if the per-request bandwidth is smaller than the
bytes remaining to be served for the current request, then none of
the requests in the remainder of the list will finish this time
either! So we use a scheme that stops handing out bandwidth
to any request that is not going to complete in this time interval.
For a request of 2048 bytes, there was no point in doing 8
subtractions of 256 bytes when a single subtraction of 2048 bytes
would accomplish the same result.
Scalability
Many programs get written on an ad hoc basis, where the data set is
limited, and the amount of functionality required is small. Some of
these programs, however, get forced to deal with data sets that are
much larger than the original sets, and this is where the issues of
scalability surface. When designing a simulator or other program, it
may be necessary to think about how the input data set may change.
What happens when the trace goes from 10 million requests in 3 weeks
to 20 million requests in one week - can the simulator still handle
the 60 million requests for the 3-week period? Is the simple sorting
algorithm being used going to be able to handle 100,000,000 items?
Will that linked list still be efficient if it becomes larger than 20
items? If a limit on the size of the input set can be known and
accounted for, then some of these questions become easy to handle. In
many other cases, however, the size of the input data set will not be
known in advance, and it is up to the program designer to make sure
that the program will scale as necessary.
Multiple simulations from a single trace
Depending on the amount of RAM required to run a single simulation, it
may be beneficial to perform more than one simulation in a single
run. For example, if 30 minutes are required to read in a trace, and
it only takes 10 minutes to do the simulation computation, then 75% of
the processing time is just IO overhead. If you can run 3 simulations
simultaneously, using the same trace as the single input source, the
entire run may take 60 minutes total (30 + 3 * 10), but now only 50%
of the running time is IO overhead. By adding additional simulation
runs, the cost of reading the trace can be amortized over all of the
simulation runs. Thus in this example, only 10 minutes of reading the
trace would be charged to each instance of the program. If 10
simulations were run simultaneously, then the run would only take 130
minutes, and the cost per simulation of reading the trace would be a
mere 3 minutes. The per-simulation run-time, therefore, is now an
average of 13 minutes, rather than the 40 minutes required to do the
single run!
Where is the bottleneck
Every program has a bottleneck, and as a program is made more
efficient, all that happens is the bottleneck gets moved around.
There are a few techniques that can be used to figure out where the
bottlenecks might lie. For example, how long does it take to read the
trace file without actually processing it? If it takes 25 minutes to
run the simulation, but you find that it takes 22 minutes just to read
the trace file, it's not going to help much to make the simulator
algorithms more efficient - the most you can gain from that is about 3
minutes. It will pay off greatly, however, if you can find a way to
improve the time it takes to read the trace file, either by switching
to a binary trace (if using an ASCII trace) or by using a compressed
trace file (see the Compressed Input section) if not doing that already.
If the bottleneck is not in reading the input (e.g. reading the input
only takes 9 minutes, but the simulation takes 60 minutes to
complete), then the next step is to determine where your program is
spending most of its time. One of the tools for doing this is called
'gprof'. 'gprof' is a profiling tool which displays such information
as the number of times a given function called another function, and how
much of the running time was spent executing a particular function.
By compiling your application with:
gcc -pg -g
(assuming a C program. Similar arguments are used for C++) you
add the code necessary to profile how your program runs. After
running the simulator, you will have an output file named something
like 'gmon.out'. Running gprof on this file will display statistics
on each of the functions, including the number of times they were
called, and the amount of CPU time spent in each. By looking for
those functions which consume the most CPU time (and/or get called the
most often) we can find good candidates for functions that should be
improved. For example, if you find that the function which does a
linear search down a linked list is consuming the most CPU time,
perhaps converting that structure to a hashed structure would provide
a huge win.
Conclusion
This document has outlined a number of ideas, tips, and tricks for
writing programs. Hopefully some of them will be applicable to the
programs you write. This document is by no means comprehensive, and
is based only on the personal experience of the author. Be aware that
these techniques may not apply to your particular situation, and that
your mileage with any of the above may vary.
Acknowledgements
Thanks to Martin Arlitt and Jayakumar Srinivasan for providing
feedback and comments on early versions of this document.
Page last modified: July 15, 1999.
Minor tweakage: May 30, 2001.
Updated email address to be anti-spam-friendly: October 29, 2003.
Send comments to oster@cs.usask.ca.