Writing Efficient Software
With the advent of large physical memories and large disks, the trend
continues to be to care little about the memory/space requirements of
programs and their data, and to "just code". While this approach is
easily seen in commercial software products, the problems of program
bloat and inefficiencies also crop up in many smaller projects.
Simulators built today to handle 1 million requests per day without
difficulty suddenly break down when 10 million requests per day
becomes the norm. And while allocating more disk or RAM to the
program may give it a new life for a short time, these types of
solutions can only go so far.
In this document we look at how and why programmers should be
concerned with making good use of both in-core and on-disk memory,
with a small emphasis on web-related simulators. While the
programming techniques are expressed in C and C++, the basic
principles should also apply to other languages like Perl.
Background
It is easy to get caught up in the "heat of the moment" when designing
a new piece of code. With a reasonably sized set of input data, and
shiny new hardware, it's easy to say things like:
We'll use doubly linked lists, and have a
dynamic binary tree, and the data will be read in from a flat ASCII
file. We've only got 3 weeks of data to handle, and that is only 10
million entries. The machine has 128MB of RAM, and we have lots of
disk.
Unfortunately, this sort of thinking leads directly into problems. Is
the doubly linked list necessary? Will the dynamic binary tree remain
close to balanced? Will it make good use of the space? Will the
traces ever get heavier, to the point that the simulator will no
longer have enough RAM to run efficiently? Is an ASCII format the
best format for this trace? When the simulator needs to handle 10
weeks of data, and 100 million entries, will it still cope?
These are the sorts of questions which should be asked when designing
programs, and these are the issues that are discussed in this document.
The prompting to write this document came from realizing that many of
the tricks, tips, and techniques used in our web server and web proxy
simulators to improve their performance were not documented anywhere.
These simulators are trace-driven, and the size of the trace files
(and the functionality of the simulator) only seems to grow over time.
Thus, much of the information here is based on the things we had to do
to make our simulators more efficient, and capable of handling large
amounts of trace data (e.g. 1GB compressed) as quickly as possible.
Since the raw data files (web server or web proxy traces) were
typically too large (multiple GB), we also needed to find efficient
ways of converting that data down into something that was more
manageable in size. So while at one level we analysed web server and
web proxy performance as part of our research, at another level we
were also analysing the performance of our simulator and associated
tools (albeit less rigorously), in an attempt to make everything
faster and better.
Choice of languages
Most of the tips and tricks presented here are applicable to C and
C++, but when writing simulators it may be useful to consider Perl.
Perl, which is typically interpreted, is a very powerful scripting
language. What takes only a few lines in Perl can require pages of C
or C++ code - in particular, text manipulation is very easy to handle
in Perl. Thus, Perl makes a very good pre-processor for trace files,
and can be used, for example, to strip out unwanted or un-needed
information from ASCII traces. A huge advantage of Perl is the fact
that it is interpreted - this allows easy modification and execution
of the scripts without tedious re-compiles.
Object-oriented languages, if used properly, can have large impacts on
the scalability and re-usability of simulator code. Cache objects,
trace objects, and even simulation objects make the job of handling
multiple types of caches and multiple types of trace input much
easier. With a little bit of inheritance, the simulator only needs to
be concerned about querying a cache as to its contents, rather than
caring about whether the cache is LRU, LFU, or whatever. From the
input file perspective, each input format is hidden inside of an
object, again such that the simulator only needs to ask for the 'next
trace record', which can be done in a trace-independent fashion.
Out of Memory!
Nearly all modern operating systems have a concept of virtual memory,
allowing programs to use more RAM (at least virtually) than is
available in the system. In many cases, however, per-user resource
limits may be well under the maximum limits supported by the operating
system. So while one should always advocate making programs run
efficiently, it may still be necessary to look at what your 'limits'
are. On many systems, the command
limit
will display something that looks like:
cputime unlimited
filesize unlimited
datasize 131072 kbytes
stacksize 2048 kbytes
coredumpsize unlimited
memoryuse 58664 kbytes
memorylocked 19556 kbytes
maxproc 1044
openfiles 64
Now you might think this should be enough - after all, it says that
programs and data can occupy 128MB and that the program stack can be 2MB.
However for large simulations, and on large-memory machines, such a
restriction may be unwarranted. This 'problem' can be solved with:
limit datasize unlimited
limit stacksize unlimited
which now gives:
cputime unlimited
filesize unlimited
datasize 1048576 kbytes
stacksize 32768 kbytes
coredumpsize unlimited
memoryuse 58668 kbytes
memorylocked 19556 kbytes
maxproc 1044
openfiles 64
As is seen here, the program now has 1GB of virtual memory to work in,
and a stack size of 32MB. On a machine with 256MB of real RAM, this
means the simulator can now use the other 128MB that would have been
otherwise unavailable.
Note that increasing the stack should not really be necessary unless
your program has a lot of recursion going on. Additional discussion
about function calls, and local variables, both of which affect the
stack memory usage, is deferred until later.
Note as well that some administrators will set the cputime value above
to be some fixed value - e.g. 2 hours. What this does is restrict the
amount of CPU time that any given program gets to that fixed value.
Once that time has elapsed, the system terminates the program. While
this can be very useful for taking care of the problem of runaways
(programs that eat up CPU and memory resources, and do not do anything
useful, but whose owners may not know the programs are running), it
can be problematic for a simulation that may need overnight to run.
So a:
limit cputime unlimited
may be in order before running those time-consuming simulations.
Breaking the paging barrier
While virtual memory allows us to write programs which execute with
memory sizes much larger than physical RAM, there is a speed price
which is paid every time part of the program or data set is "paged" in
or out. Any good book on Operating Systems will discuss paging in
more detail, but here we refer to paging as simply moving a chunk of
memory to/from some sort of "secondary storage" (be it a local swap
disk, a network-mounted swap device, etc.) For the sake of this
discussion, the less paging that happens with your program, the
better. Programs which do not page will typically run noticeably
faster than those which do. If your program is paging when it runs,
then there may be a number of solutions:
- Buy more memory
- Kick other programs/people off of the machine
- Be more efficient
In many cases, the latter is the only real option, and the former two
"options" only serve to delay the problem (and/or irritate colleagues!).
Data allocation: Static vs. Dynamic
There are some interesting trade-offs between allocating all the
required data structures at once at the beginning, vs. allocating each
one on-the-fly. Allocating all at the beginning can be useful if you
know in advance how many you will need, and if the allocation does
not start your program paging from the very start! If the allocation at
the start does cause your program to page, and/or a large number of
the structures will not be used for most of the duration of the run,
then perhaps a different allocation scheme is in order.
Dynamic allocation done 'one-at-a-time' will help ensure that your
program never allocates more memory than it intends on using, but at
the expense of continually calling malloc/new. It is important to
note here that calling malloc/new frequently is a very
expensive operation. Although the efficiency of this operation has
improved in recent years, it is still not nearly as efficient as
pre-allocation of a large number of structures.
In a slightly different vein, if an algorithm looks like:
    while (!done) {
        if ((space = malloc(sizeof(struct foo))) == NULL) { ... }
        initialize(space);
        ...
        free(space);
    }
you will likely experience a *huge* speed increase by changing that
to:
    if ((space = malloc(sizeof(struct foo))) == NULL) { ... }
    while (!done) {
        initialize(space);
        ...
    }
    free(space);
A nice tradeoff to the "do everything at the beginning" vs. "one at a
time" is to allocate space in "banks". Each "bank" is an array of
some number of structures. Each time a new structure is needed, a
pointer is returned to the "next" element in the array. Once the
array is used up, a new array is allocated, and the process continues.
If memory is constantly being allocated and freed, the bank algorithm
can be used even more effectively. When a structure is freed, a
pointer to that structure can simply be put at the head of a
linked-list (called the "free list"). When a new structure is
requested, the allocation routine first looks to see if there are any
freed structures already on the free list. If there are, one is
returned from the free list. If not, a new one is returned from the
array.
By tuning the size of the banks (i.e. the number of structures
allocated to each array), it is possible to require only 3 or 4 bank
creations during the execution of a program, while at the same time
having very fast allocation/freeing of the structures (O(1) in both
cases, ignoring the overhead for allocating a new bank). By
allocating banks at a time, efficiencies are achieved by only calling
malloc/new as necessary, space is easily (and quickly)
allocated/freed, and yet the space requirements are only slightly
larger than would be required at any given point (i.e. it allocates at
most a bank worth of space more than would be required at that point
by an on-the-fly allocation scheme).
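The bank-plus-free-list scheme can be sketched roughly as follows. The names (foo_alloc, foo_free) and the bank size are our inventions, and a real version would also remember the banks themselves so they can be released at the end of the run:

```c
#include <stdlib.h>

#define BANK_SIZE 1024               /* tunable: structures per bank */

struct foo {
    int key;
    int value;
    struct foo *next;                /* reused to chain the free list */
};

static struct foo *bank      = NULL; /* current bank (array of foo) */
static int         bank_used = 0;    /* slots handed out from it    */
static struct foo *free_list = NULL; /* previously freed structures */

struct foo *foo_alloc(void)
{
    struct foo *p;

    if (free_list != NULL) {         /* O(1): reuse a freed structure */
        p = free_list;
        free_list = p->next;
        return p;
    }
    if (bank == NULL || bank_used == BANK_SIZE) {  /* need a new bank */
        bank = malloc(BANK_SIZE * sizeof(struct foo));
        if (bank == NULL)
            return NULL;
        bank_used = 0;
    }
    return &bank[bank_used++];       /* O(1): next slot in the array */
}

void foo_free(struct foo *p)         /* O(1): push onto the free list */
{
    p->next = free_list;
    free_list = p;
}
```

Note that reusing the structure's existing 'next' field to chain the free list means the free list costs no extra space at all.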
If the overhead of maintaining the "bank" structure seems too great,
it may be useful to simply maintain the free list, building it from
previously malloced structures that have been freed. This has the
benefit of not pre-allocating nearly as many structures, yet allowing
fast access to space for which a malloc has already been performed.
Also note that repeated mallocing and freeing can sometimes cause the
memory to become fragmented. While the memory allocators may try to
prevent this, programs which use malloc and free as infrequently as
possible will typically experience the least amount of memory
fragmentation. So in instances where you have allocated 2MB for a
data structure, and you know it'll be needed in a little while, it may
be useful to not free that space, and to then simply re-use it the
next time the function is called.
Where is your data stored: Memory Heap vs. Stack
Although the need for and use of global and local variables are well
understood, where space for a variable is created may be less well
understood. Global variables have space allocated for them on the
general memory heap. Local variables, that is, those variables
declared within a function, have their space allocated on the run-time
stack. This means, for example, that if you declare a local variable
that is an array of 2000 structures, then that entire array will be
stored on the stack. Now while a single use like this might be
feasible, recursively calling this function is going to require using
a large amount of stack space. Since the stack size is typically
small (e.g. 2MB), it is easy to run out of stack with a heavily
recursive program. A better approach may be to malloc/new the space
upon each invocation of the function. Malloc/new'ed space is allocated
on the heap. Note that local variables within 'main()' are subject to
these same restrictions. Thus if you find your program crashing, and
it has a lot of recursion, check your stack limit, and check the
amount of data you are storing in local variables.
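As an illustration (the structure and sizes here are invented), the following contrasts a large local array, which lives on the stack, with a heap allocation of the same space:

```c
#include <stdlib.h>

struct rec { int data[500]; };      /* 2000 bytes per structure */

/* Risky: 2000 structures (~4MB) land on the stack for EACH call, so
 * even one recursive call can blow a 2MB stack limit. */
void process_on_stack(int depth)
{
    struct rec table[2000];
    table[0].data[0] = depth;       /* ... use table ... */
    if (depth > 0)
        process_on_stack(depth - 1);
}

/* Safer: the same space comes from the heap; only a pointer and
 * 'depth' occupy the stack for each call.  Returns 0 on success. */
int process_on_heap(int depth)
{
    struct rec *table = malloc(2000 * sizeof(struct rec));
    if (table == NULL)
        return -1;
    table[0].data[0] = depth;       /* ... use table ... */
    free(table);
    return depth > 0 ? process_on_heap(depth - 1) : 0;
}
```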
Parameter Passing
A common problem related to the use of the run-time stack is the way
that data is passed from one function to another. For example, it is
*much* less expensive (both in time and space) to pass pointers to
large data structures, rather than passing the entire data structure
by value. Passing the structures by value requires a lot more stack
space, which, as already mentioned, is typically in short supply.
If the structures being passed down are small (e.g. only a single
integer or a couple of integers), then passing down by value probably
makes sense. But anything larger than that should be typically passed
down via a pointer (or other reference).
Memory Leaks
If your program is using large amounts of memory, you may want to
verify that you do not have a memory leak. A simple way to do this (in
C) is to simply record (via a couple of global variables) the number
of mallocs performed, and the size of data allocated, and to compare
that at the end of the run with the number of free's performed, and
the amount of data freed. If the instrumentation is done correctly,
and you can account for all of the memory used, then chances are good
that you do not have any memory problems. If, however, you seem to
have a much higher number of mallocs than you have free's, it may be
worth looking into why that is the case.
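One possible way to do this instrumentation in C is with a pair of wrapper functions (the names here are ours; the program's malloc/free calls would be replaced with these). The size is stashed just before the returned block so the free wrapper can find it; a production version would also respect maximum alignment:

```c
#include <stdio.h>
#include <stdlib.h>

static long n_mallocs = 0, n_frees = 0;
static long bytes_outstanding = 0;

void *counted_malloc(size_t size)
{
    /* store the size just before the block handed back to the caller */
    size_t *p = malloc(size + sizeof(size_t));
    if (p == NULL)
        return NULL;
    *p = size;
    n_mallocs++;
    bytes_outstanding += (long)size;
    return p + 1;
}

void counted_free(void *ptr)
{
    size_t *p = (size_t *)ptr - 1;   /* recover the stashed size */
    n_frees++;
    bytes_outstanding -= (long)*p;
    free(p);
}

void report_leaks(void)              /* call at the end of the run */
{
    printf("mallocs=%ld frees=%ld bytes outstanding=%ld\n",
           n_mallocs, n_frees, bytes_outstanding);
}
```

If the final report shows bytes outstanding near zero, chances are good there is no leak; a large imbalance points at allocations that were never freed.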
Trees: Size vs. Balance
The average binary tree is a well known and well studied data
structure. Known for its O(log n) insertion and retrieval times, the
tree can be quite adept at handling data. There are a number of
issues to keep in mind, however, when designing how the tree will be
constructed.
Dynamic vs. Static Trees
If the size of the tree is known in advance, then it may be useful to
pre-allocate the entire tree in a linear array, and do away with
pointers entirely! There are a number of large gains here:
- A savings of two pointers per node
- A guarantee of O(log n) time for insertions, deletions, and lookups
- A guarantee of balance
- Straightforward algorithms
If the size of the tree is not known in advance, it still may be
possible to use an array. Once the array fills up, a new (larger)
array is created (could be 2x the former size, or some fixed size
larger). The old array contents are then copied into the new array,
and the old array is removed, and the program then uses the new
array. (Note that if you have pointers to any of the elements in the
old array, this method requires that those pointers be updated
appropriately!!! This may not be a trivial undertaking!)
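A minimal sketch of the grow-by-doubling approach (names are ours). Note that by storing array indices in the nodes instead of pointers, the pointer-update problem mentioned above disappears entirely, since indices remain valid after the array moves:

```c
#include <stdlib.h>

struct node {
    int key;
    int left, right;    /* indices into the array, not pointers: they
                           stay valid when the array is reallocated */
};

static struct node *nodes = NULL;
static int capacity = 0, used = 0;

/* Returns the index of a fresh node, growing the array (2x) as
 * needed, or -1 on allocation failure. */
int node_new(int key)
{
    if (used == capacity) {
        int newcap = capacity ? capacity * 2 : 16;
        struct node *tmp = realloc(nodes, newcap * sizeof(struct node));
        if (tmp == NULL)
            return -1;
        nodes = tmp;            /* old contents were copied for us */
        capacity = newcap;
    }
    nodes[used].key = key;
    nodes[used].left = nodes[used].right = -1;
    return used++;
}
```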
Otherwise a fully dynamic tree may be required. The dynamic tree
typically has the advantage of only being as large as is necessary to
hold all of the current data, but has large disadvantages if the tree
is unbalanced (not to mention the difficult balancing algorithms that
may need to be used if balance becomes a factor).
Singly Linked Lists
While traversing a singly linked list can take O(n) time to find a
particular item, singly linked lists can be a very efficient way of
dealing with queues or stack structures. If you can always delete
from the front, and know that insertions will always be at the
front or at the back of the list, then both insertions and deletions
can be performed in O(1) time! Insertions at the tail may require
keeping a pointer to the tail (so you do not have to traverse through
the list to find the end!), but that is a small price to pay for O(1)
inserts.
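A minimal singly linked queue along those lines, with a tail pointer for O(1) inserts at the back (the names are our own):

```c
#include <stdlib.h>

struct item {
    int value;
    struct item *next;
};

static struct item *head = NULL, *tail = NULL;

/* O(1) insert at the back, thanks to the tail pointer */
int enqueue(int value)
{
    struct item *p = malloc(sizeof(struct item));
    if (p == NULL)
        return -1;
    p->value = value;
    p->next = NULL;
    if (tail)
        tail->next = p;
    else
        head = p;               /* queue was empty */
    tail = p;
    return 0;
}

/* O(1) delete from the front; returns -1 if the queue is empty */
int dequeue(int *value)
{
    struct item *p = head;
    if (p == NULL)
        return -1;
    *value = p->value;
    head = p->next;
    if (head == NULL)
        tail = NULL;
    free(p);
    return 0;
}
```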
Do not be tempted to make a singly linked list doubly linked just for
the sake of having a doubly linked list. If you are forced to
traverse the list from beginning to end to search for an item to
remove, and think that having reverse pointers will make removing that
item easier, you may be adding more overhead than you really need. If
you are having to traverse the list searching anyways, then a "chasing
pointer" (i.e. pointer to the previous item) may be just as effective.
Doubly Linked Lists
Doubly linked lists can be very efficient if there is a need to
traverse both directions in a list. The most useful case is when you
are handed a pointer to the item, and need to find the items on either
side (e.g. for a deletion). There are cases, however, where a doubly
linked list has been used where the "reverse pointer" was never used!
This is an ideal opportunity to save a pointer's worth of space in
each structure on the list: make the list singly linked.
Heaps
Never underestimate the power of the heap for holding data that must
be accessed in a sorted fashion. Heaps are especially useful for
event-based simulations, where an event heap can be used to store the
set of pending events. If done as an array, insertions are guaranteed
to be O(log n), queries for the next event are O(1), and deletions are
O(log n).
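An array-based min-heap keyed on event time might look like the following sketch (fixed-size for brevity; a production version would grow the array as needed, and event_pop assumes the heap is non-empty):

```c
#define MAX_EVENTS 1024

static double heap[MAX_EVENTS + 1];   /* heap[1..n]; index 0 unused */
static int n_events = 0;

int event_insert(double time)         /* O(log n) */
{
    int i;
    if (n_events == MAX_EVENTS)
        return -1;
    i = ++n_events;
    while (i > 1 && heap[i / 2] > time) {   /* sift up */
        heap[i] = heap[i / 2];
        i /= 2;
    }
    heap[i] = time;
    return 0;
}

double event_next(void)    /* O(1): smallest time is at the root */
{
    return heap[1];
}

double event_pop(void)     /* O(log n): remove and return next event */
{
    double min = heap[1], last = heap[n_events--];
    int i = 1, child;
    while ((child = 2 * i) <= n_events) {   /* sift down */
        if (child < n_events && heap[child + 1] < heap[child])
            child++;                        /* pick smaller child */
        if (last <= heap[child])
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = last;
    return min;
}
```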
Hashing
It's well known that a hash-based search can often do much better than
a linear or binary search, but it's important to stress this fact
again. If hashing turns an O(n log n) algorithm into an O(n + e) one
(where 'e' is a small constant), that improvement by a factor of
log n can really add up: for n = 10,000,000 elements, log_2 n is
roughly 23.
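A minimal chained hash table for integer keys, as a sketch; the table size and the multiplicative hash are our own choices, and any reasonable mixing function would do:

```c
#include <stdlib.h>

#define TABLE_SIZE 65536    /* power of two, so '&' can replace '%' */

struct entry {
    unsigned key;
    int value;
    struct entry *next;     /* chain for colliding keys */
};

static struct entry *table[TABLE_SIZE];

/* Knuth-style multiplicative hash for integer keys */
static unsigned hash(unsigned key)
{
    return (key * 2654435761u) & (TABLE_SIZE - 1);
}

int ht_insert(unsigned key, int value)
{
    unsigned h = hash(key);
    struct entry *e = malloc(sizeof(struct entry));
    if (e == NULL)
        return -1;
    e->key = key;
    e->value = value;
    e->next = table[h];     /* push onto the chain */
    table[h] = e;
    return 0;
}

/* Average O(1) lookup, vs. O(log n) per binary search */
struct entry *ht_lookup(unsigned key)
{
    struct entry *e;
    for (e = table[hash(key)]; e != NULL; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}
```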
Endian Issues
The two most common endian formats are Big Endian and Little Endian.
The only comment made here is to ensure that where possible, the
traces you use should be of the same endianness as the CPU
architecture so that your program does not have to do on-the-fly
endian conversions as each trace record is read.
To point, or not to point, that is the question
Pointers are used to refer to some other data object or collection of
data objects. There are times, however, when as natural as it might
seem to use a pointer, the use of a pointer is actually prohibitively
expensive. Consider an array of n pointers, and a set of n data
elements, each pointer pointing to a single element. If the size of
the pointer is 4 bytes (8 bytes on a 64-bit machine) and the size of
the data element is 100 bytes, then the use of the pointer is probably
warranted. If, however, the data element is only 4 bytes, or even 8
bytes, then perhaps the data structure would be better done as a
single array of the data elements. The problem here is that for every
4 bytes of data, there are 4 bytes of pointer overhead - that's a 100%
overhead in the amount of space required to hold the data structure,
just due to the pointers alone. The big question here is of what
benefit are those pointers (yes, there may be instances where this is
what you need), and why should you continue to use them? If the data
structure pointed to is 8 bytes, then there is a 50% overhead for
pointers. Bad, but not the end of the world if it makes the program
simpler in some other respect. Anything larger than this likely means
that pointers are fine - but not always! If you have a 20 byte
structure, and you know there are only going to be at most 2000 of
them needed in the program, allocate all of them in a single array!
There is no need to put pointers in the array, and then dereference
things all of the time. Rather than having 2000 4-byte pointers in an
array, and 2000 20-byte data structures in the general storage space,
just allocate the 2000 20-byte data structures in the array and be
done with it. In this case, there is a 17% saving in space on this
data structure, just by eliminating the pointers. (Think of how much
real memory that is when the number of elements is 10 million, rather
than 2000. Hint: It's close to 40MB of space, which if that's enough
to mean that your program is no longer paging, will be a huge speed
increase as well).
If pointers are necessary, it may be possible to still save a little
space. For example, rather than having a structure that looks like:
    struct foo {
        int key;
        int value;
        struct foo *next;
    };
it may be possible to use something like:
    struct foo {
        int key1;
        int value1;
        int key2;
        int value2;
        int key3;
        int value3;
        int key4;
        int value4;
        struct foo *next;
    };
In this case, the cost may be the need to look at all the 'keys', but
the savings is in not having to maintain 3 additional pointers.
Again, this is a time-space tradeoff, and whether or not to use this
technique will depend on the application.
Grouping data
In looking at the following two structures, it would appear that both
contain the same data, and take up the same space:
    struct foo1 {
        char a;
        int b;
        char c;
        int d;
        char e;
        char f;
    };
and
    struct foo2 {
        int b;
        int d;
        char a;
        char c;
        char e;
        char f;
    };
Due to the way that data gets aligned by the compiler, however, the
above two structures are not the same. On a 32-bit int system,
foo1 will be 20 bytes, while foo2 will be a mere 12 bytes. In
general, you want to keep the variables arranged such that for a given
set of 32 bits (or 64 bits), that as many variables as possible are
packed into that space. In foo1, 'a' requires a full 32 bits, because
32 bits is the natural word size of the processor, and both 'a' and
'b' cannot be put into the same 32-bit space ('b' is already
32-bits!). In foo2, however, 'a', 'c', 'e', and 'f' can all be packed
into the same 32-bit location since a) they are all next to each
other, and b) each is only 8-bits (the compiler knows how to pack
multiple smaller variables into a single 32-bit word). Similar behaviour
can be obtained from other data structures which are smaller than the
natural word size of the processor.
What's in the data
As hinted to, it is important to study the data structures in use
(especially those that get used a lot!) to make sure that they
are efficient. For example: why use a 32-bit value when an 8-bit or
16-bit structure would suffice? If the data range is from 0 to 255,
and there are 10 million entries, using a 32-bit value to represent
the data results in 30MB of wasted space! Examine true/false flags in
the structures. If you have:
    char flag1;
    char flag2;
    char flag3;
    char flag4;
perhaps you can get away with:
    char flags;    /* bit 0 is flag1, bit 1 is flag2, etc. */
Note that while this is still wasting 4 bits of the char, it is better
than wasting 28 of 32 bits! Again, with large data sets, a simple
change like the above can save 30MB on 10 million entries.
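One common way to pack the four flags into a single char is with bit masks; the macro and function names below are our inventions:

```c
/* each flag occupies one bit of the packed 'flags' byte */
#define FLAG1 0x01
#define FLAG2 0x02
#define FLAG3 0x04
#define FLAG4 0x08

struct record {
    char flags;             /* replaces four separate char fields */
};

static int flag2_is_set(const struct record *r)
{
    return (r->flags & FLAG2) != 0;
}

static void set_flag2(struct record *r)
{
    r->flags |= FLAG2;      /* turn the bit on */
}

static void clear_flag2(struct record *r)
{
    r->flags &= ~FLAG2;     /* turn the bit off, leave others alone */
}
```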
File formats and File IO
A simple ASCII flat file is a wonderful format - it is typically
relatively straight forward, well understood, and easy to read with a
variety of standard tools. 'grep', 'awk', 'perl', and other
utilities can be used to pick out appropriate bits of data.
Unfortunately, the simplicity of the format has its drawbacks:
- The data file is typically larger than a binary format would be -
e.g. up to 4 ascii values may be represented as a single byte in a
binary trace.
- Larger trace files typically mean more time required to read in
the trace - especially with functions like fscanf().
By converting an ASCII trace into a binary trace, a smaller trace file
can be achieved, and that smaller file can be read into a simulator
much more quickly.
The performance improvement that we experienced when going from using
fscanf() to fread(), and from an ASCII trace file to a binary trace
file was nearly an order of magnitude. We went from 3 hours down to
18 minutes to read the trace - at which point computation, not IO,
became the bottleneck.
It should be noted that many architectures may require padding data
structures to 32-bit boundaries (i.e. rounded to the nearest 32-bit
quantity). Thus it may be useful to take:
    struct trace_item {
        int field1;
        int field2;
        char field3;
        char field4;
    };
and add padding (in this case, two bytes):
    struct trace_item {
        int field1;
        int field2;
        char field3;
        char field4;
        char pad1;
        char pad2;
    };
This will ensure that the structure is nicely aligned, and will avoid
problems when writing to and reading from a trace file.
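Reading such a padded binary trace with fread() might look like the following sketch; reading a large batch of records per fread() call would be faster still:

```c
#include <stdio.h>

struct trace_item {
    int field1;
    int field2;
    char field3;
    char field4;
    char pad1;      /* explicit padding to a 32-bit boundary */
    char pad2;
};

/* Read fixed-size binary records one at a time; returns the number
 * of records read, or -1 if the file could not be opened. */
long read_trace(const char *filename)
{
    struct trace_item item;
    long count = 0;
    FILE *fp = fopen(filename, "rb");
    if (fp == NULL)
        return -1;
    while (fread(&item, sizeof(item), 1, fp) == 1) {
        /* ... hand 'item' to the simulator here ... */
        count++;
    }
    fclose(fp);
    return count;
}
```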
Input sources
Where possible, read data from the local disk, rather than from a
networked disk. For modern systems, data can be accessed much more
quickly from the local disk, and others sharing your network segment
will not get upset. Most modern machines with 10Mbps Ethernet can
easily saturate such a network with simple NFS transfers. This is not
a good way to gain friends.
Compressed Input
If disk bandwidth seems to be a bottleneck, your input data may
benefit from being compressed. Compression of data on-disk has two
benefits:
- Smaller disk usage.
- Less data to read from disk, which means the total data can be
read in faster if IO is the bottleneck
If IO really is the bottleneck, then you will be reading less data
from the disk, and the unused CPU cycles can be used to uncompress the
data on-the-fly before giving your program the input it expects.
Reading data from a compressed file is relatively easy. After
determining that the file is in fact compressed (via filename
extension, or a flag, or whatever), your program can do a simple
'popen()' instead of a 'fopen()':
    if (compressed) {
        sprintf(pipe, "gzip -dc %s", filename);
        if ((infile = popen(pipe, "r")) == NULL) {
            cerr << "Unable to open pipe for input.\n";
            exit(1);
        }
    } else {
        if ((infile = fopen(filename, "r")) == NULL) {
            cerr << "Unable to open " << filename << " for input.\n";
            exit(1);
        }
    }
After this code is run, the file handle 'infile' will look and behave
like a normal sequential file - it can be read as normal by fread(),
and your program will not need to even know that the file is
compressed! (Just remember to close a popen()'ed handle with pclose()
rather than fclose().)
For a 1GB uncompressed trace that compresses to 256MB, the time
savings to just read the trace from disk can be phenomenal!
Program Status Outputs
If your program prints out volumes of information regarding its
current status, you may be running much slower than you need to be.
To test this, run your program with its output redirected to nowhere:
./myprogram >& /dev/null
and time how long it takes. Compare that time with how long it
usually takes. The difference is the time required to do all of that
IO! By removing the extra printf's or cout's, the running time of a
program can be dramatically improved, especially if the amount of data
being output is quite large. Some techniques for avoiding printing
volumes of data include:
- Print out status indicators less frequently. If you print out an
"I'm here" message every 1000 trace records, perhaps only print it out
every 10000 trace records
- Save the output until the end. Hold off on printing stuff out
until the program completes. Depending on how much data you really
need to get out of the program, and/or the type of data, this approach
may not be feasible. But if you are not going to use that running
average, why bother printing it?
- If you really must collect a lot of data, consider writing it to a
(compressed) file, instead of just to the standard output.
- If the standard output is the place you want the data to go, then
consider simply re-directing the output to a file. Chances are that
the file IO will be much faster than your terminal IO.
Natural data structures
While this document has talked about using the most efficient size for
a data structure (e.g. use only a single character if your data values
are in the range 0-255), there are cases where the most time
efficient data structure may not be the most space efficient
structure. Many CPUs have a "natural word size" that is their most
efficient data size. To use any other data size requires either
special handling of the data, or may require other special operations
in order for the data to be used properly. Consider, for example, the
partial code fragment:
    signed char x;
    for (x = -40; x < 100; x++)
On a CPU with a natural word size of 32 bits, it may actually be more
expensive to use an 8-bit char instead of a 32-bit int. There are a
number of reasons for this:
- When transferring x to/from memory, does the CPU have to do
conversions from one size to another? E.g. are there special bus
alignment requirements that require shifting these 8 bits in some
non-natural way?
- How does the CPU deal with the sign bit of the value -40? The
most significant bit is usually the sign bit, but with an 8-bit char
value, how much extra work will the CPU have to do to make sure that
it can detect whether the char is negative or positive?
Simple counters, therefore, should likely be done in the natural word
size of the CPU. If the variable is put into a CPU register, there is
no saving in storage, and there may be an execution penalty.
Incremental Methods and Partial Computations
The basic premise of incremental algorithms is to take a data
structure in a known state, apply some small change to it, and have
the data structure in another well known state, without having to
re-do all the computations that were necessary to get it into the old
state. For example, to maintain a running average of values being
read in, rather than maintaining each value in an array, and summing
(and counting) the array every 10 seconds, simply maintain a current
sum and a current count that is updated each time a new element is
added to the array. While this sounds obvious and trivial, the same
principles can be applied to things like computing standard deviations,
or keeping track of averages of items that are being added and
removed. The trick with incremental algorithms is to realize that the
differences between the two different "states" of the data structures
are very small, and that previous computations do not always need to
be re-done.
The basic idea with partial computations is to do as little work as
you have to at the current stage - especially if the work that you do
now will simply be re-done, overwritten, or ignored at a later stage.
In our web-server simulator, each request being served is given its
portion of the bandwidth for the given time that has elapsed. The
requests are sorted by size, in ascending order. If the per-request
bandwidth is larger than a request's remaining bytes to be served, then
that request finishes in the given time interval. What we also know,
however, is that if the per-request bandwidth is smaller than the
bytes remaining to be served for the current request, then none of
the requests in the remainder of the list will finish this time
either! So we use a scheme that stops handing out bandwidth
to any request that is not going to complete in this time interval.
For a request of 2048 bytes, there was no point in doing 8
subtractions of 256 bytes when a single subtraction of 2048 bytes
would accomplish the same result.
Scalability
Many programs get written on an ad hoc basis, where the data set is
limited, and the amount of functionality required is small. Some of
these programs, however, get forced to deal with data sets that are
much larger than the original sets, and this is where the issues of
scalability surface. When designing a simulator or other program, it
may be necessary to think about how the input data set may change.
What happens when the trace goes from 10 million requests in 3 weeks
to 20 million requests in one week - can the simulator still handle
the 60 million requests for the 3-week period? Is the simple sorting
algorithm being used going to be able to handle 100,000,000 items?
Will that linked list still be efficient if it becomes larger than 20
items? If a limit on the size of the input set can be known and
accounted for, then some of these questions become easy to handle. In
many other cases, however, the size of the input data set will not be
known in advance, and it is up to the program designer to make sure
that the program will scale as necessary.
Multiple simulations from a single trace
Depending on the amount of RAM required to run a single simulation, it
may be beneficial to perform more than one simulation in a single
run. For example, if 30 minutes are required to read in a trace, and
it only takes 10 minutes to do the simulation computation, then 75% of
the processing time is just IO overhead. If you can run 3 simulations
simultaneously, using the same trace as the single input source, the
entire run may take 60 minutes total (30 + 3 * 10), but now only 50%
of the running time is IO overhead. By adding additional simulation
runs, the cost of reading the trace can be amortized over all of the
simulation runs. Thus in this example, only 10 minutes of reading the
trace would be charged to each instance of the program. If 10
simulations were run simultaneously, then the run would only take 130
minutes, and the cost per simulation of reading the trace would be a
mere 3 minutes. The per-simulation run-time, therefore, is now an
average of 13 minutes, rather than the 40 minutes required to do the
single run!
Where is the bottleneck
Every program has a bottleneck, and as a program is made more
efficient, all that happens is the bottleneck gets moved around.
There are a few techniques that can be used to figure out where the
bottlenecks might lie. For example, how long does it take to read the
trace file without actually processing it? If it takes 25 minutes to
run the simulation, but you find that it takes 22 minutes just to read
the trace file, it's not going to help much to make the simulator
algorithms more efficient - the most you can gain from that is about 3
minutes. It will pay off greatly, however, if you can find a way to
improve the time it takes to read the trace file, either by switching
to a binary trace (if using an ASCII trace) or by using a compressed
trace file (see the Compressed Input section) if not doing that already.
If the bottleneck is not in reading the input (e.g. reading the input
only takes 9 minutes, but the simulation takes 60 minutes to
complete), then the next step is to determine where your program is
spending most of its time. One of the tools for doing this is called
'gprof'. 'gprof' is a profiling tool which displays such information
as the number of times a given function called another function, and how
much of the running time was spent executing a particular function.
By compiling your application with:
gcc -pg -g
(assuming a C program. Similar arguments are used for C++) you
add the code necessary to profile how your program runs. After
running the simulator, you will have an output file named something
like 'gmon.out'. Running gprof on this file will display statistics
on each of the functions, including the number of times they were
called, and the amount of CPU time spent in each. By looking for
those functions which consume the most CPU time (and/or get called the
most often) we can find good candidates for functions that should be
improved. For example, if you find that the function which does a
linear search down a linked list is consuming the most CPU time,
perhaps converting that structure to a hashed structure would provide
a huge win.
Conclusion
This document has outlined a number of ideas, tips, and tricks for
writing programs. Hopefully some of them will be applicable to the
programs you write. This document is by no means comprehensive, and
is based only on the personal experience of the author. Be aware that
these techniques may not apply to your particular situation, and that
your mileage with any of the above may vary.
Acknowledgements
Thanks to Martin Arlitt and Jayakumar Srinivasan for providing
feedback and comments on early versions of this document.
Page last modified: July 15, 1999.
Minor tweakage: May 30, 2001.
Updated email address to be anti-spam-friendly: October 29, 2003.
Send comments to oster@cs.usask.ca.