Category Archives: Software

Fractals

Fractals

These are pictures of fractals generated by using Newton's method in the complex plane.

f(z) = z^5 - 5.0*z^4 - 55.0*z^3 + 245.0*z^2 + 654.0*z - 2520.0
f(z) = z^5 - 5.0*z^4 - 55.0*z^3 + 245.0*z^2 + 654.0*z - 2520.0
f(z) = z^5 - 5.0*z^4 - 55.0*z^3 + 245.0*z^2 + 654.0*z - 2520.0
f(z) = z^5 - 5.0*z^4 - 55.0*z^3 + 245.0*z^2 + 654.0*z - 2520.0
f(z) = sin(z)
f(z) = sin(z)
f(z) = sin(z)
f(z) = sin(z)

This software uses Newton’s method (the one from any first year calculus course) in the complex plane to generate fractals. This code makes use of many of the new C++11 language features, so a newer compiler will be required (I used GCC 4.8.1 running on Opensuse). This code is documented thoroughly with doxygen.

How it Works

A rectangular region is first chosen in the complex plane. Then, the fractal is computed row by row (a row is a line with all points on the line having a constant imaginary component – a horizontal line). Each point on the row is used as the initial guess for Newton’s method. Newton’s method will eventually converge on one of the zeros of the fractal generating function. The color of the associated pixel is chosen by which zero Newton’s method finds. The shading is derived from the number of iterations needed to reach the zero.

Source and Documentation

Here

Atomic Locking vs. Mutexes

atomic_v_mutex

Quotient of Times
Quotient of Times
Atomic Lock Times
Atomic Lock Times
Mutex Lock Times
Mutex Lock Times

The Question

Which is faster, a queue guarded by a std::mutex, or the same queue guarded by an atomic? This experiment attempts to at least partially answer that question. To do this, I constructed a queue that could be guarded by either a std::mutex or by my own atomic based lock. Then, I ran a fixed number of elements through the queue, with varying numbers of producer and consumer threads.

To be clear, this experiment was run several times (in single user mode). The number of elements that passed through the queue was the same for all runs. Each run had a fixed number of consumer threads and a fixed number of producer threads. Each run had a guard on the queue that was either a std::mutex or it was an atomic based lock. Each run was timed, and the results were recorded.

The Results

See especially the graphic titled “Mutex / Atomic”. Surprisingly, the atomic based lock out performed std::mutex in most cases. Only in those cases where there is much contention (that is, where there are many producer and consumer threads) did std::mutex do better. In these high contention cases, it is likely that those threads using an atomic lock are consuming quite a bit of time spinning, and you can see the resulting drop in performance (see the graph labeled “Atomic Serialization”). In those cases where there is low contention, it would appear that the atomic lock simply has less overhead than a std::mutex. Later, I’ll have to look at the C++11 standard, to see exactly what is required of a std::mutex. That might explain things.

The Source

Is available here.

Cache Interactions Mystery

cacheInteractions

The Question

Recently, I found myself wondering about cache interactions (specifically cache localization issues), and so I wrote a small bit of code to experiment a bit. The results are surprising to me, and I’m going to have to admit that I cannot yet explain the data. Please comment if you have any ideas.

The Code

An Overview

The source code is quite simple. It sets up two STL containers, a vector and a map. The containers are then filled with a fixed number of elements of a fixed size, with the fixed size being very significant here. Then, a fixed number of searches for random elements are performed. When searching the vector, std::binary_search is used, so that the same O(log(n)) search performance is achieved as the map. These searches are timed. The above is then executed several times, with each subsequent run having an element size (in bytes) larger than the previous run. The impact of the size of the elements on the search time is significant, and mysterious.

The Source Code

The source code is available here. Please note that the scripts and makefiles are set up to build the code both for ARM and for x86. Likely, the makefiles and scripts will need modification before they work on your system. Also note that the code is a bit rough – I was experimenting, not trying to develop production code.

The Results

Please see the diagrams, where search time (the y-axis) is plotted against element size (the x-axis). The results achieved when executing on an ARM processor (in this case, executed on a Raspberry Pi) are not very surprising. There is little correlation between element size and search times. When the code is executed on a core i7 (running Linux), however, things get interesting. Clearly, there are cache interactions going on here. Note that the cache line size of a core i7 is 64 bytes, which, as you can see from the diagram, is where the best performance is achieved. Why the map takes a hit when the element size is 128 bytes is a mystery. Also, I am unclear as to why performance suffers when element size is less than the cache line size (64 bytes) for the vector. The performance results for sizes above 256 bytes per element also mystify me, for both map and vector.

So…

I’ll be thinking about this. If I manage to figure anything out, I’ll post it here. Please post yourself if this makes any sense to you.

A Picture’s Worth a Thousand Work Items

In the sample code section of this website, you may download code that I have written titled “throughput”. These pictures were produced by this code.

thruPut

 

The throughput code attempts to measure the performance (or throughput) of a single queue with varying numbers of producer and consumer threads, in search of that “sweet spot” that maximizes throughput through the queue. To make this simulation realistic, producer and consumer threads are given both work and I/O to perform. The amount of work and I/O given to each thread varies according to a normal (Gaussian) distribution. For a description of the code, see: https://heidmann.com/?page_id=15.

The above pictures are from identical runs (that is, identical input parameters), with one run executing on my quad core intel i7, and the other executing on my raspberry pi. The color codes on the right of each picture indicate the number of work items that actually got through the queue. The X and Y axis are the number of consumer and producer threads.

One of the things that is very interesting (and a bit unexpected) to me is that the “sweet spots” are not nearly as far apart as I would have expected (with respect to the number of producer and consumer threads).

Comments, anyone?