This code generates 'perfect' mazes and solves them.
This code is just for fun. It generates a ‘perfect’ maze and solves it. The results are written out to a PNG file. For details, click here.
These are pictures of fractals generated by using Newton's method in the complex plane.
[img src=http://heidmann.com/wp-content/flagallery/fractals/thumbs/thumbs_fractalonezoomed.png]40f(z) = z^5 - 5.0*z^4 - 55.0*z^3 + 245.0*z^2 + 654.0*z - 2520.0
[img src=http://heidmann.com/wp-content/flagallery/fractals/thumbs/thumbs_fractaltwounzoomed.png]50f(z) = sin(z)
[img src=http://heidmann.com/wp-content/flagallery/fractals/thumbs/thumbs_fractaltwozoomed.png]30f(z) = sin(z)
This software uses Newton’s method (the one from any first year calculus course) in the complex plane to generate fractals. This code makes use of many of the new C++11 language features, so a newer compiler will be required (I used GCC 4.8.1 running on Opensuse). This code is documented thoroughly with doxygen.
How it Works
A rectangular region is first chosen in the complex plane. Then, the fractal is computed row by row (a row is a line with all points on the line having a constant imaginary component – a horizontal line). Each point on the row is used as the initial guess for Newton’s method. Newton’s method will eventually converge on one of the zeros of the fractal generating function. The color of the associated pixel is chosen by which zero Newton’s method finds. The shading is derived from the number of iterations needed to reach the zero.
Source and Documentation
The Quotient of Mutex time / Atomic time.[img src=http://heidmann.com/wp-content/flagallery/atomic_v_mutex/thumbs/thumbs_atomic.png]1020Atomic Lock Times
The amount of time that it took to get a fixed number of items through a queue, guarded by an atomic, with the given number of producer and consumer threads.[img src=http://heidmann.com/wp-content/flagallery/atomic_v_mutex/thumbs/thumbs_mutex.png]850Mutex Lock Times
The amount of time that it took to get a fixed number of items through a queue, guarded by a mutex, with the given number of producer and consumer threads.
Which is faster, a queue guarded by a std::mutex, or the same queue guarded by an atomic? This experiment attempts to at least partially answer that question. To do this, I constructed a queue that could be guarded by either a std::mutex or by my own atomic based lock. Then, I ran a fixed number of elements through the queue, with varying numbers of producer and consumer threads.
To be clear, this experiment was run several times (in single user mode). The number of elements that passed through the queue was the same for all runs. Each run had a fixed number of consumer threads and a fixed number of producer threads. Each run had a guard on the queue that was either a std::mutex or it was an atomic based lock. Each run was timed, and the results were recorded.
See especially the graphic titled “Mutex / Atomic”. Surprisingly, the atomic based lock out performed std::mutex in most cases. Only in those cases where there is much contention (that is, where there are many producer and consumer threads) did std::mutex do better. In these high contention cases, it is likely that those threads using an atomic lock are consuming quite a bit of time spinning, and you can see the resulting drop in performance (see the graph labeled “Atomic Serialization”). In those cases where there is low contention, it would appear that the atomic lock simply has less overhead than a std::mutex. Later, I’ll have to look at the C++11 standard, to see exactly what is required of a std::mutex. That might explain things.
Is available here.
Recently, I found myself wondering about cache interactions (specifically cache localization issues), and so I wrote a small bit of code to experiment a bit. The results are surprising to me, and I’m going to have to admit that I cannot yet explain the data. Please comment if you have any ideas.
The source code is quite simple. It sets up two STL containers, a vector and a map. The containers are then filled with a fixed number of elements of a fixed size, with the fixed size being very significant here. Then, a fixed number of searches for random elements are performed. When searching the vector, std::binary_search is used, so that the same O(log(n)) search performance is achieved as the map. These searches are timed. The above is then executed several times, with each subsequent run having an element size (in bytes) larger than the previous run. The impact of the size of the elements on the search time is significant, and mysterious.
The Source Code
The source code is available here. Please note that the scripts and makefiles are set up to build the code both for ARM and for x86. Likely, the makefiles and scripts will need modification before they work on your system. Also note that the code is a bit rough – I was experimenting, not trying to develop production code.
Please see the diagrams, where search time (the y-axis) is plotted against element size (the x-axis). The results achieved when executing on an ARM processor (in this case, executed on a Raspberry Pi) are not very surprising. There is little correlation between element size and search times. When the code is executed on a core i7 (running Linux), however, things get interesting. Clearly, there are cache interactions going on here. Note that the cache line size of a core i7 is 64 bytes, which, as you can see from the diagram, is where the best performance is achieved. Why the map takes a hit when the element size is 128 bytes is a mystery. Also, I am unclear as to why performance suffers when element size is less than the cache line size (64 bytes) for the vector. The performance results for sizes above 256 bytes per element also mystify me, for both map and vector.
I’ll be thinking about this. If I manage to figure anything out, I’ll post it here. Please post yourself if this makes any sense to you.
In the sample code section of this website, you may download code that I have written titled “throughput”. These pictures were produced by this code.
The throughput code attempts to measure the performance (or throughput) of a single queue with varying numbers of producer and consumer threads, in search of that “sweet spot” that maximizes throughput through the queue. To make this simulation realistic, producer and consumer threads are given both work and I/O to perform. The amount of work and I/O given to each thread varies according to a normal (Gaussian) distribution. For a description of the code, see: https://heidmann.com/?page_id=15.
The above pictures are from identical runs (that is, identical input parameters), with one run executing on my quad core intel i7, and the other executing on my raspberry pi. The color codes on the right of each picture indicate the number of work items that actually got through the queue. The X and Y axis are the number of consumer and producer threads.
One of the things that is very interesting (and a bit unexpected) to me is that the “sweet spots” are not nearly as far apart as I would have expected (with respect to the number of producer and consumer threads).
This weekend I finally got a raspberry pi, and I have to admit, I’m a bit excited. Since the beginning of my career as a software engineer, most of my time has been spent developing embedded software (some of it real time, some of it safety critical). My current employer, iTRACS, doesn’t do any embedded work, so this little “dev board” (that’s what we call boards purposed specifically for development – playing around, really) will give me the opportunity to keep my embedded skills from getting all rusty.
The raspberry pi is a computer, albeit a very small and cheap one (and it, for all practical purposes, is an embedded computer). The whole thing is only a bit larger than a credit card, has a 700MHz ARM processor, an ethernet port, two USB ports, and GPIO (general purpose I/O) pins (and a few other things). It does not have a disk drive, but rather boots of off an SD card (yup, the same ones that cameras use). My raspberry pi is currently running an optimized Debian Linux called “Rasbian”. It was one of the pre-built images available from raspberry pi. It will do for now (but who knows what I’ll do later).
Once I got the board off of its back (that is, got it to boot), I went about getting a cross tool chain working on my desktop Linux box. This will allow me to build (compile) programs on my big, beefy desktop and transfer the resulting executable to my “razz” directly. It is much better build on my desktop than to attempt it on the little pi.
Also, since I am a C++ developer, I make frequent use of the boost libraries. These I also had to cross compile (using my shiny new cross tool chain). It took a bit of doin’, but I eventually figured it out. I went ahead and built some C++ code I had previously developed, and, after downloading the new boost libraries to my “razz”, the code executed without a hitch!
I love it when a plan comes together!