Recently, I found myself wondering about cache interactions (specifically cache localization issues), and so I wrote a small bit of code to experiment a bit. The results are surprising to me, and I’m going to have to admit that I cannot yet explain the data. Please comment if you have any ideas.
The source code is quite simple. It sets up two STL containers, a vector and a map. The containers are then filled with a fixed number of elements of a fixed size, with the fixed size being very significant here. Then, a fixed number of searches for random elements are performed. When searching the vector, std::binary_search is used, so that the same O(log(n)) search performance is achieved as the map. These searches are timed. The above is then executed several times, with each subsequent run having an element size (in bytes) larger than the previous run. The impact of the size of the elements on the search time is significant, and mysterious.
The Source Code
The source code is available here. Please note that the scripts and makefiles are set up to build the code both for ARM and for x86. Likely, the makefiles and scripts will need modification before they work on your system. Also note that the code is a bit rough – I was experimenting, not trying to develop production code.
Please see the diagrams, where search time (the y-axis) is plotted against element size (the x-axis). The results achieved when executing on an ARM processor (in this case, executed on a Raspberry Pi) are not very surprising. There is little correlation between element size and search times. When the code is executed on a core i7 (running Linux), however, things get interesting. Clearly, there are cache interactions going on here. Note that the cache line size of a core i7 is 64 bytes, which, as you can see from the diagram, is where the best performance is achieved. Why the map takes a hit when the element size is 128 bytes is a mystery. Also, I am unclear as to why performance suffers when element size is less than the cache line size (64 bytes) for the vector. The performance results for sizes above 256 bytes per element also mystify me, for both map and vector.
I’ll be thinking about this. If I manage to figure anything out, I’ll post it here. Please post yourself if this makes any sense to you.
I forgot to mention, on the core i7, I am pretty confident that the code executed on only a single code (because the code is single threaded). Also, if it helps, the core i7 has 64KB per core of L1 cache, 256KB per core of L2 cache, and a bunch (somewhere between 4MB and 12MB) of shared L3 cache (and no, I don’t happen to know how much L3 cache my core i7 has. maybe I should query the proc tables someday, huh?).
OK, this is a bit embarrassing. I just discovered that in the code, I failed to account for the search key in the element size. That means that all of my elements are larger than I thought that they were, and it means that they are not multiples of the cache line size. That may be part of the reason for the goofy interactions that I’m seeing. I’ll re-code and update here.