24 February 2018

Trashing the cache

Recently I switched jobs. I now pretend to know a thing or two about game development as a lecturer at Howest DAE.

One of the courses I'm teaching has the beautiful name "Programming 4". The topics are software design patterns, memory management, threading and networking.

The first week we talked about the cache and the CPU pipeline. I have always known about these things, but it wasn't until I had to explain the concepts to students that I really took an in-depth look at them.

Integer array

When your code uses memory, it gets fetched from main memory (RAM) into the CPU's cache. When you need a single integer, the CPU won't fetch just that one value; instead it fetches an entire cache line, containing - among other things - the integer you requested. If your program then uses the data next to that integer, it doesn't have to be fetched anymore, it can be used immediately. That's a cache hit. When the data is not in the cache and another fetch from memory is needed, that's a cache miss. Check out the data locality chapter from Game Programming Patterns by Robert Nystrom for more details.
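
Cache line sizes differ per CPU. If you want to check what yours is, C++17 added a hint for it; a minimal sketch, assuming your standard library already ships the interference-size constants (not all of them did in 2018):

#include <cstddef>
#include <iostream>
#include <new>

int main()
{
 // Hint for the cache line size; fall back to 64 bytes if the constant isn't available.
#ifdef __cpp_lib_hardware_interference_size
 constexpr std::size_t cacheLineSize = std::hardware_destructive_interference_size;
#else
 constexpr std::size_t cacheLineSize = 64;
#endif
 std::cout << cacheLineSize << " bytes per cache line, "
  << cacheLineSize / sizeof(int) << " ints per line\n";
}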

The first assignment was to loop over an array of integers with increasing step size, inspired by this blog post. The goal is to illustrate the impact of cache misses - where we need to fetch data from memory into the cache - versus cache hits, where we use the data already in the cache.

Consider this loop:

const unsigned long length = 64 * 1024 * 1024;
const auto arr = new int[length];
// Double the step size each pass: 1, 2, 4, ... 1024.
for (unsigned long k = 1; k <= 1024; k *= 2)
{
 // Touch every k-th integer in the array.
 for (unsigned long i = 0; i < length; i += k)
  arr[i] = static_cast<int>(i);
}
delete[] arr;
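
To get the timings I wrapped each pass in a simple stopwatch. Something along these lines works - a rough sketch with std::chrono; the actual measurement code (warm-up runs, averaging) is up to you:

#include <chrono>
#include <iostream>

// Times a single pass over the array with the given step size.
void MeasurePass(int* arr, unsigned long length, unsigned long step)
{
 const auto start = std::chrono::high_resolution_clock::now();

 for (unsigned long i = 0; i < length; i += step)
  arr[i] = static_cast<int>(i);

 const auto end = std::chrono::high_resolution_clock::now();
 const std::chrono::duration<double, std::milli> ms = end - start;
 std::cout << "step " << step << ": " << ms.count() << " ms\n";
}

Call it for each step size from the outer loop above and you get one data point per power of two.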

We measure how long it takes to go through the array and perform an operation on each value, with step sizes increasing in powers of two. That gave me these results:

The vertical axis is the time it took to go through the array, the horizontal axis is the step size. The red line is what we would expect: every time we double the step size we perform only half as many operations as before, so we would expect the time to be cut in half too. But this is not the case.

Instead, the time levels off at some kind of minimum, and only once the step size goes over 16 do we start to cut the time in half with every step. What is happening?

My cache line size is 64 bytes, so 16 integers fit in one cache line. When we loop over the array, 16 integers are loaded into the cache at once. Consider this image:

With step size 1 we use every integer in the cache line. With step size 2 we use every other integer, with step size 4 every fourth integer, and so on. With step size 32, however, we skip an entire cache line fetch that we needed before; with step size 64 we skip three cache line fetches, etc. It's only when we need fewer cache line fetches that our timings go down!

In other words, although we did fewer and fewer operations with every step size, we could not get faster than the time it takes to fetch the array from memory into the cache. We were "cache-bound".
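
You can also see this with a bit of counting instead of a stopwatch. A small sketch (assuming 4-byte ints and 64-byte cache lines) that counts how many distinct cache lines each step size touches - the count stays constant up to a step size of 16 and only then starts halving:

#include <iostream>

int main()
{
 const unsigned long length = 64 * 1024 * 1024;
 const unsigned long cacheLineSize = 64; // bytes
 const unsigned long intsPerLine = cacheLineSize / sizeof(int);

 for (unsigned long k = 1; k <= 1024; k *= 2)
 {
  // Count the distinct cache lines the loop with step size k touches.
  unsigned long linesTouched = 0;
  unsigned long lastLine = ~0UL;
  for (unsigned long i = 0; i < length; i += k)
  {
   const unsigned long line = i / intsPerLine;
   if (line != lastLine) { ++linesTouched; lastLine = line; }
  }
  std::cout << "step " << k << ": " << linesTouched << " cache lines\n";
 }
}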

GameObjects

So instead of integers, let's have an array of these GameObjects and perform an operation on the ID field:

struct Transform
{
 // 16 floats = 64 bytes: exactly one cache line.
 float matrix[16] = {
  1,0,0,0,
  0,1,0,0,
  0,0,1,0,
  0,0,0,1 };
};

class GameObject3D
{
public:
 Transform transform; // the full 64-byte transform is embedded in the object
 int ID;
};

That results in this graphic:

This time it almost follows the expected line. It confused the students: why is there no bump now? The reason: a GameObject3D is 68 bytes large, so we need to fetch at least two cache lines from memory to get a single GameObject3D into the cache. Because every object already spans more than a full cache line, processing half the number of GameObjects also means filling half the number of cache lines, so it takes half as much time.
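
You can check those sizes quickly; a small sketch - the exact number may vary a little per compiler due to padding:

#include <iostream>

int main()
{
 std::cout << "Transform:    " << sizeof(Transform) << " bytes\n";    // 64: exactly one cache line
 std::cout << "GameObject3D: " << sizeof(GameObject3D) << " bytes\n"; // 68 on my machine
 static_assert(sizeof(GameObject3D) > 64, "fits in a single 64-byte cache line after all");
}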

Then consider this alternative:

class GameObject3DAlt
{
public:
 Transform * transform;
 int ID;
};

We now have a pointer to the transform instead of the entire struct. This results in this graphic:

Ah, there is our bump again! But isn't this worse than the previous one? Not really - compare them:

The alternative game object is processed way faster now, because we only need one memory fetch to process 8 game objects: the object is 8 bytes large (a 4-byte pointer plus a 4-byte int on a 32-bit build; on a 64-bit build it would be 16 bytes and only 4 would fit per line). By fitting more objects into one cache line we gain more speed - we are definitely cache-bound here.
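
Printing the sizes side by side makes the point - a quick sketch; the exact numbers depend on your compiler and pointer size:

#include <cstddef>
#include <iostream>

int main()
{
 const std::size_t cacheLineSize = 64;
 std::cout << "GameObject3D:    " << sizeof(GameObject3D) << " bytes\n";
 std::cout << "GameObject3DAlt: " << sizeof(GameObject3DAlt) << " bytes, "
  << cacheLineSize / sizeof(GameObject3DAlt) << " per cache line\n";
}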

This just goes to show that it's important to consider the layout of your game object data in memory. The smaller your objects are, the more of them fit in a single cache line.

It also means that once your object is (roughly) the same size as a cache line, it doesn't matter much where it is located anymore: you will have to fetch it from memory every time anyway, and no neighbouring data gets to come along for free.

I thought this was cool ...