22 October 2018

Smart (shared) pointers or dumb pointers?

A few days ago someone asked me why I was teaching students to use smart pointers. The students' code was considered "bad" because they had used smart pointers. That same week, other students asked me what they should use: smart pointers or raw pointers.

I defended my point of view on smart pointers, but my modus operandi is to always question myself. So am I right? Are smart pointers a good tool in our toolbox as game programmers, or should they be avoided at all times? When we see them, should we run away in terror, or should we consider them good design (if used well)?

The community's opinion

There's definitely been a shift in the mindset of the gamedev community. Googling for "smart pointers in game development" brought me to several threads on the topic at gamedev.net.

In the 2012 thread it's clear that C++11 was still new and people were questioning the standard. In the 2016 thread, however, people actually defend the standard.

The standard

Recently something appeared known as the "C++ Core Guidelines", a collection of coding guidelines for C++ written by people who are knowledgeable about C++ and edited by Herb Sutter and Bjarne Stroustrup. When we look at the chapter on resource management, we see that the use of smart pointers is encouraged.

  • R.20: "Use unique_ptr or shared_ptr to represent ownership". Ok that's clear, when we want to think about ownership, use those pointer types.
  • R.10 and R.11 are clear: avoid using malloc/free and calling new/delete explicitly. Why? Because we all know that we should use RAII, and memory is a resource, so it should be wrapped in a RAII wrapper. Enter the smart pointers (see the sketch below this list).
  • R.3: "A raw pointer (a T*) is non-owning"
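
A minimal sketch of what those ownership rules look like in practice (the Texture type and the names here are placeholders for illustration):

#include <memory>

struct Texture { /* ... */ };

class Material
{
public:
 // R.20: the material owns its texture through a unique_ptr, no explicit new/delete (R.10/R.11); RAII cleans it up.
 std::unique_ptr<Texture> m_Texture = std::make_unique<Texture>();

 // R.3: a raw pointer is non-owning, so callers can only observe the texture.
 Texture* GetTexture() const { return m_Texture.get(); }
};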

Why is it up for discussion?

In game development we deeply care about memory access patterns because they can heavily impact the performance of our algorithms, as I have shown in this previous blog post. The fear of smart pointers is mostly about shared_ptr, because that one needs to keep a reference counter in memory somewhere. When an extra shared pointer to a given object is created, the reference counter must be increased (and decreased when the pointer is released), which causes memory accesses we don't want. Indeed, from Scott Meyers' Effective Modern C++:

  1. std::shared_ptrs are twice the size of a raw pointer.
  2. Memory for the reference count must be dynamically allocated.
  3. Increments and decrements of the reference count must be atomic.

But, as Scott Meyers points out, most of the time we use move construction (a C++11 feature) when creating a shared pointer from another one, thus avoiding cost 3. Creating the control block can be considered free as long as you use "make_shared", which allocates the object and its control block together. Dereferencing a shared_ptr costs the same as dereferencing a raw pointer (so use that).
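
A tiny illustration of those points (Widget is just a placeholder type):

#include <memory>

struct Widget { int value = 42; };

int main()
{
 auto sp = std::make_shared<Widget>();  // one allocation holding the object and its control block
 auto moved = std::move(sp);            // move construction: no reference count update
 auto copy = moved;                     // copy construction: atomic increment of the reference count
 return moved->value - copy->value;     // dereferencing costs the same as with a raw pointer
}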

The exact ins and outs of smart (shared) pointers and the possible performance impact they have are discussed in this very detailed talk on smart pointers by Nicolai M. Josuttis at NDC 2015. He describes in detail what exactly the cost is: there is a memory overhead of 12-20 bytes, depending on the usage, and in multi-threaded applications there is an overhead in updating the reference counter. Updating the reference counter must happen atomically, as Scott Meyers writes, and that can introduce stalls in the CPU's store buffer. Nicolai illustrates this in his talk and the impact is astonishing. However, as long as you don't copy the smart pointers by value, there is no noteworthy cost.

In GotW #91 Herb Sutter gives these two guidelines:

  • Guideline: Don’t pass a smart pointer as a function parameter unless you want to use or manipulate the smart pointer itself, such as to share or transfer ownership.
  • Guideline: Prefer passing objects by value, *, or &, not by smart pointer.

This translates into the Core Guidelines as:

  • R.30: Take smart pointers as parameters only to explicitly express lifetime semantics
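
A minimal sketch of how I read R.30 and GotW #91 (the Enemy type and the function names are made up for illustration):

#include <memory>
#include <vector>

struct Enemy { int health = 100; };

// Just uses the enemy and doesn't care who owns it: take a reference (or a raw pointer).
void Damage(Enemy& enemy, int amount)
{
 enemy.health -= amount;
}

// Stores the enemy and shares in its lifetime: taking a shared_ptr expresses exactly that.
class Spawner
{
public:
 void Register(std::shared_ptr<Enemy> enemy) { m_Enemies.push_back(std::move(enemy)); }
private:
 std::vector<std::shared_ptr<Enemy>> m_Enemies;
};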

My conclusion

Is often this one: use the right tool for the right job (and know how to use it!). Indeed there are potential costs to std::shared_ptr, costs we don't like in game development. As seen in the tests, they show up most often when shared pointers are copied by value. That's why we teach our students to pass these by reference, just like strings. What I learned here is R.30: only pass smart pointers when you're manipulating their lifetime. Raw pointers and references can and should be used when lifetime is not an issue.

If we're not working on the hot code path, we want the safety and correctness these smart pointers give us. And yet again: profile before you optimize. Be sure that there is a performance impact before you start to remove features in favor of "optimization".

Know the difference between the various types of pointers and use them for their intended purpose. I hope this post gives you some links to resources that help you with just that.

04 August 2018

More foggy adventures

It seems I was too eager with my stylized fog effect in my previous post. When I added a fly-through script on my camera (which I hadn't done before writing the previous post) it quickly showed something was wrong:

As you can see in the above image, the gradient moves over the terrain as you turn around. This is a typical side-effect of depth-based fog (which is the standard approach).


Flat depth vs distance

Normally, with a single color and a good fog distance, this isn't too bad, but because we introduced the gradient this side-effect becomes really apparent. In a side-scrolling context it's not much of an issue, since you don't turn around. But usually you are turning your head, so we want distance-based fog. The above image and the ins and outs of fog come from this excellent tutorial, so check that out for more details.

I got a lot of inspiration from the now deprecated "Global Fog" post-process effect from Unity. Both that script and the tutorial from Catlike Coding explain how to implement distance-based fog. So we implement this with the PostFX v2 system. First, pass the frustum corners to the shader:

public override void Render(PostProcessRenderContext context)
{
 //...
 
 Camera cam = context.camera;
 Transform camtr = cam.transform;

 Vector3[] frustumCorners = new Vector3[4];
 cam.CalculateFrustumCorners(new Rect(0, 0, 1, 1), 
  cam.farClipPlane, cam.stereoActiveEye, frustumCorners);

 Matrix4x4 frustumVectorsArray = Matrix4x4.identity;
 frustumVectorsArray.SetRow(0, frustumCorners[0]);
 frustumVectorsArray.SetRow(1, frustumCorners[3]);
 frustumVectorsArray.SetRow(2, frustumCorners[1]);
 frustumVectorsArray.SetRow(3, frustumCorners[2]);

 sheet.properties.SetMatrix("_FrustumCorners", frustumVectorsArray);
 sheet.properties.SetVector("_CameraWS", camtr.position);

 //...
}

In the vertex program we select the correct corner so it gets interpolated:

struct v2f
{
 float4 vertex : SV_POSITION;
 float2 texcoord : TEXCOORD0;
 float2 texcoordStereo : TEXCOORD1;
 float4 ray : TEXCOORD2;
};

v2f Vert(AttributesDefault v)
{
 v2f o;
 
 // ...
 
 o.ray = _FrustumCorners[o.texcoord.x + 2 * o.texcoord.y];
 
 return o;
}

And then use that in the fragment shader:

float4 Frag(v2f i) : SV_Target
{
 half4 color = SAMPLE_TEXTURE2D(_MainTex, sampler_MainTex, i.texcoordStereo);
 float depth = SAMPLE_DEPTH_TEXTURE(_CameraDepthTexture, sampler_CameraDepthTexture, i.texcoordStereo);
 depth = Linear01Depth(depth);

 //float dist = ComputeFogDistance(depth);
 float dist = length(depth * i.ray);

 half fog = 1.0 - ComputeFog(dist);
 half gradientSample = 1.0 - ComputeFog(dist * _Spread);
 half4 fogColor = SAMPLE_TEXTURE2D(_FogGradient, sampler_FogGradient, gradientSample);
 return lerp(color, fogColor, fog * fogColor.a);
}

Easy enough, right? Or so I thought. This did not work at all! For some reason the interpolated rays were incorrect in the fragment shader. I spent the rest of the day with debug rendering, comparing results between the GlobalFog effect and mine, but I failed to find why interpolation seemed broken.

After a good night's sleep (solutions are always found after a good night's sleep) I decided to dig into the source code of the PostFX system. It had never occurred to me before that at the end of the Render call in my effect it says "BlitFullscreenTriangle", whereas all the legacy post-fx examples say "Blit". In the source code it literally says:

// Use a custom blit method to draw a fullscreen triangle instead of a fullscreen quad
// https://michaldrobot.com/2014/04/01/gcn-execution-patterns-in-full-screen-passes/

Right, ok, that explains a lot: we're not interpolating over a quad but over a single triangle that covers the entire viewport, which is apparently more cache-friendly and thus faster. The coordinates look like this:

Where we used to have four vertices between -1 and 1 on both axes we now have a triangle between -1 and 3. Thus we change the provided corners:

public override void Render(PostProcessRenderContext context)
{
 //...
 
 Camera cam = context.camera;
 Transform camtr = cam.transform;

 Vector3[] frustumCorners = new Vector3[4];
 cam.CalculateFrustumCorners(new Rect(0, 0, 1, 1), 
  cam.farClipPlane, cam.stereoActiveEye, frustumCorners);
 var bottomLeft = camtr.TransformVector(frustumCorners[1]);
 var topLeft = camtr.TransformVector(frustumCorners[0]);
 var bottomRight = camtr.TransformVector(frustumCorners[2]);

 Matrix4x4 frustumVectorsArray = Matrix4x4.identity;
 frustumVectorsArray.SetRow(0, bottomLeft);
 frustumVectorsArray.SetRow(1, bottomLeft + (bottomRight - bottomLeft) * 2);
 frustumVectorsArray.SetRow(2, bottomLeft + (topLeft - bottomLeft) * 2);

 sheet.properties.SetMatrix("_FrustumVectorsWS", frustumVectorsArray);

 //...
}

We select the correct corner via the vertex coordinates. For the fullscreen triangle the texture coordinates run from 0 to 2, so x/2 + y maps the three vertices to rows 0, 1 and 2 of the matrix:

v2f Vert(AttributesDefault v)
{
 v2f o;
 //...
 int index = (o.texcoord.x / 2) + o.texcoord.y;
 o.ray = _FrustumVectorsWS[index];
 //...
 return o;
}

And done! When we now look around us the fog stays the same:

If you looked closely you may have also noticed some height fog in the gifs. I'm still working on that, but expect another update on that topic soon :)

02 August 2018

Stylistic fog from Firewatch with Unity's PostFX v2

A friend of mine (Kenny Guillaume) asked me if it would be possible to implement a fog effect as in Firewatch:

The picture above is taken from this video of the GDC 2015 talk on the art of Firewatch, where they explain how they implemented it.

The effect is simple enough: apply fog as a post-process effect and for each sample fetch the fog color from a gradient texture. I even copy-pasted the gradients from the same video:

My first take on this was a MonoBehaviour where we apply this effect in the OnRenderImage override. While this yielded good results, this is not how things should be done nowadays.

No sir, now we have Unity's PostFX V2, which you can enable via the package manager. I saw this new library at Unite Berlin and was really impressed by it. It is a well-designed system if you ask me!

So the challenge was to incorporate this "stylistic fog" effect (as they call it in the Firewatch video; I'd rather call it "Stylized Fog") into the PostFX V2 system. In a Post Process Profile the settings look like this:

I managed to get this result by consulting the other effects that are available on github (since PostFX v2 is completely open source) and this very nice tutorial on custom effects. Definitely check this manual on the new PostFX system too.
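
To give an idea of the structure, this is roughly what the settings-and-renderer script looks like. It's a trimmed-down sketch based on that custom effects tutorial; the shader name here is a placeholder and only the gradient and spread parameters of my effect are shown (the real files are in the repo):

using System;
using UnityEngine;
using UnityEngine.Rendering.PostProcessing;

[Serializable]
[PostProcess(typeof(StylizedFogRenderer), PostProcessEvent.BeforeStack, "Custom/Stylized Fog")]
public sealed class StylizedFog : PostProcessEffectSettings
{
 public TextureParameter fogGradient = new TextureParameter();
 [Range(0f, 2f)] public FloatParameter spread = new FloatParameter { value = 1f };
}

public sealed class StylizedFogRenderer : PostProcessEffectRenderer<StylizedFog>
{
 // The effect samples the depth buffer, so ask the camera to render it.
 public override DepthTextureMode GetCameraFlags() => DepthTextureMode.Depth;

 public override void Render(PostProcessRenderContext context)
 {
  var sheet = context.propertySheets.Get(Shader.Find("Hidden/Custom/StylizedFog"));
  if (settings.fogGradient.value != null)
   sheet.properties.SetTexture("_FogGradient", settings.fogGradient.value);
  sheet.properties.SetFloat("_Spread", settings.spread.value);
  context.command.BlitFullscreenTriangle(context.source, context.destination, sheet, 0);
 }
}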

The cool part is that there are only three code files required: a shader, a script and an editor script - wonderful! They really made it super user-friendly to extend their system with custom effects. This is a screenshot of my programmer-art terrain + Stylized Fog with the Firewatch gradient applied:

Just imagine what an artist could do with this!

If you plan to use this, be aware that this effect replaces the regular fog in Unity. In other words you need to disable fog in the lighting settings:

If you enable fog in the lighting settings, the post process effect will be disabled and vice versa.

Be aware that this effect needs the depth buffer, so you should not use MSAA: the depth buffer will have no AA. Instead, enable a screen-space AA effect to fix this. This is the setting I used for the above screenshot:

Of course I added all this to my Unity Toolset repo, so have fun! I'm eager to receive any feedback on this!

[Edit] I put some extra work into this, as it turned out to be not completely ready, read about it here.

29 July 2018

Take a screenshot in Unity

I am working on a mobile app for the Home Poker Tournament Manager (HPTM) and while preparing the Google Play Store release I had to create some screenshots to display in the store. In the editor I can easily select the different resolutions and aspect ratios that are supported by the app, so I googled for the best way to take a screenshot from within the editor. It surprised me how many different solutions exist for this mundane task.

A lot of people write scripts that take a camera in the scene and render a screenshot from that. There are basically three methods that these scripts can use.

  • Texture2D.ReadPixels can even capture a part of the screen.
  • A slightly different way is to use a RenderTexture, but that is almost the same as using ReadPixels, since at some point you'll need that call anyway. Instead of reading from the default render target, you read from a custom render texture.
  • ScreenCapture.CaptureScreenshot, which replaces the deprecated Application.CaptureScreenshot. The main advantage is that this method also captures UI that is rendered as "Screen Space - Overlay". The HPTM is almost entirely rendered in overlay, so this proved to be the only viable option.

A possible implementation of all three of these methods can be found here.

I read some posts where users complained that this should be a default feature of the editor, supported by Unity itself. But reading all the different posts and comments on the forums and Unity Answers, it seems to me that there are so many different use cases that it would be impossible for Unity to come up with a standard way to capture screenshots that is useful to all developers (and simple to configure and use).

There are also some plugins available on the Asset Store that promise to be awesome tools. I only tried this one, and it did not fit my needs. Perhaps there are paid solutions that cover it all, but since it is so easy to write yourself, why bother?

Another "disadvantage" of all these scripts is that you need to attach them to some gameobject in your scene, while all I want is some editor code that has no impact on the game at all. It shouldn't even be compiled together with the game code! So I quickly wrote my own version of a screen capture script, adding yet another one to all the others than can be found on the internet :)

I added it to my UnityTools repo. The main advantages of this script? It uses the ScreenCapture functions, generates filenames with increasing numbers and has a very, very simple UI.
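
For the curious, here is a minimal sketch of that idea: an editor-only window that calls ScreenCapture.CaptureScreenshot with an incrementing filename. This is a simplified illustration, not the exact script from the repo, and the menu path and folder name are made up:

using System.IO;
using UnityEditor;
using UnityEngine;

// Lives in an Editor folder, so it never gets compiled into the game itself.
public class ScreenshotWindow : EditorWindow
{
 private string folder = "Screenshots";

 [MenuItem("Tools/Screenshot Window")]
 private static void Open() => GetWindow<ScreenshotWindow>("Screenshots");

 private void OnGUI()
 {
  folder = EditorGUILayout.TextField("Folder", folder);
  if (GUILayout.Button("Capture"))
  {
   Directory.CreateDirectory(folder);

   // Find the next free number so existing screenshots are never overwritten.
   int counter = 0;
   string path;
   do
   {
    path = Path.Combine(folder, $"screenshot_{counter++:D3}.png");
   } while (File.Exists(path));

   // Captures the game view, including UI rendered as "Screen Space - Overlay".
   ScreenCapture.CaptureScreenshot(path);
  }
 }
}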

Always happy to receive feedback!

10 July 2018

Bend the world with Unity's Shader Graph

Hurray! Unity 2018.2 is out and that means: vertex positions are now accessible in graph shaders!

This means that the world bend shader from this previous post can now be implemented as a node! The cool part is: I only added one .cs file containing a custom node and it works with the existing code.

Add the "World Bend" node into the graph, and define 3 properties "Horizon", "Spread" and "Attenuation" (check my previous post to see what they are for). Connect the output as input for the position of your Master Node. Do not expose the three properties, but give them these reference values respectively: "_HORIZON", "_SPREAD", "_ATTENUATE". That way they receive the global values set by the World Bender script on the camera. And that's it! There is only one extra file:

using System.Reflection;
using UnityEditor.ShaderGraph;
using UnityEngine;

[Title("Custom", "World Bend")]
public class BendNode : CodeFunctionNode
{
 public BendNode()
 {
  name = "World Bend";
 }

 protected override MethodInfo GetFunctionToConvert()
 {
  return GetType().GetMethod("Bend",
   BindingFlags.Static | BindingFlags.NonPublic);
 }

 static string Bend(
  [Slot(0, Binding.ObjectSpacePosition)] Vector3 v,
  [Slot(1, Binding.None)] Vector1 Horizon,
  [Slot(2, Binding.None)] Vector1 Spread,
  [Slot(3, Binding.None)] Vector1 Attenuate,
  [Slot(4, Binding.None)] out Vector3 Out)
 {
  Out = Vector3.zero;
  return @"
{
 Out = mul (unity_ObjectToWorld, v);
 float dist = max(0.0, abs(Horizon.x - Out.z) - Spread);
 Out.y -= dist * dist * Attenuate;
 Out = mul(unity_WorldToObject, Out);
}";
 }
}
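
For completeness, the globals themselves come from the World Bender script on the camera. Here is a minimal stand-in sketch of that side (the real script is in the previous post and the repo; the parameter values here are just examples):

using UnityEngine;

[ExecuteInEditMode]
public class WorldBenderSketch : MonoBehaviour
{
 public float horizon = 0f;
 public float spread = 20f;
 public float attenuation = 0.0001f;

 private void Update()
 {
  // Push the bend parameters to the globals the graph properties reference.
  Shader.SetGlobalFloat("_HORIZON", horizon);
  Shader.SetGlobalFloat("_SPREAD", spread);
  Shader.SetGlobalFloat("_ATTENUATE", attenuation);
 }
}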

As always, you can find a small demo on my unity toolset repo. I'm open for feedback!

06 April 2018

Trashing the cache: C++ vs C#

In my previous post I showed how data locality can improve your program's performance. Since then I have been wondering whether these same optimizations are valid in C# too. (If you haven't read the previous post you'd better do that first, or this won't make much sense)

Integer array

So again we loop over an array of integers with increasing step size, inspired by this blogpost. The goal is to illustrate the impact of cache fetches, where we need to fetch data from memory into the cache instead of having a cache hit. In C# the code is almost identical to the C++ version:

const int length = 64 * 1024 * 1024;
var arr = new int[length];
for (int k = 1; k <= 1024; k = k * 2)
{
  for (int i = 0; i < length; i += k)
    arr[i] = i;
}

We measure how long it takes to go through the array and perform an operation on each value, with the step size increasing in powers of two.
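
The timing itself is nothing fancy: a Stopwatch around the inner loop, roughly like this sketch (the exact measuring code doesn't matter, only the relative numbers do):

using System;
using System.Diagnostics;

class CacheTest
{
 static void Main()
 {
  const int length = 64 * 1024 * 1024;
  var arr = new int[length];

  for (int k = 1; k <= 1024; k = k * 2)
  {
   var sw = Stopwatch.StartNew();
   for (int i = 0; i < length; i += k)
    arr[i] = i;
   sw.Stop();
   Console.WriteLine($"step {k}: {sw.ElapsedMilliseconds} ms");
  }
 }
}

That gave me these results: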

That's quasi identical to the C++ version:

In other words, we're just as memory-fetch-time bound in C# as we are in C++. This should not come as a surprise.

GameObjects

We create a similar GameObject class as in the C++ version:

class GameObject3D
{
  public float[] transform = {
    1,0,0,0,
    0,1,0,0,
    0,0,1,0,
    0,0,0,1
  };
  public int ID;
}

Here we stumble upon a difference between C++ and C#: creating an array of these objects in C++ allocates the memory and initializes every instance with the default constructor. In C# we must write that initialization ourselves:

GameObject3D* arr = new GameObject3D[length];

vs

GameObject3D[] arr = new GameObject3D[length];
for (int i = 0; i < length; i++)
  arr[i] = new GameObject3D();

The timings are very similar. Again, we are bound by cache speed:

Here's my alternative again:

class Transform
{
  public float[] values = {
    1,0,0,0,
    0,1,0,0,
    0,0,1,0,
    0,0,0,1
  };
}

class GameObject3DAlt
{
  public Transform transform;
  public int ID;
};

Which yields this graphic:

Wait, what? What happens here? The chart has a different shape than the C++ version and is almost twice as high. C# clearly introduces some overhead here. Joachim Ante's talk on Unity's new Entity Component System gave me the idea to try a struct version of our classes: because we are creating the game object instances one by one, they're not guaranteed to be located next to each other in memory. So I tried this instead:

struct SGameObject3DAlt
{
  public Transform transform;
  public int ID;
};

Unlike arrays of classes, arrays of structs are co-allocated in memory and initialized, causing the same speeds as we had in C++ (even a little bit faster!):

Ok, that's it. We re-created all the tests in C# and achieved quasi-identical results. In conclusion: C# is just as memory-fetch-bound as C++, and in C# you need to think carefully about whether or not you want to use structs instead of classes. But the fact that we achieve at least the same timings also shows that C# is as fast as C++ in these rather contrived tests.

24 February 2018

Trashing the cache

Recently I switched jobs. I now pretend to know a thing or two about game development as a lecturer at Howest DAE.

One of the courses I'm teaching has the beautiful name "Programming 4". The topics are software design patterns, memory management, threading and networking.

The first week we talked about the cache and the CPU pipeline. I have always known about these things, but it wasn't until I had to explain the concepts to students that I really took an in-depth look at them.

Integer array

When memory is used in your code, it gets fetched from main memory (RAM) into the cache of your CPU. When you need a single integer it won't go and fetch that single value; instead it will fetch an entire cache line, containing - among other things - the integer you requested. If the data next to the integer is used next by your program, it won't have to be fetched anymore and you can use it immediately. That's a cache hit. When the data is not there and you need another fetch from memory, that's a cache miss. Check out the data locality chapter from Game Programming Patterns by Robert Nystrom for more details.

The first assignment was to loop over an array of integers with increasing step size, inspired by this blogpost. The goal is to illustrate the impact of cache-misses, where we need to fetch data from memory into the cache instead of having a cache-hit, where we use the data already in the cache.

Consider this loop:

const unsigned long length = 64 * 1024 * 1024;
const auto arr = new int[length];
for (auto k = 1; k <= 1024; k = k * 2)
{
 for (auto i = 0; i < length; i += k)
  arr[i] = i;
}

We measure how long it takes to go through the array and perform an operation on each value, with the step size increasing in powers of two.
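
The timing is nothing fancy: a std::chrono timer around the inner loop, roughly like this sketch (the measuring code here is just an illustration, only the relative numbers matter):

#include <chrono>
#include <cstdio>

int main()
{
 const unsigned long length = 64 * 1024 * 1024;
 const auto arr = new int[length];
 for (auto k = 1; k <= 1024; k = k * 2)
 {
  const auto start = std::chrono::high_resolution_clock::now();
  for (unsigned long i = 0; i < length; i += k)
   arr[i] = i;
  const auto end = std::chrono::high_resolution_clock::now();
  const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  printf("step %4d: %lld ms\n", k, static_cast<long long>(ms));
 }
 delete[] arr;
 return 0;
}

That gave me these results: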

The vertical axis is the time it took to go through the array, the horizontal axis is the step size. The red line is what we would expect: every step we multiply the step size by two, so we only do half as many operations as before, and we would expect the time to be cut in half too. But this is not the case.

Instead, the time levels off at some kind of minimum, and only once the step size goes over 16 do we start to cut the time in half with every step. What happens?

My cache line size is 64 bytes, so we can fit 16 integers in one cache line. When we loop over the array 16 integers will be loaded at once in the cache. Consider this image:

With step size 1 we use every integer in the cache line. With step size 2 we use every other integer, with step size 4 every fourth integer, and so on. With step size 32, however, we skip an entire cache line fetch that we needed before, and with step size 64 we skip three cache line fetches, etc. It's only when we need fewer cache fetches that our timings go down!

In other words, although we did less and less computations with every step size, we could not get faster than the time it takes to fetch the array from memory into cache. We were "cache-bound".

GameObjects

So instead of integers, let's have an array of these GameObjects and perform an operation on the ID field:

struct Transform
{
 float matrix[16] = { 
  1,0,0,0,
  0,1,0,0,
  0,0,1,0,
  0,0,0,1 };
};

class GameObject3D
{
public:
 Transform transform;
 int ID;
};

That results in this graphic:

This time it almost follows the expected line. It confused the students: why is there no bump now? The reason is this: the GameObject3D is 68 bytes large (a 64-byte Transform plus a 4-byte int) - we need to fetch at least two cache lines from memory to get a single GameObject3D into the cache. This means that when we process half the number of GameObjects, we also need half the number of cache lines filled, so it will take half as much time.

Then consider this alternative:

class GameObject3DAlt
{
public:
 Transform * transform;
 int ID;
};

We now have a pointer to the transform instead of the entire struct. This results in this graphic:

Ah, there is our bump again! But isn't this worse than the previous one? Not really - compare them:

The alternative game object is processed way faster now, because we only need one memory fetch to process 8 game objects (the object is 8 bytes large). By fitting more objects into one cache line we gain more speed - we are definitely cache-bound here.

This only shows that it's important to consider the layout of your game object data in memory. The smaller they are, the more objects fit in a single cache line.

It also means that when your object is (roughly) the same size as your cache line size, it does not really matter anymore where it's located, you will have to fetch it from memory every time anyway and there's nothing else that can come along.

I thought this was cool ...

07 January 2018

Home Poker Tournament Manager

Again I participated in the Finally Finish Something Jam, as I did last year with Hopper.

This year I re-made the pokerroom home game organizer, which was a very nice and beautiful program to host a poker tournament at home with. The visuals were nice, the setup was easy. This is how it looked:

It had only two flaws: it was made in Flash and it only worked at an 800x600 screen resolution. I did use it for years, however, every time setting my screen resolution to 800x600 during the tournament. Annoying if you had to do anything else on the PC in the meantime.

Last summer I decided to start on a remake with Unity, and today I finished it, kicked in the butt by the Finally Finish Something Jam 2018.

This time I made the UI with the Kenney Game Assets, a very nice and comprehensive library of game assets; the backgrounds I just googled, and the same goes for the sounds.

So with a little bit of pride I announce the Home Poker Tournament Manager, let me know what you think!

HPTM in action: