Quaternions & Gimbal Lock

In any software involving 3D-space computations, two representations for rotations pop up quite commonly: Euler angles and quaternions. Euler angles are intuitive: they provide an angle of rotation around each axis (x, y, z), and all possible rotations can be realized with them. The exact final rotation depends on what the software or math library does with the Euler angles, namely, what order the rotations are applied in, and whether they are intrinsic (each angle rotates around a local axis) or extrinsic (each angle rotates around a fixed axis). Because they are so intuitive, most people like to work with this representation for rotations; however, those who do might end up encountering the phenomenon of gimbal lock.

Gimbal lock is what happens when you lose a “degree of freedom”. So what does that mean exactly? Let's look at a situation we can all relate to: the human body. If you stand up with your head in its natural position, we can consider this the “unrotated” state. Now, one way to look to your right or left would be to turn your head, and this would be a rotation around the “up” axis; for the purpose of this example, though, let us perform the same rotation by using our legs to turn on the spot. Another degree of freedom would be looking up or down; as a continuous movement this is just tilting your head up and down, and can be considered a rotation around the “right” axis. The last degree of freedom would be tilting your head to the left or right, which is a rotation around the “forward” axis, as it is a rotation around the axis you are looking down. Notice that in the unrotated state, each of the three actions (rotate on the spot with your legs, nod up or down, tilt left or right) has a distinct effect on your orientation.

Now from the unrotated state, look up at the sky, i.e. tilt your head straight up, and then try either of the two other actions (rotate with your legs, or tilt your head left/right). You should notice that whichever you choose, you end up performing the same rotation. Regardless of whether you try to rotate around the up axis or around the forward axis, the result is the same, and this is what we call gimbal lock. As mentioned before, for Euler angles the rotations around the three axes are carried out in a specific order, and for gimbal lock to occur, the rotation applied second must be either 90 or -90 degrees (or equivalently 270/-270). As it turns out, our body naturally follows the order of up axis, then right axis, then forward axis, and so the right axis is the axis that can cause gimbal lock, which is exactly what we observed (tilting our head straight up is a 90-degree rotation around our right axis).

Quick aside: the common rotation order game engines use (at least Unity and UE4) is precisely the order above. You might have heard about yaw/pitch/roll: yaw is the angle of rotation around the up axis, pitch is the angle of rotation around the right axis, and roll is the angle of rotation around the forward axis. Most software (but not all) will use the order yaw/pitch/roll, and this is usually intuitive to users because it is how our body behaves. Using this sequence, gimbal lock will occur in a situation we are unlikely to encounter (the ideal sequence for a given use case is the one in which the axis least likely to reach a 90-degree rotation is placed in the middle).

Gimbal lock is not in itself always an issue, however. The cause of gimbal lock (that we apply three rotations in sequence) sometimes makes physical sense. Think of a first-person camera in a game: the aim of the camera is to reasonably simulate how our vision would behave if we were the character, and thus the constraints that led to gimbal lock aren't things we should throw out, because the goal of the camera is to mimic the human experience. On the other hand, if we were building a game with a spaceship, and the spaceship had thrusters attached that enabled rotation, then we would never want to lose any degrees of freedom, and backing such a system with Euler angles is likely going to result in trouble.

When people encounter gimbal lock, they will typically google the issue, and the common advice is to use quaternions because “they do not suffer from gimbal lock”. While this advice is technically true, it is also misleading. In fact, a game engine like Unity internally stores a quaternion to represent the rotation of a game object, and yet it can still gimbal lock. To see this, simply set the x rotation of a game object to 90 degrees and you will see that changing the value of the y or z rotation has the same effect. This is because while the data structure storing the rotation is a quaternion, the quaternion is still being built by applying a rotation around each of the three axes in sequence. A good resource for building quaternions with different rotation sequences is: https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19770019231.pdf

Similarly, other naive use cases might be something like the following code:

Image3

This code is “using” quaternions, yet it does nothing different from how the engine internally builds the final rotation quaternion from the Euler angles, and thus it also suffers from gimbal lock. The fact is that gimbal lock occurs because we apply three rotations in sequence, where the second rotation can make the third rotation redundant with the first. It doesn't matter whether you use Euler angles, quaternions or rotation matrices to do this; the result is the same.

So then why do we say Euler angles suffer from gimbal lock and that quaternions do not? Well, using the sequence yaw/pitch/roll, we showed we have gimbal lock when the pitch is 90 degrees. Modifying the yaw and the roll now has the same effect, contrary to what we want. Say you wish to rotate around the current up axis; however, modifying yaw rotates around the current forward axis (as does modifying roll, which is to be expected). So we start with this:

Image4

And want to rotate around the green axis, well it turns out to rotate 45 degrees around the current up axis, we require the following Euler angles:

Image5

The first thing to note is that, to break out of gimbal lock and get a rotation around the up axis, we had to modify the rotation around the right axis (the only way to break out of gimbal lock is to modify the value of the axis causing it). Another thing to note is that suddenly we have -90 for our yaw and roll. That's interesting… what if we made yaw and roll -90 in the first rotation, where we were gimbal locked?

Image6

We find that <90, 0, 0> is the same rotation as <90, -90, -90>. You might now say, well, if <90, 0, 0> and <90, -90, -90> are the same rotation, and changing x to 135 gave us the rotation we wanted, what about <135, 0, 0> ?

Image7

That is not at all what we wanted: our x axis hasn't moved at all when our goal was to tilt it up. But of course the x axis wasn't going to move, because all we have done is increase the rotation around x.

So here is the crux of the issue: when we are gimbal locked, there are an infinite number of Euler angles which can represent our current rotation. In our case, when pitch is 90 degrees, yaw and roll simply rotate in opposite directions to each other, so as long as yaw and roll are equal, we have the same rotation. That is to say, <90, 0, 0> is the same rotation as <90, 10, 10> or <90, -80, -80> or <90, 7, 7>, etc. However, a modification to the pitch will not behave the same way for all of these. Let's apply the operation of subtracting 5 from the pitch: we find that <85, 0, 0> does not give the same rotation as <85, 10, 10> or <85, -80, -80> or <85, 7, 7>; in fact, each of those rotations is different. This is a problem. We identified that to break out of gimbal lock we must modify the pitch, yet modifying the pitch has a different effect depending on which of the infinite Euler angles we had, even though they all represented the same rotation! This is a big red flag: our goal is to have a representation for rotations, yet applying the same operation to the same rotation gives different resulting rotations. Clearly, Euler angles are not a perfect representation for rotations!

On the other hand, consider a quaternion representation for rotations, starting with the issue of degeneracy (where several Euler angles mapped to the same rotation):

Image8

When we run the code above in Unity, we get the following output:

Image9

And this is what we would like to have from whatever representation we use for rotations: a 1:1 mapping between a rotation and its representation. I will take a moment here to address a concern readers might have with my statement about a 1:1 mapping, which is that there are actually two quaternions that represent any given rotation. If you negate all the signs of a quaternion, it represents the same rotation:

Image10

In mathematical terms, this is called a “double cover” (there are two quaternions that represent a given rotation), and the reason applying the rotation q4 or q5 has the same result has to do with how the multiplication operator works with quaternions. For a more in-depth analysis of the multiplication operator, the double cover, and how complex numbers are used in quaternions, I recommend the following post: http://www.reedbeta.com/blog/why-quaternions-double-cover/

The key thing however is that there is not an infinite number of quaternions to represent a single rotation (as is the case with Euler angles). Even more importantly, applying the same operator to q4 or q5 yields the same rotation:

Image11

Note q1/q2/q3 are the same quaternions we built earlier, all equivalent to q4, and q5 is q4 with all its signs negated. Notice that when we apply the additionalRot quaternion to our five quaternions, the results are the same. The first four match identically because the original quaternions were identical, while the last print statement has its signs negated compared to the first four. But remember, we said that each rotation has two representations with the same numerical values and opposing signs, and we can clearly see the last print statement is the previous four negated, and thus represents the same rotation.

In the end we demonstrated that despite being at a rotation where Euler angles would suffer from gimbal lock, quaternions do not suffer from degeneracy at this rotation, and thus an operator on the quaternion group yields the desired rotation consistently. This consistent nature is what makes them an ideal representation for rotations (among other things like SLERP and compact representation, which is not the focus of this post).

 

It might be helpful to visualize the situation, so let's do that before concluding this post. We will collapse the situation down a dimension, so our Euler angles will have two components instead, and our quaternion will have three components instead (still normalized, though). Notice that two angles naturally parameterize a torus (each angle is an angle of rotation in a circle):

Image12 - Source Wikipedia

Also notice that a vector of three components that has a magnitude of one (is normalized) naturally parameterizes the surface of a unit sphere (a sphere with radius of one).

Now we would like to look at the mapping from the torus to the surface of a sphere. A way to go from two angles to a point on the surface of a sphere is to treat the two angles as spherical coordinates and use the following equations:

Image13
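For reference, these are the standard spherical-to-Cartesian equations (assuming the usual convention, with theta measured from the +z axis, which is consistent with the discussion below):

```latex
x = r \sin\theta \cos\varphi \qquad
y = r \sin\theta \sin\varphi \qquad
z = r \cos\theta
```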

Let us treat theta as the angle for the smaller circle of the torus, and phi as the angle for the larger circle. Consider then the following two green points on the torus:

Image14

These two points have the same theta angle of 0 (the image may not be perfectly accurate, but pretend it is!), but significantly different phi angles. However, if we take a theta of 0 and plug it into the spherical coordinates, the sine terms in x and y reduce to 0, and we are left with <0, 0, 1> regardless of the phi value, which means both of these green points map to the same point on the surface of the sphere.

Now say that on the sphere we wish to move from the point <0, 0, 1> to <0.14, 0, 0.99>; the spherical coordinates for the latter point are given by:

Image15

Where r = 1, and so theta is ~8 degrees and phi is 0. On the sphere we might see the new position (red dot) as in the image below:

Image16

And on our torus we would see something similar to the following:

Image17

The distance to the red dot is not the same from the two green points on the torus, yet on the sphere there is a single distance. This mirrors the earlier issue of several Euler angles representing the same rotation, where the same operation on these Euler angles does not yield the same result (despite them representing the same rotation, i.e. the same point on the sphere). In fact, our 2D situation is analogous to the 3D situation: Euler angles are just three angles instead of two, and a quaternion represents a point on the surface of a unit-radius 4D hypersphere (S^{3}), so the visual situation described above plays out the same way when we go a dimension higher.

Looking at this more formally through topology, we can compare the topology of the torus (T^{2}) and the unit sphere surface (S^{2}) and notice that points in the local neighborhood of a source point on the torus do not always map to the local neighborhood of that source point mapped onto the sphere (theta = 0, phi = 210 is not near theta = 8, phi = 0, despite these two points mapping very close to each other when transformed to Cartesian coordinates). This, and the fact that a single point on the sphere can map to several points on the torus (so the mapping from spherical coordinates to Cartesian coordinates is not a bijection), means the torus and the surface of a sphere are not homeomorphic. Again, if we go up a dimension, three angles naturally parameterize T^{3} (the three-torus), and a 4D unit vector naturally parameterizes S^{3} (the surface of a unit 4D sphere). The unit quaternions form S^{3}, which double covers the rotation group SO(3) (SO(3) itself is homeomorphic to the real projective space RP^{3}), which is why quaternions naturally parameterize rotations. Because T^{3} is not homeomorphic to S^{3} (nor to SO(3)), Euler angles do not naturally parameterize the rotation group, and we run into the aforementioned issues.

 

 

COM Tutorial

 

Over the course of writing code involving DirectX and the Windows API, I frequently ran into COM objects. After working with them for a while, I am considering writing objects in my DX12 abstraction layer following the design of COM. I figured that before I commit to that choice, it might be good to first write a post about COM, both to clarify my understanding and to refer back to later if something slips my mind. Hopefully this also helps any potential readers gain insight into how to work with and build COM objects.

First, COM stands for Component Object Model, and it is referred to as a binary standard. This is because COM specifies behavior at the binary level and does not depend on any particular language, nor does it specify the details of implementation. All these aspects are left to the user, with the only requirement that the standard be followed. Fundamentally, a COM object serves interface pointers to an entity that requests those interfaces. An interface can be thought of as a set of related functions, where each function is a method of the interface. A COM object has the ability to serve multiple interfaces, and thus the interface returned depends on the interface requested. This whole paradigm of a COM object serving requests can be thought of as a client-server model: the client requests a pointer to an interface from the COM object, which acts as the server. In fact, it is possible to have these requests serviced over a network like a traditional client-server setup, as there is no requirement that requests come from the same process or even the same machine. For a rigorous definition of the client-server aspect of COM, you may refer to: https://docs.microsoft.com/en-us/windows/desktop/com/com-clients-and-servers

I will use the terms COM object and server interchangeably in this post depending on the point I am attempting to make. That being said, since the programming I do involves DirectX COM objects, which are fundamentally in-process servers, I have not delved into COM over a network. As such, this tutorial will only cover using COM objects within the same process. For a more detailed exploration of COM objects, I recommend Microsoft's documentation: https://docs.microsoft.com/en-us/windows/desktop/com/the-component-object-model

The IUnknown Interface

As COM objects involve returning pointers to interfaces, you can imagine that there are actual interfaces involved in the code, and indeed there will be composition and aggregation. The first interface you will want to understand when working with COM is the base interface everything inherits from: IUnknown. The declaration and definition for IUnknown can be found in Unknwnbase.h inside the Windows Kits directory. It looks as follows:


MIDL_INTERFACE("00000000-0000-0000-C000-000000000046")
IUnknown
{
public:
    BEGIN_INTERFACE
    virtual HRESULT STDMETHODCALLTYPE QueryInterface(
        /* [in] */ REFIID riid,
        /* [iid_is][out] */ _COM_Outptr_ void __RPC_FAR *__RPC_FAR *ppvObject) = 0;

    virtual ULONG STDMETHODCALLTYPE AddRef( void) = 0;

    virtual ULONG STDMETHODCALLTYPE Release( void) = 0;

    template<class Q>
    HRESULT
#ifdef _M_CEE_PURE
    __clrcall
#else
    STDMETHODCALLTYPE
#endif
    QueryInterface(_COM_Outptr_ Q** pp)
    {
        return QueryInterface(__uuidof(Q), (void **)pp);
    }
    END_INTERFACE
};


Let's look at what we have here. First we see a macro called “MIDL_INTERFACE”, which expands to the following:


#define MIDL_INTERFACE(x) struct __declspec(uuid(x)) __declspec(novtable)


It states that we are about to declare a struct, and then applies two specification declarations. __declspec lets us attach Microsoft-specific attributes; for a full explanation and list see: https://docs.microsoft.com/en-us/cpp/cpp/declspec?view=vs-2017. The uuid(x) attribute states that the argument of the macro is the UUID of this struct. UUID stands for universally unique identifier, and every single interface defined with the intent of using it as part of a COM object requires its own UUID (we will look at generating our own UUIDs shortly). __declspec(novtable) simply says: do not generate a vtable for this struct, because it is a pure interface and will never be instantiated on its own.

The interface consists of three public pure virtual functions and the definition of a single helper function.

 

QueryInterface

The first of the pure virtual functions is QueryInterface. QueryInterface is the function a client calls to request a pointer to an interface. It first checks whether the COM object is able to serve such an interface; if so, it provides a pointer to it, otherwise a null pointer is provided and an E_NOINTERFACE error is returned. You can think of this as a COM object holding pointers to several vtables, where each interface has its own vtable, and the COM object can serve many different interfaces.

The first argument to QueryInterface is a REFIID, which is just a macro for a const UUID reference. This is used to determine which interface the client is requesting, and to ensure that we provide the correct interface pointer if possible. The second argument is a pointer to a pointer, and this is where QueryInterface will store the interface pointer. As such, when the function is done, ppvObject points to the requested interface pointer.

 

Reference Counting

COM objects implement their own reference counting. This is because a client is incapable of determining the lifetime of a server; it has no knowledge of whether other clients are using the server or not. As such, the server keeps track of how many clients it is servicing. To do this, whenever a client requests an interface pointer from the COM object, AddRef is called. Similarly, whenever a client is finished with the interface it requested, Release is called. An important thing to note is that the reference count is not specific to an interface, but is for the entire COM object. If a COM object can serve 10 different interfaces, the moment a client requests one of them, the entire COM object is alive. If a different client comes along and requests some other interface, there would now be two references to this server, and the server will not shut down until both clients are done with the interfaces they requested. The moment the last client releases the interface it requested, the server will shut down. In the case of our COM object, this will result in the object deleting itself.

The last QueryInterface is just an overload that acts as a helper function. Instead of a uuid and a pointer to a void pointer, you can just pass in a pointer to a pointer of the interface you want, and the template function will get the uuid for you automatically.

 

Building Our Own COM Object

Armed with the above knowledge and an understanding of IUnknown, we are now in a position to build our own COM object. For the purpose of this tutorial we will build a Windows console application. We will write a COM object that serves an interface to a basic text printer (it will literally just print out a given string) and a basic calculator.

 

Definitions:

We will start with a header file that declares all the interfaces for our COM object. The header file will be called “COMServer.h”, and the first thing we will put in it is:


#pragma once
#include <Unknwnbase.h>

MIDL_INTERFACE("4eb23a5f-8445-4963-98d3-2e1e1ca670fa")
ICalculator : public IUnknown
{
public:
    virtual double STDMETHODCALLTYPE Add(const float& v1, const float& v2) = 0;
    virtual double STDMETHODCALLTYPE Subtract(const float& v1, const float& v2) = 0;
};


We include <Unknwnbase.h> to get the declaration for the IUnknown interface, and then we define the interface ICalculator, which inherits from IUnknown. At this point you might be wondering where I got the UUID from. It is very easy to generate your own: if you navigate to the bin folder of your Windows Kits directory, you will find an exe called uuidgen.exe. For me the directory was C:\Program Files (x86)\Windows Kits\10\bin\10.0.17763.0\x64, and the command I ran was “uuidgen.exe -i -oD:/MyApp.idl”, which told the exe to generate an interface file (-i) and output it (-o) to my D drive as MyApp.idl. Inside that file you will find the UUID that was generated. Use “uuidgen.exe -h” to get a list of all the possible commands.

In the same manner, we will define an IPrinter interface:


MIDL_INTERFACE("0ed09391-f034-4efe-9498-cf698932fc04")
IPrinter : public IUnknown
{
public:
    virtual void STDMETHODCALLTYPE Print(LPCSTR str) = 0;
};


Now we are ready to declare and define the server class itself. Notice that so far, all we have done is create interfaces; there is as of yet zero implementation for anything. This is where the server class comes in. Typically the server class will be hidden from the client, such that the client has no notion of its existence; the only API exposed to the client would be the interfaces. For this tutorial, as all the source code is in a single project, the server class will be exposed to the client, but in practice the server class would be compiled into a DLL, and the header which exposes the API of this DLL would not contain anything pertaining to the server class. As I do not want to distract from the focus of COM, making a DLL will not be part of this tutorial, but I mention it for clarity.

We will call the internal class “InternalServer”. It will inherit the ICalculator and IPrinter interfaces and look as follows:


class InternalServer : public ICalculator, public IPrinter
{
public:
    //IUnknown Interface Methods
    virtual HRESULT STDMETHODCALLTYPE QueryInterface(REFIID riid, void** ppvObject) override;
    virtual ULONG STDMETHODCALLTYPE AddRef() override;
    virtual ULONG STDMETHODCALLTYPE Release() override;

    //ICalculator Interface Methods
    virtual double STDMETHODCALLTYPE Add(const float& v1, const float& v2) override;
    virtual double STDMETHODCALLTYPE Subtract(const float& v1, const float& v2) override;

    //IPrinter Interface Methods
    virtual void STDMETHODCALLTYPE Print(LPCSTR str) override;

private:
    InternalServer();
    ~InternalServer();

    friend HRESULT STDMETHODCALLTYPE CreateCalculatorComponent(REFIID riid, void** ppvObject);
    friend HRESULT STDMETHODCALLTYPE CreatePrinterComponent(REFIID riid, void** ppvObject);

private:
    ULONG m_RefCount;
};


As InternalServer holds the implementation of all the interfaces, we must have a matching virtual override for every interface function we inherit. Keep in mind that ICalculator and IPrinter both inherit from IUnknown, and this can cause inheritance ambiguity, but we will address this later. We have a private reference counter which will be used to manage the server's lifetime. I have made the constructor private to ensure nobody unauthorized can create a server: if someone else works on this code later, I do not want them mistakenly thinking that calling the constructor directly is a good idea. The server should only be instantiated via a call to a “Create” function. To that end, we have two Create functions declared as friend functions so that they can access the constructor of the server. We will look at these Create functions and discuss them more later.

We will complete our header file for now by declaring the signatures for the two friend functions, which are fundamentally just free functions:


HRESULT STDMETHODCALLTYPE CreateCalculatorComponent(REFIID riid, void** ppvObject);
HRESULT STDMETHODCALLTYPE CreatePrinterComponent(REFIID riid, void** ppvObject);


 

Implementation:

We can now move onto writing the source file for this header, which has the implementation for everything we have defined so far.

 

Constructor & Destructor

For construction of our COM server, we just need to initialize the reference count to 0 (note that a built-in member like ULONG is not implicitly zero-initialized by a defaulted constructor, so the explicit initialization matters). We also print a debug statement so that we can see things in action:


InternalServer::InternalServer() :
    m_RefCount(0)
{
    std::cout << "Creating Server" << std::endl;
}

InternalServer::~InternalServer() {
    std::cout << "Destroying Server" << std::endl;
}


 

QueryInterface Method

We start with the QueryInterface method, which looks as follows:


HRESULT STDMETHODCALLTYPE InternalServer::QueryInterface(REFIID riid, void** ppvObject)
{
    HRESULT result = S_OK;
    if (riid == __uuidof(IUnknown))
    {
        //reinterpret_cast will not work, as the pointer address does not change when using reinterpret_cast
        *ppvObject = static_cast<IUnknown*>(this);
        this->AddRef();
    }
    else if (riid == __uuidof(ICalculator))
    {
        *ppvObject = static_cast<ICalculator*>(this);
        this->AddRef();
    }
    else if (riid == __uuidof(IPrinter))
    {
        *ppvObject = static_cast<IPrinter*>(this);
        this->AddRef();
    }
    else
    {
        *ppvObject = nullptr;
        result = E_NOINTERFACE;
    }
    return result;
}


This method is fairly straightforward: based on the riid, we figure out which interface the client is requesting, and then we provide a pointer to the requested interface in ppvObject. If the interface is not found, we return E_NOINTERFACE; otherwise S_OK. This is actually a really powerful feature because, as the method name implies, we allow a client to query whether the server has the interface. If the server cannot provide the interface, the situation is dealt with gracefully and the behavior is well defined, such that the client can handle it without resorting to exceptions. If the client receives S_OK, it knows to proceed as planned; otherwise it must handle the case of the server not being able to service the requested interface. Each time an interface is served, the server increases its internal reference counter. As such, only when none of the server's interfaces are being used by any client does the server destroy itself. This should reinforce the fact that the reference count is global, and that a client holding a pointer to any one of the many possible interfaces is enough to keep the entire server alive.

Notice how we are leveraging multiple inheritance to elegantly implement this functionality. By typecasting the pointer to the server itself to the requested interface pointer, we essentially return the pointer to the vtable that has all the interface methods. There is however one thing we need to address.

If you are programming along with the tutorial, you might have noticed an inheritance issue: the static_cast<IUnknown*>(this) will not compile because of the error “base class IUnknown is ambiguous”. This is because InternalServer inherits from both ICalculator and IPrinter, which both in turn inherit from IUnknown, so we indirectly inherit from IUnknown twice. The way this is currently structured, there will actually be two instances of IUnknown inside the memory layout of InternalServer, so it should be clear why this typecast is ambiguous. This is where virtual inheritance comes in. If virtual inheritance is new to you and you would like to read up on it, I wrote another blog post on the topic which can be found at: https://jackmin.home.blog/2018/12/23/multiple-inheritance-virtual-inheritance-in-visual-c/

Thus the definitions for ICalculator and IPrinter will now virtually inherit from IUnknown, like the following:


MIDL_INTERFACE("4eb23a5f-8445-4963-98d3-2e1e1ca670fa")
ICalculator : public virtual IUnknown
{
public:
    virtual double STDMETHODCALLTYPE Add(const float& v1, const float& v2) = 0;
    virtual double STDMETHODCALLTYPE Subtract(const float& v1, const float& v2) = 0;
};

MIDL_INTERFACE("0ed09391-f034-4efe-9498-cf698932fc04")
IPrinter : public virtual IUnknown
{
public:
    virtual void STDMETHODCALLTYPE Print(LPCSTR str) = 0;
};


 

AddRef/Release

The purpose of the AddRef and Release functions is to control the lifetime of the COM server. They look like the following:


ULONG STDMETHODCALLTYPE InternalServer::AddRef()
{
    //Return the result of the interlocked operation rather than re-reading
    //m_RefCount, which another thread may have changed in the meantime
    return InterlockedIncrement(&m_RefCount);
}

ULONG STDMETHODCALLTYPE InternalServer::Release()
{
    ULONG refCount = InterlockedDecrement(&m_RefCount);
    if (refCount == 0) {
        delete this;
    }
    return refCount;
}


The key thing to note here is that when the reference count hits 0, the COM server deletes itself. This way clients do not need to concern themselves with lifetime issues; the server has full knowledge of when its services are requested, and when none of them are needed anymore, it deletes itself.

 

Server Service Functions

Now that all the IUnknown functions have been implemented, we need to implement the actual services the server provides. In this case we have Add/Subtract available from the calculator interface, and Print from the printer interface:


double STDMETHODCALLTYPE InternalServer::Add(const float& v1, const float& v2) {
    return v1 + v2;
}

double STDMETHODCALLTYPE InternalServer::Subtract(const float& v1, const float& v2) {
    return v1 - v2;
}

void STDMETHODCALLTYPE InternalServer::Print(LPCSTR str) {
    std::cout << str << std::endl;
}


 

Server Creation Functions

Remember that above we had two friend functions that each created a component and returned a pointer to it: one for making a printer, and one for making a calculator. This is where COM objects might get confusing. As should be clear by now, a calculator is not an independent object but part of a larger server object, and the same goes for a printer. This means we cannot create one of the components in isolation; we must create the whole server whenever the creation of a component is requested. This looks like the following:


HRESULT STDMETHODCALLTYPE CreateCalculatorComponent(REFIID riid, void** ppvObject) {
    if (riid != __uuidof(ICalculator)) {
        return E_NOINTERFACE;
    }
    InternalServer* pServer = new InternalServer();
    return pServer->QueryInterface(riid, ppvObject);
}

HRESULT STDMETHODCALLTYPE CreatePrinterComponent(REFIID riid, void** ppvObject) {
    if (riid != __uuidof(IPrinter)) {
        return E_NOINTERFACE;
    }
    InternalServer* pServer = new InternalServer();
    return pServer->QueryInterface(riid, ppvObject);
}


Notice how both functions will issue a call to create a new InternalServer object. Remember that the strength of COM objects is that we can query interfaces. As such, client code would call either CreateCalculatorComponent OR CreatePrinterComponent based on whatever service it first needs. Then if later one of the other interfaces available from the COM server is required, we call QueryInterface on the pointer to the current interface. That is, if I want a pointer to the Printer interface and already have a pointer to the Calculator interface, I do not call CreatePrinterComponent, I call QueryInterface on the calculator interface pointer and request a pointer to the printer interface. A more robust implementation will likely only allow a single creation function to drive home the fact that each call to Create will spawn a new COM server. Or alternatively, we might have a factory for all these services. The first time the factory is told to return an interface pointer, it creates a new COM server and caches a pointer to it while returning the requested interface pointer. Subsequent calls to the factory can then simply use the cached COM server to query pointers to the requested interface. All this is to say that the above code is not the best way to create the server, however it does a good job of illustrating the design choices to consider when writing your COM server.

 

Testing It Out

Alright! The above has all our definitions and implementation of the COM server. Let us try and see if it works!

We will use the following code to drive our testing:


#include <iostream>
#include "COMServer.h"

int main()
{
    ICalculator* calculator = nullptr;
    CreateCalculatorComponent(IID_PPV_ARGS(&calculator));
    std::cout << calculator->Add(3.f, 5.f) << std::endl;
    std::cout << calculator->Subtract(8.f, 3.f) << std::endl;
    std::cout << "Releasing calculator" << std::endl;
    calculator->Release();
    calculator = nullptr;

    CreateCalculatorComponent(IID_PPV_ARGS(&calculator));
    IPrinter* printer = nullptr;
    calculator->QueryInterface(IID_PPV_ARGS(&printer));
    printer->Print("Testing the print function!");
    std::cout << "Releasing calculator" << std::endl;
    calculator->Release();
    std::cout << "Releasing printer" << std::endl;
    printer->Release();
}


The IID_PPV_ARGS macro is just a helper designed to automatically pass the uuid of the pointer’s type (for example the uuid of the ICalculator interface), and to typecast the address of the pointer to void** as our Create and QueryInterface functions expect.

Running the code gives us the following output:

[Screenshot: program output]

Huzzah! We can access the Add and Subtract functions of the ICalculator interface and the results are correct. Calling Release on our pointer also successfully deletes the server. We can then create a new ICalculator pointer, which will in turn create a new server. Then we can query for the IPrinter interface from the ICalculator pointer, and successfully use the Print function of IPrinter. Finally, notice how we have to release both the ICalculator component and the IPrinter component in order for the server to destroy itself. That is exactly the behavior we want! Feel free to change things around in the main function to test out the COM server, and pat yourself on the back for making it this far. But wait… there’s more 😀

 

Extending COM Servers For Updates to Services

If you look at the code so far, we have been ignoring the return values of QueryInterface and Create, which are of type HRESULT. Proper use of those functions would check whether the request succeeded, namely whether the returned HRESULT was S_OK. We can use a helper macro called SUCCEEDED to check an HRESULT. The following code is an example:


ICalculator* calculator = nullptr;
if (SUCCEEDED(CreateCalculatorComponent(IID_PPV_ARGS(&calculator)))) {
    std::cout << calculator->Add(3.f, 5.f) << std::endl;
    std::cout << calculator->Subtract(8.f, 3.f) << std::endl;
    std::cout << "Releasing calculator" << std::endl;
    calculator->Release();
}
else {
    std::cout << "Could not create a calculator component" << std::endl;
}


In the above code, if we pass a pointer to an interface type that does not match what the function requires, then E_NOINTERFACE is returned and we do not attempt to use the pointer as a calculator object. This gives us an elegant way to deal with failures without resorting to exception handling, which can get messier.

This idea can be leveraged for more powerful functionality. Let us assume we are the developers of the COM object we have made so far, and we wish to ship an update. The update will extend the calculator service to also include multiplication and division. We wish to issue this update without breaking backwards compatibility. The COM pattern lets us do this quite well. We can simply add a definition for a new interface called ICalculator2 inside our COMServer.h file:


MIDL_INTERFACE("e0d33026-b2c3-4404-b00f-76686cb6629e")
ICalculator2 : public ICalculator
{
public:
    virtual double STDMETHODCALLTYPE Multiply(const float& v1, const float& v2) = 0;
    virtual double STDMETHODCALLTYPE Divide(const float& v1, const float& v2) = 0;
};


Notice how ICalculator2 inherits from ICalculator: everything ICalculator can do, ICalculator2 can do as well, plus some extended functionality. This is a key observation; our updates cannot remove functions from previous versions, only add new ones. Now we will have InternalServer inherit from ICalculator2 instead:


class InternalServer : public ICalculator2, public IPrinter


And we will add the function overrides for Multiply and Divide in the definition of InternalServer:


// ICalculator2 Interface Methods
virtual double STDMETHODCALLTYPE Multiply(const float& v1, const float& v2) override;
virtual double STDMETHODCALLTYPE Divide(const float& v1, const float& v2) override;


We must update the implementation of InternalServer::QueryInterface so that it can also return a pointer to the ICalculator2 interface:


else if (riid == __uuidof(ICalculator2))
{
    *ppvObject = static_cast<ICalculator2*>(this);
    this->AddRef();
}


We then add the implementations of InternalServer::Multiply and InternalServer::Divide:


double STDMETHODCALLTYPE InternalServer::Multiply(const float& v1, const float& v2) {
    return v1 * v2;
}

double STDMETHODCALLTYPE InternalServer::Divide(const float& v1, const float& v2) {
    return v1 / v2;
}


And finally, we update our CreateCalculatorComponent to also accept the uuid of the ICalculator2 interface as an argument:


HRESULT STDMETHODCALLTYPE CreateCalculatorComponent(REFIID riid, void** ppvObject) {
    if (riid != __uuidof(ICalculator) && riid != __uuidof(ICalculator2)) {
        return E_NOINTERFACE;
    }
    InternalServer* pServer = new InternalServer();
    return pServer->QueryInterface(riid, ppvObject);
}


Now we can make both an ICalculator and ICalculator2 component in our driver code. We will use the following driver code in main.cpp to test that our new update works:


#include <iostream>
#include "COMServer.h"

int main()
{
    ICalculator* calculator = nullptr;
    CreateCalculatorComponent(IID_PPV_ARGS(&calculator));
    std::cout << calculator->Add(3.f, 5.f) << std::endl;
    std::cout << calculator->Subtract(8.f, 3.f) << std::endl;

    ICalculator2* calculator2 = nullptr;
    calculator->QueryInterface(IID_PPV_ARGS(&calculator2));
    std::cout << calculator2->Multiply(3.f, 5.f) << std::endl;
    std::cout << calculator2->Divide(15.f, 3.f) << std::endl;
    std::cout << "Releasing calculator" << std::endl;
    calculator->Release();
    std::cout << "Releasing calculator2" << std::endl;
    calculator2->Release();
    calculator2 = nullptr;

    CreateCalculatorComponent(IID_PPV_ARGS(&calculator2));
    std::cout << calculator2->Multiply(3.f, 5.f) << std::endl;
    std::cout << calculator2->Divide(15.f, 3.f) << std::endl;
    std::cout << "Releasing calculator2" << std::endl;
    calculator2->Release();
}


Running this we get the following output:

[Screenshot: program output]

Great! That is what we were hoping for. Both QueryInterface and CreateCalculatorComponent are valid approaches to get access to the ICalculator2 interface. The new functions of ICalculator2 also work just fine. Needless to say, Add and Subtract can also be accessed through ICalculator2. Now, the strength of COM is that we do not need to make any assumptions about which version of the COM server is available. In the case of DirectX, the DirectX dlls contain the implementations of the COM objects, and we can write client code that imports the library headers and then makes requests to create interfaces. A good example is the interface IDXGIFactory. There are several versions of the interface. If you are interested in them, check out: https://docs.microsoft.com/en-us/windows/desktop/api/dxgi1_6/nn-dxgi1_6-idxgifactory6

However, while the application developer might have the header which defines up to IDXGIFactory6, the application user may only have a dll which implements up to IDXGIFactory4. As noted before, COM objects are well equipped to allow us to handle this situation.

Going back to our own COM object, as the developer we know that ICalculator2 exists since we have the newest library, however an application user might have an old version of the COM object, and will still attempt to run our client code. Thus we will build our code as follows:


#include <iostream>
#include "COMServer.h"

int main()
{
    ICalculator* calculator = nullptr;
    ICalculator2* calculator2 = nullptr;
    if (SUCCEEDED(CreateCalculatorComponent(IID_PPV_ARGS(&calculator)))) {
        if (SUCCEEDED(calculator->QueryInterface(IID_PPV_ARGS(&calculator2)))) {
            std::cout << "User has newest version of COM object. Succeeded in creating a calculator2 component." << std::endl;
            calculator->Release();
            calculator2->Release();
        }
        else {
            std::cout << "User does not have newest version of COM object. Continuing with a calculator component." << std::endl;
            calculator->Release();
        }
    }
}


Essentially we start with the minimum version level that we are willing to allow our application code to run with. Then we QueryInterface up the versions until we reach the most recent version that the application user has and then proceed from there. A practical example of using this would be the following:


IDXGIFactory2* dxgiFactory2 = nullptr;
if (SUCCEEDED(dxgiFactory->QueryInterface(IID_PPV_ARGS(&dxgiFactory2))))
{
    // This system has DirectX 11.1 or later installed, so we can use this interface
    dxgiFactory2->CreateSwapChainForHwnd( /* parameters */ );
    dxgiFactory2->Release();
}
else
{
    // This system only has DirectX 11.0 installed
    dxgiFactory->CreateSwapChain( /* parameters */ );
}


Here a certain function is available only from the newer IDXGIFactory2; if that interface is available then we use the newer function, otherwise we resort to using the older one.

 

Well, that’s it! If you made it this far, thanks for reading, and I hope this post shed some light on the inner workings of COM objects.

Multiple Inheritance & Virtual Inheritance in Visual C++

Inheritance, and by extension polymorphism, is one of the more powerful tools of object oriented programming. Most people take the implementation of inheritance for granted; however, sometimes it is important to know what happens behind the scenes so that we are conscious of the cost and implications of our design choices.

Regular Inheritance – AKA Non-Virtual Inheritance

Whenever inheritance is involved, there must always be a base class which does not inherit from any other class; this class is called a primary class. Whenever a class has virtual functions, that class must implicitly contain an additional pointer called the vfptr, even if the class is a primary class. This is easy to check: write a class which has one or more virtual functions in its definition and no data members, and applying sizeof to the class will reveal that it is the size of one pointer. This additional pointer points to the virtual function table. The virtual function table is a table of pointers to the correct versions of the virtual functions for a given class that has or inherits virtual functions, and it is what effectively makes polymorphism in C++ possible. Each time you create a class that contains virtual functions, or you inherit virtual functions, the compiler creates a unique virtual table for that class. If this doesn’t sound familiar, I recommend reading further about virtual tables. The following link might help:

https://www.learncpp.com/cpp-tutorial/125-the-virtual-table/

One thing to note with regular inheritance is that it doesn’t matter how deep the inheritance is; the overhead is always the size of one pointer. This is because if C inherits from B, and B inherits from A, all three of these classes can share the same vfptr. A vfptr at the top of the memory layout is sufficient to provide all the indirection that A, B and C need to reach the correct virtual function table. To that end, also notice that the memory addresses of the A, B and C subobjects are all the same; namely, the “this” pointer points to the single shared vfptr.

 

Multiple Inheritance

While typically advised against, there are still situations where multiple inheritance arises as a solution to a problem. Let us consider the case where we have primary classes A and B, and then C inherits from both non-virtually.

If A has any virtual functions, then we know it must have a vfptr, and the same applies for class B. However there is no way that these vfptrs can be merged (as in the case of single non-virtual inheritance) because B does not inherit from A. The “this” pointer for the instance of C will point to either the start of the A subobject or the start of the B subobject, where the start of either subobject is its vfptr (which vfptr C’s pointer will point to depends on the compiler, I believe Visual C++ handles inheritance left to right). We can verify that there are two vfptrs by doing sizeof(C), and indeed the size comes to that of two pointers assuming no data members.

It is worth noting here that, unlike in single non-virtual inheritance, typecasting C to either A or B might actually change the memory address the pointer points to. To this end, using reinterpret_cast<A*>(pInstanceOfC) or reinterpret_cast<B*>(pInstanceOfC) is a very bad idea. This is because reinterpret_cast does not modify its argument in any way; it simply states “interpret the bits of my argument as the template type” and nothing more. However, one of either A or B does not start at the same address as C, so this will go south very fast.

Now an interesting situation with multiple inheritance is the classic diamond problem. Where we have the following:

[Figure: the diamond — D inherits from B and C, which both inherit from A]

If we follow the non-virtual inheritance method above, then we will have two instances of A: D will encapsulate an instance of B and an instance of C, each of which encapsulates its own instance of A. This may be the behavior you want, but for most purposes this causes ambiguity. If I want to typecast D to A, which A are we talking about? If I want to access a member variable of A, which A are we talking about? There are, after all, two distinct memory addresses for A because there are two instances of A. If you really want to work with two instances of A, then you can resolve these issues by first typecasting D to either B or C, and then typecasting that to A. This effectively allows you to specify which A you are talking about. However, what if we only want a single instance of A? Perhaps A is just an interface with no data members and only pure virtual functions (this is the case for the COM objects I discuss in a different blog post). This is where virtual inheritance comes in.

 

Virtual Inheritance

Virtual inheritance enables us to solve the diamond problem in multiple inheritance by guaranteeing that there is only one instance of a base class that is being virtually inherited. In code it might look something like this:

[Code sample: B and C virtually inherit from A; D inherits from both]

In this case, both B and C virtually inherit A, and now typecasting D into A is no longer ambiguous because virtual inheritance guarantees that there is only one instance of A. Seems like virtual inheritance is a good fix for our problems! Well, as with all things, there are a few gotchas. The above example is one of the better use cases for virtual inheritance, because A is an interface with no data members and thus it is straightforward to work with. A less straightforward situation might be the following:

[Code sample: A’s default constructor sets m_Val to 999; B and C virtually inherit A and forward a constructor value to it; D constructs B and C with 1 and 2]

If we were to construct D with the values 1 and 2 as constructor arguments, then what value does m_Val in the single instance of A construct to? A decent guess is “either 1 or 2”, however that is actually not the case. The answer is 999. This seems contrary to what we might expect.

Consider for a moment, how virtual inheritance might work. Both classes B and C virtually inherit A, and thus they share a common instance of A, but who constructs A and who decides where A is placed in memory? While in non-virtual inheritance B and C would be responsible for the construction of A, because they own A, that is no longer the case. In fact, the object that now owns A is the most derived class D. The exact layout of memory will depend on the compiler, the patent for Visual C++’s implementation of inheritance suggests that the resulting memory layout will look like the following:

[Figure: Visual C++ memory layout of D with the shared virtual base A]

Since D is now the owner of the instance of A, it is up to D to construct A. However in the code sample we wrote, D’s constructor never calls the constructor for A, and so the default constructor is called. This assigns the value of 999 to m_Val. Once A has been constructed, it cannot be constructed again, thus B and C never actually call the constructor for A and so val1 and val2 are ignored. However, while B and C cannot reconstruct A, they can modify A, and so we can write the following code:

[Code sample: B and C assign m_Val inside their constructor bodies]

Now, if I print m_Val after constructing D, m_Val will always be 11. A is default constructed by D setting m_Val to 999, then the constructor for B is called setting m_Val to 10, then the constructor for C is called, setting m_Val to 11. This matches the order in which D inherits from B then C, swapping the order of inheritance would cause m_Val to be 10. This means that writing a constructor that uses a member initializer list will have different behavior than writing a constructor that assigns values to members in the constructor body. You can see now why virtual inheritance might get messy. This behavior is not intuitive and might be overlooked when writing code, leading to bugs later on.

 

Another interesting thing worth looking at is how B and C know where A is in memory. After all, casting D to B must give us a fully functional B, and for that to work, B must know where to find A. However, the position of A depends on the most derived type and thus varies. To resolve this, Visual C++ implements a virtual base table pointer (vbptr from here on). The vbptr points to a virtual base table that stores the offset from the address of B to the address of A. Thus, if B’s starting address is hypothetically 12 bytes away from the starting address of A, then the vbptr points to a virtual base table that stores the offset value “12”. The offset value will be different for B and C because their starting addresses are different. The virtual base table is a different table from the virtual function table, at least in Visual C++’s implementation (this could vary for other compilers; they might merge the tables). This means that every class which virtually inherits from another class must store a vbptr to find its virtual base table. We can check by running sizeof(D) on the above code, and we confirm that the size of D is the size of an integer + the size of two pointers (one vbptr for B and one vbptr for C).

Similar to how virtual functions introduce some small overhead due to the extra level of indirection required to call those functions, virtual inheritance introduces a small overhead due to the extra level of indirection required for it to access its base class.

 

We looked at the size footprint of vbptrs in virtual inheritance, but what about vfptrs? In the case where D inherits from B and C, and B and C inherit non-virtually from A, we only need at most two vfptrs. This is because if A declares some virtual functions, and then B and C both declare some new virtual functions, then with non-virtual inheritance we can merge the virtual functions of B with those of A, and the virtual functions of C with those of A, to create two virtual function tables. Pointing to those two virtual function tables requires two vfptrs, and we can confirm this with sizeof.

However, when using virtual inheritance, two virtual tables are not sufficient. If we inherit virtually from a base class, we cannot merge the virtual function tables of the derived class and the virtual base. The reason is that the moment you virtually inherit from a base class, you are fundamentally declaring that there might exist other classes which also virtually inherit from the same base class in a given dynamic type, so more than one class may inherit from the same instance of the virtual base. This should make sense: at the end of the day, if no other class is going to inherit virtually from your virtual base, then why have a virtual base at all? If only B will inherit from a given instance of A, then you may as well just use non-virtual inheritance; you would only use virtual inheritance if there MIGHT be another class which inherits from the same instance of A. In this case, for the dynamic type D, there is indeed an instance of class C which virtually inherits from the same instance of A as B did; if there wasn’t, B could just non-virtually inherit from A.

Notice how above I said there “MIGHT” be another class which inherits from the same instance of A. The issue is that B doesn’t know for a fact that there are, and even if it does, it doesn’t know which other classes. All these factors depend on the dynamic type, and cannot be known at compile time. If we follow the optimization of non-virtual inheritance, and merge the new virtual functions that B introduces with the virtual functions of A to form a new virtual function table, then it stands to reason that we must do the same with C. Now suddenly we have three classes’ worth of virtual functions in one virtual table. If we typecast D to type C, then call one of C’s functions, how will C know where to find its own function in the virtual function table? C doesn’t know that the dynamic type D also has an instance of B that added virtual functions into the same virtual function table. As such it will try to index into the virtual function table where it believes its own functions are, but instead there are B’s functions… or maybe C’s functions are actually there; that depends on which order the compiler places the functions in the virtual function table. Clearly this isn’t going to work out. It is for this reason that if a class derives virtually from a base while introducing new virtual functions of its own, it makes its own virtual function table with the new virtual functions, rather than merging with the virtual base class’s virtual function table.

Going back to our example of primary class A, B virtually inheriting from A, C virtually inheriting from A, and D inheriting non-virtually from B and C. If A introduces some virtual functions, it makes a virtual table and so has its own vfptr. If B doesn’t introduce new virtual functions, but only overrides virtual functions from A, then we don’t need a new vfptr; instead, the vfptr of A just points to a different virtual function table now. If C introduces some new virtual functions, it cannot merge with the virtual function table of A, so a new one is made, and C has its own vfptr pointing to this new table. Finally, if D doesn’t introduce new virtual functions of its own, then nothing more is needed. However, even if D introduces new virtual functions, an extra vfptr is not needed; instead, both of the previous virtual function tables are extended to support D’s virtual functions, creating new virtual function tables, and the two vfptrs point to those instead.

The total overhead here is two pointers worth for the two vfptrs. If however B were to be modified to add new virtual functions of its own, then a new virtual function table is made that cannot be merged with A’s. We would need a pointer to this virtual function table, and so B now has its own vfptr as well. In this case the overhead becomes three pointers worth.

To conclude our example, if we have the following class:

[Code sample: A, B, C and D with DoNothingA/B/C/D virtual functions]

A will have its own vfptr (which will point to a virtual function table that points to the DoNothingA() that B defines). B and C will both have their own vfptr, each pointing to a virtual function table that points to DoNothingB, DoNothingD and DoNothingC, DoNothingD respectively. B and C will also have a vbptr to help them locate A, as the position of A depends on the most derived type (in this case D). The total memory footprint of all this indirection will be five pointers’ worth.

 

If the above is still somewhat confusing, I recommend a quick look at Figure 15 from Microsoft’s patent on how their compiler handles inheritance. It doesn’t show multiple inheritance, but the virtual inheritance concepts are still highlighted.

[Figure 15 from the patent: layout of A1, B3 and C3]

Here we have A1 as the primary class, and it has two functions, fa11 and fa12. B3 inherits virtually from A1 and so has a vbptr. B3 overrides A1::fa11 and also introduces its own virtual functions fb31 and fb32, which means that it must have its own vfptr. Finally, C3 inherits non-virtually from B3. It overrides B3::fb32, thus modifying the virtual function table that B3::vfptr points to, and it also introduces its own new virtual function fc31. Since C3 non-virtually inherits from B3, C3::fc31 can simply be appended to the virtual table of B3.

 

 

Resources:

https://patents.google.com/patent/US5754862

Good and bad sides of virtual inheritance in C++

MemoryLayoutMultipleInheritance.pdf

Forward Declarations vs Header Includes & Circular Dependency Problems

A very common situation in C++ is that we write a header that requires access to a class that is declared and defined in a different header. In this case, there are two approaches: forward declaration and header includes. For the rest of this post, let us assume that we have two classes: A and B.

 

Forward Declaration

If you make a forward declaration, this effectively declares the existence of a class, but it does not define it. If a header file only knows about the declaration of a class, it cannot use the class in any way beyond holding pointers or references to it. Thus using forward declarations has limitations.

For example, if you are defining class A in a header file “ClassA.h”, then the definition of class A cannot have a member variable B m_B. This is because in order to figure out the size of class A, you would need to know the size of class B, which is impossible if you do not have the definition of class B. This is a problem because when the compiler looks at the definition of a class, it must be able to figure out the size of the class in order to compile. Due to this, the definition of class A can only have a pointer or reference to class B as one of its member variables. This is valid since the size of a pointer (and a reference, since they use pointers under the hood) is always the same regardless of what class it points to, and thus the compiler can properly figure out the size of class A from the definition of class A. In the same vein, you cannot call any of the functions of class B inside the header. That is, you cannot dereference B* in any way, because to dereference the pointer would be to access class B itself, which we do not have the definition for.

Finally, keep in mind that if you do use forward declaration, then you must include the class header inside the source file. That is, if the “ClassA.h” forward declares class B so that the definition of A may store a pointer to B, then the source file that has the implementation of class A’s methods must include “ClassB.h” in order to access the definition of class B. If the source file which holds the implementation for class A does not include the header which has the definition of class B, then we have the same problem as previously described. 

 

Using Includes

On the other hand, including “ClassB.h” in “ClassA.h” would give us access to both the declaration and the definition, and so there is no need to forward declare. This seems simpler, right? Anything you might need, you just include the header for in the header that you are currently writing, no problem! Well… there can potentially be consequences. One issue that might arise is compile times.

Consider that we are now writing a header that holds the definition of class C, called “ClassC.h”. The definition of class C has a member variable B* m_B. Let’s say we won’t bother to forward declare and include “ClassB.h” in the source file, but instead we include “ClassB.h” in “ClassC.h”. Now remember that class B’s definition also uses class A, and let’s assume that “ClassB.h” includes “ClassA.h”. Realize that if you include “ClassB.h”, you are also indirectly including “ClassA.h”. Suddenly class C depends on “ClassA.h”, and any change made to “ClassA.h” will force a recompilation of class C. This is rather wasteful, because nowhere in the definition or implementation of class C do we need class A, and yet we depend on it nonetheless. This issue is rather harmless for small projects, but in larger projects such negligence can lead to very large compilation times. Following this pattern, it’s possible we might end up having hundreds of classes which depend on hundreds of headers, and suddenly changing a single header will cause half the project to recompile.

The other issue, which even people working on smaller projects run into, is circular dependencies and circular includes. This topic is a longer one so we will dedicate a whole section to it.

 

Circular Dependency and Circular Includes

While designing our project, we might find ourselves in a situation where we have a circular dependency (this is very likely to occur in large projects; in the process of building my DX12 abstraction layer, I ran into this myself). Say class A has a member variable B m_B, and class B has a member variable A m_A. It should be immediately clear that we have a problem here: the definition of class A depends on the definition of class B, and the definition of class B depends on the definition of class A. This is impossible to compile. Think about it from the compiler's perspective; when attempting to compile one of these classes, it will say:

“Well, the size of the first class depends on the size of the second class, so what's the size of the second class?”

“Well, the size of the second class depends on the size of the first class, so what's the size of the first class?”

This is clearly a loop that cannot be resolved, and as such, compilation is out of the question. Based on what we discussed above, however, we can reason that if one of either class A or B instead used a pointer or reference to the other class, that would be enough to break up this circular problem. So let us alter the definition of class B to have the member variable A* m_A. Now we are on the right track, and there is at least some hope of compilation.

We will start by looking at what someone unfamiliar with the details of circular dependencies would do. They might have googled the circular dependency problem and understood that they need to fix it by using a pointer in one of the classes. After some research, they will likely write code that looks like the following:

Pic1            Pic2

Pic3

Satisfied with their work, they will go ahead and press the compile button… and I can hear it already, the sigh of frustration as they murmur “god damn compilers” when the following cryptic error presents itself:

Pic4

Let us assume that with some further googling of the error codes, or by sheer luck through trial and error, the programmer in question decides to add a forward declaration of class A in “ClassB.h” such that they have the following:

Pic5

They compile and huzzah! Victory! They believe they have found the secret: “you just need a forward declaration when the compiler complains”. As is normal during a project, they have other work to do, so understanding why this works is not of the utmost urgency and they move on. They loosely reason that the compiler probably didn't understand the type of the pointer for some reason, that error C2143 was talking about a pointer, that this in turn led to the other errors, and that everything is now resolved because the forward declaration fixes the type of the pointer.

Some time passes and everything is fine. Then one day, the programmer must once again include “ClassA.h” and “ClassB.h”. This time around, they write the following:

Pic6

They go ahead and press compile… and I can hear it already, the classic sigh of frustration followed by a murmur of “who even designs these stupid compilers” as they are greeted with the following error:

Pic7

They are confused. This was working just fine in the past, and on top of that, how can the definition of class A not know what the type B is? “ClassA.h” literally includes “ClassB.h”; surely it knows what class B is.

Let us look at what is happening here. Our source file which is being compiled starts by including “ClassB.h”, so the preprocessor will directly take the contents of “ClassB.h” and insert it into the source file. “ClassB.h” then has the directive to include “ClassA.h”, and so the preprocessor does this. “ClassA.h” then has the directive to include “ClassB.h” but the preprocessor stops there, because there is a #pragma once directive, and “ClassB.h” has already been included. So the compiler moves on with whatever has been inserted into our source file so far, and at the top of all this, is the contents of the header for class A (because “ClassB.h” was the first include in the source file, which in turn includes “ClassA.h” first and there are no further includes). The issue is now clear, the very first definition that the compiler is trying to compile is the definition for class A, but the definition for class A depends on the definition of class B, which does not yet exist. Even if “ClassA.h” includes “ClassB.h”, the #pragma once directive that served to break up the infinite loop means that the include for class B was ignored, and now the definition for class A does not in fact have access to the definition for class B.

The reason this worked the other way around (when the source file included “ClassA.h” first, and then “ClassB.h”), was because if “ClassA.h” is included first, we actually end up defining class B first due to the order of includes, and defining class B is perfectly valid as long as we just have a forward declaration of class A, since class B’s member variable is a pointer to an instance of A, and not an instance of A itself.

We can see that this whole configuration leads us to a very peculiar error that fundamentally depends on the order of includes. If your project uses just two header files like in this example, this isn't too much of an issue; you could try to remember that you must always include A before you include B. However, if the project is even slightly larger, this can start to get complicated, and in very large projects it could cause outright chaos. Having a project whose successful compilation depends on the correct order of header inclusion is a very bad situation to be in.

What is the solution to prevent this include order dependency? Well, we should only actually include headers for classes whose definition we need. Class B only uses a pointer to class A in its definition, and so a forward declaration is sufficient and we can remove the #include “ClassA.h” directive.

In this case, if we include “ClassB.h” first, then there is no directive in “ClassB.h” to include “ClassA.h”, and so class B gets defined first (whereas before, class A was being defined first due to the include directive). If class B is defined first, then there are no compilation issues, and that is the key. In general, we always want to first define the class which does not need the definition of the other class, and then we will not have any issues with our circular dependencies.

To conclude, as a general rule, for both compile-time and circular-dependency reasons, you ideally want to forward declare in situations where a class does not need the definition of another class, and then include the header with the definition in the source file. Additionally, make an effort to only have the definition of a class depend on the definition of another class when it is really needed; otherwise consider just using a pointer or reference. Hopefully this post helps somebody who was as lost as I once was!

Present Latency, DWM and Waitable Swapchains

This is not the kind of blog post I will typically be writing; however, I have lately been tinkering with the DXR API as I intend to implement path tracing (and maybe bidirectional path tracing :O ) using the new API on the GPU. In the process I found myself on a tangent about the different ways to get contents from the GPU to the display, and the pros and cons of each. This post will talk about present latency, fps, the Desktop Window Manager, and waitable swap chains. I will not go into detail about the impact of multiple backbuffers in the swapchain, nor the impact of having several command allocators, as the post would be too long.

How do monitors work?

Monitors update in a scanline (LCDs included). This is done by streaming data over the display link (DVI, HDMI, or DP), and as the digital data arrives at the monitor, the monitor uses its DAC (Digital to Analogue Converter) to produce the analogue signal that drives the panel. Once a pixel is given a new analogue signal, the response time of the pixel determines how fast it will transition to the newly provided color. In an LCD display, pixels never turn off; they retain their previous color until a new color signal is issued to them. We can see the scanline effect in the following video of an LCD display captured in slow motion:

https://youtu.be/3BJU2drrtCM?t=269

To read about how the actual analogue signal produces an image on the monitor, you may read the following article:

https://www.rtings.com/tv/learn/lcd-vs-led-vs-plasma/how-they-work

How do we get information to the monitor?

This is the interesting part that I found myself investigating a lot after working with DX12, and is what this post will explore in depth. Before we continue, it is best to get some terminology out of the way:

Backbuffer: A backbuffer is a buffer that serves as the render target of a GPU. An image is rendered into a backbuffer, and the contents of that are then used to bring a picture to the monitor. An application can have several backbuffers.

Frontbuffer: A frontbuffer is the specific backbuffer that is currently being used as the resource for which the monitor will pull image data from. Only one of the backbuffers can be a frontbuffer at any given moment.

SwapChain: A swapchain is the graphics API object that holds the backbuffers. The key thing to note about a swapchain is that calling SwapChain::Present will queue up a backbuffer to act as a frontbuffer. Thus, in order to get a backbuffer's contents onto the display, we call the Present function. The call to this function by the CPU is asynchronous. For this blog post we will assume a flip-model swapchain, i.e. two or more backbuffers that the swapchain will flip between when presenting.

CommandQueue: A command queue holds a series of instructions for the GPU to run. The GPU will run these instructions in the order that they exist on the command queue. When a swapchain is made, it must be associated with a command queue; the Present function will actually queue instructions into the command queue that the swapchain is associated with. It is possible to have several command queues, but that is out of the scope of this post; for now let us assume a single primary command queue.

CommandList: A command list records a list of commands that are to be queued to the GPU. The CPU can build a command list at any point using any thread, and then can submit the command list to a command queue using the function ExecuteCommandLists. Note that ExecuteCommandLists is asynchronous, that is, none of this work is done on the GPU until the GPU is ready to consume instructions from the command queue that the command list was queued to execute on.

CommandAllocator: A command allocator is the memory that backs a command list. A command list must be coupled with a command allocator when it is reset. A command allocator can be associated with many command lists over its lifetime, but only one at any given moment. It is also unsafe to reset a command allocator while its instructions are still pending in a command queue, as the command queue reads the instructions from this memory, and modifying it is unsafe.

With the above terminology out of the way, we can discuss the rendering process a bit. An application will use the CPU to build a command list associated with a command allocator. It will then call ExecuteCommandLists using that command list and the command queue. Finally, the CPU will also call Present on the swapchain associated with that command queue. Notice that both of these operations are asynchronous, so no GPU work has begun yet. The CPU has simply told the command queue to run the command list and then present the contents.

Assuming there is no prior work on the GPU, the GPU can commence running the instructions immediately, until it finally reaches the point where it executes the Present. This is where there are several possible courses of action. In the simplest case, where we assume the monitor is streaming data directly from the frontbuffer (which is one of our application's backbuffers), the present call can flip which backbuffer is the frontbuffer, and the monitor will now stream from this new buffer. The moment the GPU completes a frame, the monitor will pull data from that buffer, regardless of what scanline the monitor is on. In this very simple case we will have the minimum possible input latency (from here on out, I will refer to input latency as present latency, as input latency has other external factors associated with it), but we will have screen tearing. The following link does a good job of showing this kind of tearing: https://www.anandtech.com/show/2794/2

The alternative is that we have some synchronization going on. For this post I will ignore adaptive vsync (such as FreeSync or G-Sync) and will focus on vsync. Enabling vsync in an application controls when the swapchain can actually flip the frontbuffer. In the unthrottled case we could flip the frontbuffer whenever we wanted; however, we can be smart about this and only flip the frontbuffer during a vertical blank (VBLANK), in which case there will be no tearing. This is because the frontbuffer is never flipped while the monitor is streaming data from it, and thus the first scanline of each displayed frame will start with a fresh frontbuffer. We can reason that any kind of waiting will obviously incur present latency. Let's say I call present, and the VBLANK is 10ms later; the data that I am presenting is now 10ms old.

The impact of vsync on present latency will depend on whether the application is in windowed mode or full screen, and also on whether the swapchain is initialized as a waitable object (more on this later).

Windowed vs Fullscreen

In Windows Vista/7/8/10 there exists a process called the “Desktop Window Manager” (referred to as DWM from now on). The DWM is a process that takes the backbuffers presented by all the running applications and composites them together into a final image that will be placed in a frontbuffer. In this case, the applications' backbuffers do not directly turn into frontbuffers; instead, backbuffers are fed to the DWM. This composition allows Windows to do various visual effects, like layering applications on top of each other, giving a preview popup when you hover over an open application in the taskbar, giving the overlay for alt-tabbing, etc.

The way the DWM works in practice is that all the applications present backbuffers, and then after the vsync, DWM will pick up the most recently completed backbuffers from all the applications and use the next frame interval to composite everything into its own backbuffer. Then, during the next VBLANK, the DWM will present its own backbuffer, which will become the frontbuffer the monitor consumes from. I would like to emphasize that in this case, the DWM is itself vsynced. The DWM itself has a swapchain, and it will only flip its backbuffers around during the monitor's VBLANK, so there is no tearing. Also notice that with the DWM, there are two sources of latency. One source is the fact that flips only occur during a VBLANK, so there is a waiting period after DWM has completed the composited image. The other source is the fact that DWM does not pick up the backbuffers that applications present until a VBLANK, and only after that does DWM begin compositing. Here is an example:

T=0ms, Application builds command list

T=1ms, Application executes command list in a command queue

T=3ms, GPU is finished rendering into backbuffer and frame is ready to be consumed

T=17ms, at this VBLANK, DWM wakes up, takes the latest backbuffer and begins composition

T=18ms, DWM is done compositing and the frame is complete in its backbuffer

T=34ms, at this VBLANK, DWM can flip the backbuffer it previously completed to become the frontbuffer that the monitor will stream data from

We can see based on the above timeline that in this case, the presence of DWM added ~31ms of latency. In contrast, if we were operating in exclusive full screen mode without vsync, then at T=3ms the application would have been able to flip the backbuffer to become the frontbuffer. This is because in exclusive full screen mode, DWM is bypassed and the application's backbuffers directly become frontbuffers, and the application is free to flip the buffers whenever it wants (keep in mind that this causes tearing, whereas with DWM it is not possible to have tearing).

Notice how in the above scenario, if the first two events took a combined time of 16ms (that is, the GPU completes rendering into the backbuffer at T=16ms), then the latency added by DWM is ~17ms, which corresponds to one frame interval on a 60hz monitor. Thus we can say that at minimum, DWM adds one frame's worth of latency. In practice, it will likely be slightly more. Usually a game is running at an fps higher than the refresh rate of the monitor. In this case, it is important to point out that DWM will pick up the most recently presented backbuffer that has been completely rendered into by the GPU. Let us look at a sample timeline:

T=0ms, Application builds command list 1

T=1.5ms, Application executes command list 1 in a command queue

T=4.5ms, GPU is finished rendering into backbuffer and frame is ready to be consumed

T=4.5ms, Application builds command list 2 for a new frame

T=6ms, Application executes command list 2 in the command queue

T=9ms, GPU is finished rendering into backbuffer and frame is ready to be consumed

T=9ms, Application builds command list 3 for a new frame

T=10.5ms, Application executes command list 3 in the command queue

T=13.5ms, GPU is finished rendering into backbuffer and frame is ready to be consumed

T=13.5ms, Application builds command list 4

T=15ms, Application executes command list 4 in the command queue

T=17ms, at this VBLANK, DWM wakes up, takes the latest backbuffer with the render of command list 3 and begins composition

T=18ms, GPU is finished rendering into backbuffer and frame is ready to be consumed

T=18ms, DWM is done compositing and the frame is complete in its backbuffer

T=34ms, at this VBLANK, DWM can flip the backbuffer it previously completed to become the frontbuffer that the monitor will stream data from

In the above situation, command list 4 was queued to do work on the GPU; however, the GPU was not done rendering that frame in time for the DWM to pick it up, and so the frame produced by command list 3 is used instead. Command list 3 was built at 9ms and finally made it on screen at 34ms, so we observe a latency of 25ms.

Using a tool called PresentMon, I took a capture of World of Warcraft running in DX12 with windowed mode:

WoW DWM Latency

In this case, composed flip means using a flip swapchain where DWM is compositing the results to produce the frontbuffer. At ~144 fps we have ~24ms of latency (my monitor is 60hz). This is in line with the example above: at minimum, the present latency is the sum of a frame interval and the time it takes to produce the frame. I say at minimum because it can be slightly more, depending on how close to the VBLANK the frame finishes. If a frame takes 5ms to produce and completes 4ms before the VBLANK, it will still be the frame that DWM picks up, but it will already be 9ms old. On the other hand, if the frame finishes 0.1ms before the VBLANK, it will only be 5.1ms old when DWM picks it up. For that capture of WoW, it took 6.93ms to produce the frame and 16.67ms for a frame interval, giving 6.93 + 16.67 = 23.6ms, which is slightly short of the reported 24.35ms.

To further visualize this (and further discussion in this post), we will use a tool by Intel: https://software.intel.com/en-us/articles/sample-application-for-direct3d-12-flip-model-swap-chains

In this tool, the bottom row is the CPU timeline for frames, the second row from the bottom is the GPU timeline, the rows above that are the present queue (there can be several rows here), and the topmost row is what is actually being displayed on the monitor. The following image is a capture of the tool simulating an experience at 240fps:

Intel Windowed DWM

There is a lot going on in this capture, but we need only focus on a few things. Each vertical black line represents a VBLANK. If we look at the leftmost side of the image, we see the CPU builds a command list (the leftmost one is red), then the GPU executes that command list, and then the frame is queued into the present queue. In the leftmost interval, the frame at the top of the present queue is blue; this is a frame that was built by DWM before the first VBLANK of this screenshot, and is what will be displayed on the monitor after the VBLANK. We can see that whatever was rendered into the red backbuffer is discarded (triangles in the present queue mean the contents were discarded). The reason for this is that a more recent frame is built in the blue backbuffer. This blue frame is also discarded, because the framerate is so high that another frame is completed in time in the red backbuffer (the swapchain is alternating between the two backbuffers). We see that the CPU builds a blue frame before the VBLANK, but the GPU is not done executing the work for this command list before the VBLANK, and so the DWM consumes the red backbuffer upon waking up. You can notice that the red backbuffer is now the one at the top of the present queue in the second interval, and will be on screen for the third interval. This whole process yields a reported present latency of ~25.5ms (it is an average), and is a fairly accurate depiction of what happened in the World of Warcraft capture.

For completeness, it is worth showing what the situation looks like when running in borderless windowed mode and bypassing the DWM. Before showing pictures, I would like to introduce some more terminology. If an application is covering the entire viewport, then DWM (at least in Windows 10 with DX12; I am unsure about DX11) is able to recognize this and engage a mode called “Independent Flip”. Notice how earlier, PresentMon reported that we were running in “Composed Flip”. Independent Flip is a mode in which DWM does not composite anything; the backbuffers of an application directly become the frontbuffers that the monitor reads from, and all the DWM does is execute the flip during the VBLANK. In Independent Flip, an application is capped at a frame rate that is a multiple of the refresh rate. The reason for this is that once an application queues up a present, the backbuffer now belongs to the monitor and cannot be rendered into. Furthermore, whatever backbuffer is currently the frontbuffer belongs to the monitor for the entire frame interval and also cannot be rendered into. Thus, assuming we have 2 backbuffers in the swapchain, if one of them is held by the monitor to stream from, then we can only render into the remaining backbuffer, and the moment we queue the present with that backbuffer, we cannot use it either. Thus with 2 backbuffers, the maximum fps is 60. This scales with the number of backbuffers, and 3 backbuffers gives a maximum fps of 120.

I won't go into further detail on Independent Flip, because in practice games do not really want to be using this flip mode. Instead, it is best to use a mode called “True Immediate Independent Flip”. This is a mode that doesn't just remove compositing, but removes the DWM entirely. That means the application itself is responsible for flipping the frontbuffer, which allows it to present at an unthrottled rate (which is of course vulnerable to tearing). To run this mode, the swapchain must be initialized with tearing enabled, and the call to present must have the flag DXGI_PRESENT_ALLOW_TEARING. This flag has a value of 512, and so we can see above in the PresentMon capture of WoW that they do in fact allow the application to enter True Immediate Independent Flip. We can confirm this with a capture of WoW in full screen using PresentMon:

 

WoW True Immediate Independent Flip: World of Warcraft in True Immediate Independent Flip mode

 

We see that the PresentMode reported is Independent Flip, and the PresentFlags value of 512 indicates that the application can run in True Immediate mode. In this case the MsUntilDisplayed is the present latency, and we can see that it matches the MsUntilRenderComplete column (directly to the left of MsUntilDisplayed) perfectly. This means that as soon as a frame is complete, we put it on screen, and so the only present latency is the latency of actually rendering the frame itself. This is the absolute minimum possible latency that an application can achieve. This same latency was possible with full screen exclusive (different from borderless windowed); however, alt-tabbing from an FSE (full screen exclusive) application is usually very slow. True Immediate Independent Flip brings the best of both worlds, where we have the minimum present latency possible while still being able to alt-tab quickly. It's worth noting that not all games engage Independent Flip when the application covers the viewport; I am not sure about the exact criteria (it might require a certain version of DirectX). An example of this is Overwatch. When running Overwatch in borderless windowed mode, PresentMon captures the following:

 

OW Borderless Windowed: Overwatch in borderless windowed (DWM composition)

 

I know that Overwatch runs on DX11, but I do not know more than that. However, it is clear that we are not using hardware composition, and the DWM is still in charge (Copy with GPU GDI means we are using a blit model, not covered in this post and not possible in DX12). We see that a bunch of frames are discarded (dropped), and DWM will only pick up the most recent one to display on the monitor. The present latency hovers between 28ms and 40ms, which is around what we would expect based on the earlier findings. Running Overwatch in FSE gives us the expected results:

 

OW FS: Overwatch in full screen exclusive

 

Waitable Swapchains

SwapChain::Present is asynchronous on the CPU; however, it is serial on the GPU and is only triggered after the commands before it in the command queue are finished. The moment the present is executed on the command queue, the waitable swap chain is alerted of this, and the event that the CPU was waiting on is triggered. Likely this is implemented with fences behind the scenes, but I have yet to find conclusive information on the implementation of Present, so I won't go into further detail. There is one extra parameter of a waitable swap chain that must be specified: the maximum frame latency. Maximum frame latency is a value that describes how many presents the waitable swap chain can have queued up at once.

 

Waitable Swapchain – Windowed

Let us look at a windowed case using waitable swap chains. In the Intel application, a yellow bar in the CPU timeline means that the CPU is waiting on the swap chain event that signals that the GPU has reached the present instruction in the command queue. The following is such a capture:

Waitable SC 1

Here, when the CPU builds the command list for a given backbuffer, it waits until the GPU is done rendering into that backbuffer, and only then builds the command list to render into the next backbuffer. This is because after building a command list, the CPU waits on the swapchain, and only once the GPU finishes rendering that command list does the command queue signal and trigger the swapchain event. Notice that we can render into the backbuffer associated with the frame that is currently displayed on the monitor. This is because we are running in composite mode, and the buffer the monitor is holding is actually the buffer that DWM composites into, and thus is not the same as the backbuffer that we write into.

Disabling the waitable swap chain in favor of a regular swap chain leads to increased present latency. This is because the moment that the GPU is done rendering into a backbuffer, the CPU builds the command list for rendering into that SAME backbuffer, whereas in a waitable swap chain, the CPU at this point would build the command list for rendering into the NEXT backbuffer (this is important to understand, think about this sentence and look at the above image and below image until you agree). The backbuffer will still take the same amount of time to become available (it must go through the DWM which will use it to composite and then release it), however now the CPU has built the command list for it earlier than in the case of a waitable swap chain. As the command list was built earlier, it is more outdated by the time the results are on screen, and so present latency will go up. The below image shows this (Blue indicates time the CPU is waiting on the GPU):

Waitable SC 2

Clearly a waitable swap chain in this case has less present latency. However, nothing is for free. By waiting on the swap chain, and having a maximum frame latency of 1, we have reduced some of the parallelism between the CPU and GPU. You may notice how in the waitable swap chain case, the work between the CPU and GPU looks very sequential, and that despite having a GPU fps of 77 and a CPU fps of 102, the total fps is 45. This is because the two devices are forced to run almost serially due to the waiting. This is in contrast with the case where we do not use a waitable swap chain, which has an fps matching the GPU fps (which is where the bottleneck is). In this case we are trading fps for present latency.

We can regain this fps lost in the waitable swap chain case by using a maximum frame latency of 2, however we pay back for this with present latency:

Waitable SC 3

In the case of a maximum frame latency of 2, we are allowed to have 2 built frames before we must wait on the swap chain. This means that we are effectively able to buffer an extra frame, and the CPU and GPU are no longer working serially. However, buffering in this manner is what introduces the extra present latency compared to a maximum frame latency of 1. This should be familiar: engineering is always about tradeoffs, and very rarely is something free. The question now becomes, in what cases might we benefit from a waitable swap chain? Because if there were no difference from a regular swap chain, it would be a rather useless addition. Based on the above, we can reason that if the combined work time of the CPU and the GPU produces an acceptable fps, then the tradeoff is likely worth it. Most users have monitors with a 60hz refresh rate, and producing fps higher than this is a complete waste unless it contributes to reducing present latency. In the case where the application can present at an unthrottled rate and tearing is allowed, higher fps results in reduced present latency. However, when playing in windowed mode, the DWM is responsible for presenting to the screen, and so higher fps is not always the best approach for minimizing present latency.

Consider the case below:

Waitable SC 4

Here we present at an unthrottled rate, the CPU has little work and most of the work is on the GPU, which is running at an fps of 91. The total fps is 87 and the present latency is 46ms. In this case, clearly the combined total of the GPU and CPU work is still less than ~17ms, and so perhaps this is a situation where a waitable swap chain might net us some gains.

Enabling a waitable swap chain gives us:

Waitable SC 5

We see that indeed, we can trade a bit of fps here and profit in the form of present latency. The loss in fps here will not be noticeable on a 60hz monitor and in turn the application is slightly more responsive. For a more practical situation worth analyzing, we can consider when a user is running the application with vsync. Likely the application will now be locked to 60 fps (on a 60hz monitor), but the time it takes to do the combined CPU and GPU work is probably less (otherwise there is no reason for the user to enable vsync).

With vsync on but without a waitable swap chain, we might see an application perform like the following:

Waitable SC 6

Here we can see that the CPU builds a command list significantly before the GPU has the buffer available to work in. The top row of the present queue is the point at which the DWM is holding the backbuffer for compositing; when it is released, the backbuffer is free for the GPU to render into (notice that whichever color buffer is being displayed on the monitor is the buffer the GPU will render into for that frame interval, because the DWM just released that buffer on the previous VBLANK). We can see clearly that the combined work of the CPU and GPU will not exceed the 16.66 ms budget we must adhere to. On the other hand, the present latency is quite high, because the CPU builds the command list well before the GPU has the resources available to do the work. This sounds like the ideal case to benefit from a waitable swap chain! Let’s take a look:

Waitable SC 7

Clearly, with no loss in frame rate, we obtain much better present latency. This makes sense: the command list for a given backbuffer is built in the frame interval directly before the compositor accesses it, and thus holds more recent information.

 

Waitable Swapchain – Borderless Window Full Screen

Finally, it is worth looking at the case of borderless window full screen, which is how most gamers are likely to run the application. In this case, the present mode is Independent Flip, and the DWM is no longer compositing anything. We get the following result:

Waitable SC 8

We have a present latency of 40 ms in this case, which is rather high for most real-time applications. Enabling a waitable swap chain gives us:

Waitable SC 9

The present latency here is drastically better. It is clear that on hardware that can run the application above 60 fps with vsync on, a waitable swap chain is a significant win. In fact, I believe this is the shortest present latency possible with vsync enabled (without vsync, the present latency is simply the time to build the frame, as we established above). You can also clearly see here that the DWM compositor is no longer involved: a backbuffer is displayed on screen at the VBLANK after it is completed by the GPU, whereas composition would consume an extra frame before displaying it on screen.

For completeness, however, it is important to point out that using a waitable swap chain with a maximum frame latency of 1 can leave you vulnerable to frame time spikes, and in certain cases can even do a lot of harm. Consider the following example:

Waitable SC 10

Remember that the CPU and GPU do their work sequentially in the case of a maximum frame latency of 1. First the CPU builds the command list, then the GPU renders it; if the GPU does not finish in time for the coming VBLANK, even just barely missing it, then the contents of the backbuffer must wait for the next VBLANK before making it on screen. We see this above, where some frames last on the screen twice as long as others. This is because the total time of the CPU and GPU work took slightly longer than 1/refresh rate, so a VBLANK is missed completely. In cases like this, the effective fps is halved: we see above that the GPU fps was 76 and the CPU fps was 170, but the total fps is 30. This will also cause the present latency to double, which is counterproductive to our goal in using a maximum frame latency of 1. In this case a maximum frame latency of 2 will be better:

Waitable SC 11

For the same present latency, we get double the fps, which will of course feel like a smoother experience to the user.
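The missed-VBLANK effect is easy to reproduce with a small model (again a simplification of my own): with vsync, a finished frame can only reach the screen on a VBLANK, so a frame's cost is rounded up to a whole number of refresh intervals.

```python
import math

# With vsync, a frame's cost in refresh intervals is rounded up to the
# next whole interval, so barely missing the budget halves the frame rate.
def effective_fps(cpu_ms, gpu_ms, refresh_hz=60.0, max_frame_latency=1):
    interval_ms = 1000.0 / refresh_hz
    # Max frame latency 1: CPU and GPU run serially per frame.
    # Max frame latency 2: they pipeline, bounded by the slower of the two.
    work_ms = cpu_ms + gpu_ms if max_frame_latency == 1 else max(cpu_ms, gpu_ms)
    intervals = math.ceil(work_ms / interval_ms)  # VBLANKs consumed per frame
    return refresh_hz / intervals

# GPU at 76 fps (~13.2 ms) and CPU at 170 fps (~5.9 ms), as in the figure:
print(effective_fps(cpu_ms=5.9, gpu_ms=13.2, max_frame_latency=1))  # 30.0
print(effective_fps(cpu_ms=5.9, gpu_ms=13.2, max_frame_latency=2))  # 60.0
```

The serial total of 19.1 ms just misses the 16.66 ms budget and eats two intervals per frame; pipelining brings the per-frame cost back under one interval.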

 

Essentially, there is no one setting to rule them all. At worst, a waitable swap chain with a maximum frame latency of 2 will behave as well as no waitable swap chain. A waitable swap chain with a maximum frame latency of 1 is better if the user’s machine can render much faster than the monitor’s refresh rate. If the user does not care about tearing and will not use vsync, then for borderless window full screen a waitable swap chain is not needed at all. As mentioned earlier in this post, in True Immediate Independent Flip the application can present as fast as possible with no throughput penalty, which looks like this:

Waitable SC 12
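The rules of thumb above can be collected into a small decision sketch. To be clear, this is a hypothetical helper: the function name and the 1.5x headroom threshold are illustrative choices of mine, not part of any real API.

```python
# Hypothetical decision helper summarizing the takeaways; thresholds
# and names are illustrative only.
def suggest_swap_chain_settings(tearing_ok, vsync_on, fps_headroom):
    """fps_headroom: ratio of achievable fps to the monitor's refresh rate."""
    if tearing_ok and not vsync_on:
        # True Immediate Independent Flip: present as fast as possible.
        return {"waitable": False, "max_frame_latency": None}
    if fps_headroom > 1.5:
        # Plenty of headroom: trade a little fps for minimum present latency.
        return {"waitable": True, "max_frame_latency": 1}
    # Little headroom: latency 1 risks missed VBLANKs and halved fps.
    return {"waitable": True, "max_frame_latency": 2}

print(suggest_swap_chain_settings(tearing_ok=False, vsync_on=True, fps_headroom=2.0))
```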

Conclusion

If the user does not care about tearing, or has an adaptive refresh rate monitor with a GPU that supports it, True Immediate Independent Flip is the best approach, as it delivers the minimum present latency. If the user cares about tearing and will turn on vsync, then a waitable swap chain might present some gains, but care must be taken in determining the maximum frame latency, as it is possible to drop the user’s fps with no gains in present latency (the case where we had 30 fps instead of 60 fps due to missed VBLANKs). We have also pointed out the impact of the Desktop Window Manager on present latency, and hopefully made its behavior intuitive via the graphical aids created with Intel’s application. What we did not cover was the impact of frame count (the number of command allocators the CPU can build command lists into), nor the number of backbuffers available to the swap chain, and how those things impact present latency and fps. We also did not cover the notion of Triple Buffering vs. a Render Ahead Queue (they are different things and are often confused with each other). Perhaps the topics left uncovered will find their way into a future blog post.

The Rendering Equation and BRDFs

The Lambertian Diffuse BRDF is one of the most commonly used BRDFs in computer graphics, as it is very cheap to evaluate while providing decent results visually. Most people who have dabbled in computer graphics or written shaders will know the BRDF as 1/\pi.

The aim of this blog post is to explore this 1/\pi further, consider its physical implications, and provide insight into how it was derived, to build a better understanding of BRDFs in general.

Before proceeding further, a general recap of radiometry is necessary. The intent with BRDFs is to simulate light interaction, and so it pays off to think of this interaction in terms of physical quantities.

  • Radiant flux, denoted by \Phi_{e}. Radiant flux is the radiant energy emitted, reflected, transmitted or received, per unit time. It has units of J s^{-1}, also known as watts (W).
  • Radiant intensity, denoted by I_{e, \Omega}. Radiant intensity is the radiant flux per unit solid angle. As such, it has units W sr^{-1}.
  • Irradiance, denoted by E_{e}. Irradiance is the radiant flux received by a surface, per unit area. As such it has units W m^{-2}.
  • Radiance, denoted by L _{e, \Omega}. Radiance is the radiant flux emitted, reflected, transmitted or received by a given surface, per unit solid angle per unit projected area. As such it has units W sr^{-1} m^{-2}.

The above, and the rest of this post, assumes you have a working understanding of what solid angles are; if not, it is recommended to read about those first.
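A quick numeric example tying these quantities together (an illustrative toy of my own, with arbitrary numbers): an isotropic point source spreads its radiant flux over the full sphere of 4\pi steradians, and the irradiance on a small surface facing it falls off with distance squared.

```python
import math

# Toy radiometry for an isotropic point source (illustrative numbers):
# flux -> intensity (divide by the full sphere's 4*pi steradians),
# intensity -> irradiance (divide by distance squared, surface facing source).
flux_w = 100.0                         # radiant flux, W
intensity = flux_w / (4.0 * math.pi)   # radiant intensity, W/sr
irradiance = intensity / 2.0**2        # irradiance at r = 2 m, W/m^2

print(intensity)   # ~7.96 W/sr
print(irradiance)  # ~1.99 W/m^2
```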

 

Armed with the above information, let us now consider what a BRDF actually is. To do this, it helps to look at the rendering equation:

L_{o} = L_{e} + \int_\Omega L_{i} \cdot f_{r} \cdot cos\theta_{i}  \, \mathrm{d} \omega_{i}       Eq.1

L_{o} is the outgoing radiance in the direction o. L_{e} is the outgoing radiance emitted by the surface itself. The last term is the integral of the incoming radiance, weighted by the BRDF and by the cosine of the angle between the surface normal and the incoming direction. The integral is over the hemisphere \Omega oriented around the surface normal.

The integral basically says: collect the incoming radiance from all directions visible to the surface, which is why we integrate over the hemisphere \Omega. Remember that we said the units of radiance are W sr^{-1} m^{-2}. If we look at the measure of the integral, \mathrm{d} \omega_{i}, it has units sr. If we multiply these two units together, the sr cancels out and we are left with W m^{-2}, which we previously said are the units of irradiance. This should make intuitive sense: the aim of this integral is to collect all the radiant flux incoming to this surface, which has some finite area. Now, why does the cos\theta_{i} term exist? Let us look at the following image:

Untitled Diagram

Assume the green surface has an area of A, and imagine a beam of photons being emitted perpendicular to it. The area of the black surface that receives these photons is much larger when the black surface is slanted relative to the beam than when it faces the beam head-on. If the green surface is emitting a constant radiant flux, then at grazing angles the same radiant flux is spread over a larger receiving surface area, and thus we have less radiant flux per unit area. This can also be thought of in another way: given a fixed-size receiving surface of area A, the apparent size of A from the light's perspective changes based on the angle between the light direction and the receiving surface normal. At grazing angles, the receiving surface appears smaller to the light, by a factor of cos\theta_{i}, and in L_{i} the m^{-2} component is the area of the receiving surface as seen by the light in the direction i.

A notion that might help make this intuitive is thinking about what happens with a torch. If you point the torch directly perpendicular to the floor, the light received by the floor appears bright. If instead you point it at a grazing angle, the light smears across a larger surface area and appears dimmer.

We can see all of the above reflected in the standard units of radiance: radiant flux per unit solid angle per unit projected area. We note that L_{i} is the incoming radiance as seen when the normal of the receiving surface is parallel to the incoming direction i. However, as shown, at grazing angles the incoming radiance is weaker as perceived by the receiving surface. We account for this with cos\theta_{i}.

Knowing all this, we can say the following:

E_{e} = \int_\Omega L_{i}  \cdot cos\theta_{i} \, \mathrm{d} \omega_{i}      Eq.2

The irradiance is given by integrating over all incoming radiance, weighted by the projected area of the receiving surface in the direction of each incoming light direction. This makes physical sense: we collect all incoming radiant energy per unit area over the hemisphere, and this gives us the total radiant energy received per unit area.
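We can sanity-check Eq.2 numerically for the simple case of constant incoming radiance L over the hemisphere, where the integral should evaluate to \pi L (a small midpoint-rule sketch of my own, not production code):

```python
import math

# Numerical check of Eq.2 for constant incoming radiance L:
# E = integral over the hemisphere of L * cos(theta) d_omega = pi * L,
# using spherical coordinates with d_omega = sin(theta) dtheta dphi.
def irradiance_constant_radiance(L, n=2000):
    total, dtheta = 0.0, (math.pi / 2.0) / n
    for i in range(n):
        theta = (i + 0.5) * dtheta  # midpoint rule over theta
        # The phi integral of a phi-independent integrand is just 2*pi.
        total += L * math.cos(theta) * math.sin(theta) * dtheta * 2.0 * math.pi
    return total

print(irradiance_constant_radiance(1.0))  # ≈ pi
```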

The last term in the rendering equation is the BRDF, which was denoted by f_{r}. The BRDF can be described mathematically as the ratio of differential outgoing radiance, to the differential irradiance:

f_{r} = \frac{\partial L_{o}(\omega_{o})}{\partial E_{e}(\omega_{i})}     Eq.3

Differential radiance can be thought of as the infinitesimal quantity of radiance reflected in a very small solid angle around the outgoing direction. Differential irradiance can be thought of as the infinitesimal quantity of irradiance received from a direction in a very small solid angle around the incoming direction.

We can rewrite differential irradiance as:

{\partial E_{e}(\omega_{i})} = L_{i}  \cdot cos\theta_{i} \, \mathrm{d} \omega_{i}     Eq.4

And then we can substitute Eq.4 into Eq.3 for differential irradiance, and then rearrange the terms to get the following:

{\partial L_{o}(\omega_{o})} = f_{r} \cdot L_{i}  \cdot cos\theta_{i} \, \mathrm{d} \omega_{i}     Eq.5

Eq.5 gives us the differential radiance, and clearly if we integrate this over the hemisphere, we will get the total radiance in the outgoing direction, which brings us back to Eq.1 (ignoring the emitted radiance of the surface, as that is not relevant to the BRDF).

Thus we see that the BRDF relates irradiance to outgoing radiance. It is worth mentioning that radiance and irradiance differ in units by a factor of sr^{-1}, which means the BRDF must have units sr^{-1}. The BRDF typically depends on both the incoming direction and the outgoing direction, so we can write it as f_{r}(\omega_{o}, \omega_{i}).

The key property of BRDFs is that they must be energy conserving. We cannot reflect more radiant flux than we receive, as that would violate the law of energy conservation, and our BRDF must obey it too if we wish for it to be physically plausible. Let us consider a simplified case of incoming light from a single direction.

Assume we have the differential irradiance L_{i}(\omega_{i}) \cdot cos\theta_{i} \, \mathrm{d} \omega_{i}. If light is only coming from a single direction \omega_{i}, and we also assume some constant area for the receiving surface, then the entire expression becomes a constant: radiance times a unit of solid angle gives irradiance, and multiplying by the receiving area leaves us with watts.

Now let us look back at what BRDFs actually are. Earlier we said that mathematically they are the ratio of differential outgoing radiance to differential irradiance. We can describe this physically in another way: Walter et al. 2007 describe the BRDF as the “fraction of incident energy scattered in a mode”; I will extend it further to say “fraction of incident energy scattered in an outgoing direction, weighted by the inverse of the projected area in the outgoing direction”. What I mean is: a bunch of energy is incident onto our receiving area, and based on the properties of the receiving surface, different quantities of energy are scattered across all outgoing directions of the hemisphere. However, energy reflected in an outgoing direction is radiant intensity, while the BRDF must produce radiance, and thus it must also account for the projected area of the surface in the outgoing direction. This last part is important and can be observed fundamentally in the derivations of BRDFs, as we will now explore.

We will try to observe this description of the BRDF by looking at the Lambertian BRDF. The Lambertian BRDF is known as a BRDF that has constant radiance, independent of viewing direction, but how this was derived is not always understood. The foundation of it lies in Lambert’s cosine law. Wikipedia reads: “Lambert’s cosine law says that the radiant intensity or luminous intensity observed from an ideal diffusely reflecting surface or ideal diffuse radiator is directly proportional to the cosine of the angle θ between the direction of the incident light and the surface normal.” Note that in this description, the direction of incident light is with respect to the observer, which means it is the direction of outgoing light with respect to the surface. This basically says that the number of photons per unit solid angle depends on the angle between the viewing direction and the surface normal, following a cosine distribution. The following image illustrates this more clearly:

aaj0u

As we can see, the outgoing radiant intensity forms a cosine lobe around the surface normal. We can start forming our Lambertian BRDF with this: there is some irradiance, and the radiant intensity reflected from that irradiance follows a cosine distribution, so we have cos(\theta_{o}) so far as the function that determines the fraction of incident energy reflected in a direction. However, as stated before, the BRDF must relate incident energy (irradiance) to radiance, so it is not enough to just define the distribution of energy; we must also factor in projected area. The question now is, how do we bring projected area into our incomplete BRDF? At viewing directions that are almost orthogonal to the emitting surface normal, the apparent surface area of the emitting surface from the viewer’s perspective is very small. In fact, the emitting surface appears largest when viewed from a direction parallel to its normal, and its apparent area drops off as the angle between the normal and the viewing direction grows. For a given radiant intensity, the smaller the apparent area of the emitting surface, the larger the radiance. Thus radiance is inversely proportional to cos(\theta_{o}), and we must incorporate this into our BRDF. Using Lambert’s cosine law and accounting for projected area, we have \frac{cos(\theta_{o})}{cos(\theta_{o})}, which is just 1. This matches what we know about Lambertian diffuse: radiance is independent of viewing direction.
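A tiny numeric sketch of this cancellation (the numbers are arbitrary): the radiant intensity falls off with cos(\theta_{o}), the projected area of the emitter falls off by the same factor, and their ratio, the radiance, stays constant.

```python
import math

# Lambert's cosine law sketch: outgoing radiant intensity follows a cosine
# lobe, but the projected (apparent) area of the emitter shrinks by the same
# cosine, so radiance (intensity per projected area) is viewing-independent.
def radiance(theta_o, I0=5.0, area=1.0):
    intensity = I0 * math.cos(theta_o)         # cosine lobe
    projected_area = area * math.cos(theta_o)  # emitter as seen by the viewer
    return intensity / projected_area

for deg in (0, 30, 60, 85):
    print(radiance(math.radians(deg)))  # 5.0 at every angle (up to float error)
```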

However, we do not yet have the well-known \frac{1}{\pi}. In order to reach that form, we need to normalize the BRDF by forcing an energy conservation constraint on it. There are many different explanations for this constraint; I will present the one that makes the most physical sense to me. If the BRDF is fundamentally a fraction of incident energy scattered in an outgoing direction weighted by its apparent area, then if we undo the apparent area weighting, we should be left with a function that gives the fraction of incident energy scattered in an outgoing direction. That function is f_{r}(\omega_{i}, \omega_{o}) \cdot cos\theta_{o}, relating incident energy to outgoing energy. Energy conservation dictates that the total outgoing energy must not exceed the total incident energy, where total outgoing energy is obtained by integrating the outgoing energy over all directions of the hemisphere. If f_{r}(\omega_{i}, \omega_{o}) \cdot cos\theta_{o} is fundamentally a distribution of energy, then it must integrate to no more than 1. So we have the constraint:

1 \geq \int_\Omega f_{r}(\omega_{i}, \omega_{o}) \cdot cos\theta_{o} \, \mathrm{d} \omega_{o}

For the Lambertian BRDF we have built so far, we have f_{r}(\omega_{i}, \omega_{o}) = 1, which we can plug into our constraint above and then solve the resulting integral to find our normalization factor for energy conservation. We transform from solid angle to spherical coordinates to make the integral easier to solve, where the Jacobian determinant of this transformation is sin(\theta), giving us the integral:

\int_0^{\pi/2} \int_0^{2\pi} 1 \cdot cos(\theta) \cdot sin(\theta) \, \mathrm{d}\phi \mathrm{d}\theta

If we evaluate this integral, the result is \pi, and since our energy conservation constraint dictates that this must be at most 1, our normalization factor is \frac{1}{\pi}, which leads us to the well-known Lambertian diffuse BRDF.
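As a final check, we can estimate the normalization integral with Monte Carlo (a quick sketch: uniform hemisphere sampling, where cos(\theta) is uniform in [0, 1) and the pdf is 1/(2\pi)) and confirm it comes out to \pi:

```python
import math
import random

# Monte Carlo check of the normalization constraint: with f_r = 1, the
# hemispherical integral of cos(theta_o) d_omega evaluates to pi, so
# dividing by pi makes the Lambertian BRDF integrate to exactly 1.
def estimate_integral(samples=200_000, seed=1):
    random.seed(seed)
    # Uniform hemisphere sampling: cos(theta) uniform in [0, 1), pdf = 1/(2*pi).
    total = sum(random.random() for _ in range(samples))  # sum of cos(theta_o)
    return (total / samples) * 2.0 * math.pi              # E[cos] / pdf

est = estimate_integral()
print(est)        # ≈ pi
print(1.0 / est)  # ≈ 1/pi, the Lambertian diffuse BRDF
```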