
The disastrous failures to optimize dual-depth peeling with NSight

In this post I will do my best to explain all my attempts at optimizing the above algorithm. All of them were unsuccessful, so these two guys at NVIDIA (Louis Bavoil, Kevin Myers) did a pretty good job.

However, I managed to get a few extra percent of utilization out of the CROP unit (from 32–33% to 41.4% for All Actions). If you do not know what all this is about, it will become clear once you read the blog post.

First let’s start with a brief explanation of what the algorithm does and why the heck it might be useful to you under certain circumstances.

Dual Depth Peeling is an algorithm that implements order-independent transparency (OIT). The core of the algorithm is to peel two layers of your geometry at a time: one at the front and one at the back. Traditional peeling algorithms require drawing all of the geometry once per layer; this one halves that, which makes it way more useful if you have a lot of transparent planes and objects. On top of that, it can be implemented in such a way that it does blending on the fly.

To blend the front layer we need to go front-to-back with the UNDER operator:

Cdst = Adst (Asrc Csrc) + Cdst, and when pre-multiplied alpha is used, Cdst = Adst Csrc + Cdst

To blend the back layer we use back-to-front alpha blending with the OVER operator:

Cdst = Asrc Csrc + (1 – Asrc) Cdst, and when pre-multiplied alpha is used Cdst = Csrc + (1 – Asrc) Cdst
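To make the two operators concrete, here is a small single-channel CPU sketch (the names and structure are mine, not from the original source) showing that compositing front-to-back with UNDER and back-to-front with OVER give the same result:

```cpp
#include <cassert>
#include <cmath>

struct Color { float C; float A; }; // straight (non-premultiplied) alpha

// OVER, back-to-front: Cdst = Asrc*Csrc + (1 - Asrc)*Cdst,
// starting from the background and walking from the farthest layer in.
float OverBackToFront(const Color *Layers, int Count, float Background)
{
    float Dst = Background;
    for(int i = Count - 1; i >= 0; --i)
        Dst = Layers[i].A * Layers[i].C + (1.0f - Layers[i].A) * Dst;
    return Dst;
}

// UNDER, front-to-back: Cdst = Adst*(Asrc*Csrc) + Cdst, where Adst is the
// alpha still transmitted by the layers already composited in front.
float UnderFrontToBack(const Color *Layers, int Count, float Background)
{
    float Dst = 0.0f;
    float TransmittedA = 1.0f;
    for(int i = 0; i < Count; ++i)
    {
        Dst += TransmittedA * Layers[i].A * Layers[i].C;
        TransmittedA *= 1.0f - Layers[i].A;
    }
    return Dst + TransmittedA * Background; // composite over the background
}
```

Both walks visit the same layers in opposite orders, which is exactly why the front and back peels can be blended independently and joined at the end.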

At this point you might be like: alright, good, but how do we actually separate the layers two at a time? The trick is to use GL_MAX as the blending equation. The first thing you do is a “warm up” pass over the geometry that writes only the nearest and the farthest depth, so the only line in the fragment shader of the depth init program is Color = vec2(-gl_FragCoord.z, gl_FragCoord.z); which stores these values in two textures (color attachments). Why two textures? We need them for ping-ponging (a technique) later in the loop iterations, to avoid read-modify-write hazards (you can’t read from and write to the same texture on the gpu).

Then you run a loop of N iterations (however many you decide is visually most pleasing without disheartening artifacts). In each of these iterations (geometry passes) you compare the fragment’s depth against the values sampled from the init pass: if it lies on the nearest or the farthest layer you peel it, otherwise you forward it and keep it until its depth layer comes.

if(Depth < NearestDepth || Depth > FarthestDepth)
{
    //NOTE(enev): Skip this depth in the peeling algorithm.
    DepthStorage.xy = vec2(-MAX_DEPTH);
    return false;
}

if(Depth > NearestDepth && Depth < FarthestDepth)
{
    //NOTE(enev): This fragment needs to be peeled again!
    DepthStorage.xy = vec2(-Depth, Depth);
    return false;
}
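The GL_MAX trick from the init pass is easy to sanity-check on the CPU. This sketch (hypothetical names, my own code) emulates many fragments MAX-blending vec2(-z, z) into a cleared target, which leaves exactly the negated nearest depth and the farthest depth behind:

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>

struct Vec2 { float x, y; };

// Emulates the init pass: every fragment writes vec2(-z, z) and GL_MAX
// blending keeps the component-wise maximum, so the target ends up holding
// (-NearestDepth, FarthestDepth) no matter the draw order.
void MaxBlendDepthInit(const float *Depths, int Count, Vec2 *Dest)
{
    Dest->x = -FLT_MAX; // target cleared to vec2(-MAX_DEPTH)
    Dest->y = -FLT_MAX;
    for(int i = 0; i < Count; ++i)
    {
        Dest->x = std::max(Dest->x, -Depths[i]);
        Dest->y = std::max(Dest->y,  Depths[i]);
    }
}
```

Negating the near depth turns a min into a max, which is what lets a single GL_MAX equation capture both ends of the depth range at once.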

And peeling essentially comes down to a simple if statement like the following!

//NOTE(enev): Prevent further peeling
DepthStorage.xy = vec2(-MAX_DEPTH);
if(FragDepth == NearestDepth)
{
   FrontStorage += TexelColor * AlphaMultiplier;
}
else
{
   BackStorage += TexelColor;
}

At the end of each iteration you have to blend the back storage samples separately, because if you attempt to do it on the fly the max blending will kill you. (or so you might think!)

Each subsequent iteration blends the remaining fragments until you either blend all of them or reach the max iterations of the loop.

Finally, we need to blend our last front and back buffers with the respective front alpha in one last screen pass.

float AlphaMultiplier = 1.0 - FrontColor.a; //NOTE(enev): Remaining alpha
Color.rgb = FrontColor.rgb + BackBlender * AlphaMultiplier;
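Putting all of the pieces together, here is a CPU sketch of the whole scheme (single color channel, premultiplied alpha, hypothetical names, my own code): each pass peels the nearest remaining layer into the front blender with UNDER and the farthest into the back blender with OVER, and the result matches a plain sorted back-to-front composite:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Layer { float Depth; float R; float A; }; // premultiplied color, one channel

// Reference: sort by depth and composite back-to-front with OVER.
float SortedOver(std::vector<Layer> Layers, float Background)
{
    std::sort(Layers.begin(), Layers.end(),
              [](const Layer &A, const Layer &B) { return A.Depth > B.Depth; });
    float Dst = Background;
    for(const Layer &L : Layers)
        Dst = L.R + (1.0f - L.A) * Dst; // premultiplied OVER
    return Dst;
}

// Dual depth peeling on the CPU: peel the nearest and the farthest remaining
// layer each pass, forward everything in between to the next pass.
float DualPeel(std::vector<Layer> Layers, float Background)
{
    float FrontR = 0.0f, FrontA = 0.0f; // front blender, UNDER accumulation
    float BackBlender = Background;     // back blender starts at the background
    while(!Layers.empty())
    {
        // Stand-in for the depth extremes the GL_MAX init pass produces.
        float Nearest = 1e30f, Farthest = -1e30f;
        for(const Layer &L : Layers)
        {
            Nearest  = std::min(Nearest,  L.Depth);
            Farthest = std::max(Farthest, L.Depth);
        }
        std::vector<Layer> Remaining;
        for(const Layer &L : Layers)
        {
            if(L.Depth == Nearest) // nearest layer blends UNDER into the front
            {
                FrontR += (1.0f - FrontA) * L.R;
                FrontA += (1.0f - FrontA) * L.A;
            }
            else if(L.Depth == Farthest) // farthest layer blends OVER the back
            {
                BackBlender = L.R + (1.0f - L.A) * BackBlender;
            }
            else
            {
                Remaining.push_back(L); // forward it to a later pass
            }
        }
        Layers = Remaining;
    }
    // The final fullscreen pass: back blender under the accumulated front.
    return FrontR + (1.0f - FrontA) * BackBlender;
}
```

Note that no sorting ever happens in DualPeel; the depth comparisons alone put every layer on the correct side in the correct order.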

I strongly recommend taking a look at the original paper and downloading the source; compile it and play with it before proceeding further in the post. You might also want to download the NVIDIA graphics debugger NSight (thank god they finally made it so that I can get rid of using Visual Studio). It is quite a useful tool.

What this beautiful algorithm does is let us avoid sorting of any kind, and when it comes to managing depth you can hardcode your values if you wish and forget about it (I do not recommend it). On top of that, if you have a 2D game that does not have much geometry, and if, like me, you want one rendering algorithm to solve all of your problems, this post is probably worth looking at.

The bad side of this technique is that it does multiple passes over the geometry, as already mentioned, and it also does heavy blending into a lot of buffers, which punishes the gpu quite a bit. I will talk about that in the next section of this post!

Alright, now that we are warmed up: my game, which, surprise surprise, is a 2D game with transparent objects and a relatively low amount of geometry. I decided, for the fun of it, to render everything with this algorithm: even the profiler, the introspection tool, and the memory management utility, all of which are transparent windows with a ton of text and widgets.

How did I notice there was a problem? Well, at Full HD (all color attachments of the framebuffer are of the same size) there was a slight decrease in framerate when I tried to draw absolutely everything the engine and the game can offer. Amusingly enough, it later turned out that the slowness was because of the glBegin and glEnd queries, and the algorithm was performing very well with a hard-coded number of iterations like 4 or 5 peels.

The final result of my unsuccessful optimizations was an 8–9% increase in the utilization of the CROP unit of the gpu, which in theory was our bottleneck. I traded that for lost performance on every pixel in my back peels, plus one more devious issue, which I will reveal at the end of the post. All attempts were done at 1920×1080 on a single CPU core, with around 10 ms of total rendering and blending on the gpu plus around 2–3 ms of cpu data upload time. That leaves around 2–3 ms for the cpu to do other stuff. Ah, and by the way, my gpu is nothing fabulous: an NVIDIA GTX 1050 Ti.

Let's now dive into my reasoning and guesses about what went wrong and how I attempted to fix it. Even though these guesses proved to be wrong, they might help in future performance investigations.

Firstly, I examined my game’s performance and noticed I was gpu bound when trying to draw absolutely everything at once. That means every window and layout, plus the two passes over the game map (one for the debug view and one for the actual gameplay, overlaid on top of each other). None of this drawing would normally happen under shipping circumstances, but if you know me, I wasn’t happy. I figured this out with the NSight profiler.

I made a couple of screen shots for you to get acquainted with the data.

initial version of the depth peeling algorithm
non-optimized

If you have never tried to do optimization on the gpu (like me), at first it might seem a little strange. Let's keep going: we will take a look at the top interconnecting units and see how much each of them was utilized for the given task. At the top of the pipeline overview is the CROP unit, which means we are bound on the blending unit of the gpu. Not only that, our utilization percentage, or as NVIDIA refers to it, SOL (speed of light), is very low. If you hover with your mouse over the image, you can see that under the pipeline overview section the top SOL unit is the CROP one, with a utilization of 34.6%.

I started investigating how I could possibly increase the utilization of the unit. A couple of things came up. One of them was to lower the amount of work this unit does. How? There was not much to do about that, since the algorithm itself relies on heavy blending of draw buffers. Well then, can I lower their sizes? The depth peel is the only attachment that could be compressed further, so what would happen if we attempt to cut the depth precision in half?… Black screen… Why? Well, I had problems with the floats: at first I thought I would just move the bits into the respective slots from f32 to f16 and be done with it; however, that was not compliant with IEEE 754, and it gave me the black screen. After attempt two, things started looking a “little” bit nicer, but the incorrect depth comparisons were immediately apparent, leading to artifacts.
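The bit-shifting mistake is easy to make: a float32 and a float16 differ not just in width but in exponent bias, so the exponent has to be re-biased and the edge cases handled. A minimal truncating conversion (my own sketch, not the code from the engine; round-to-nearest and f16 subnormals omitted for brevity) looks like this:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Naively shifting the f32 mantissa/exponent bits into the f16 slots breaks,
// because the exponent bias changes (127 -> 15) and the overflow/underflow
// cases must be handled explicitly.
uint16_t F32ToF16(float Value)
{
    uint32_t Bits;
    std::memcpy(&Bits, &Value, sizeof(Bits));
    uint16_t Sign = (uint16_t)((Bits >> 16) & 0x8000);
    int32_t Exponent = (int32_t)((Bits >> 23) & 0xFF) - 127 + 15; // re-bias
    uint32_t Mantissa = Bits & 0x007FFFFF;
    if(Exponent >= 31) return Sign | 0x7C00; // overflow -> infinity
    if(Exponent <= 0)  return Sign;          // flush subnormals to zero
    return Sign | (uint16_t)(Exponent << 10) | (uint16_t)(Mantissa >> 13);
}
```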

Another suggestion was to lower the number of color attachments of our main rendering framebuffer, so I thought about how to do that. Our initial setup had 7 color attachments in total: 2 for the front blend, 2 for the back storage, 2 for the depth buffers, and 1 more to blend the back storage each iteration. 7 was really rough!

And the last, more reliable, suggestion for optimization was to try to spread the load across different units of the pipeline.

I figured I could combine the second and third suggestions. If we can’t lower the buffer sizes, can we move some of the workload to other units? Well, yes! We can blend the back the same way we blend the front, on the same two back buffers. However, to do that we need to work around the GL_MAX blending equation somehow. The idea is to store our colors in a floating point format and, every pass, add a bogus value to the newly blended color, increasing with each pass, so that the blend equation will always pick it up. For this to work we set the texture format of our back peels to GL_RGB32F.

Now you might say: but hey, wait a minute, aren’t we trying to minimize the size of the color buffers instead of increasing them? The two back peels are now 3 times bigger; how is that a win? We receive some benefits which at first are hidden. First, notice that blending the back fragments previously required a fullscreen pass with the most recent back peel every iteration, and we eliminated that. Some of the more initiated audience might think we could just set a different blending equation per attachment, but we can’t really do that unless we are above OpenGL 4.0. This engine targets the biggest possible audience, therefore we need to stick with the lowest possible version, and that is 3.3.

So what I did every pass was a little bit more computation, utilizing some free SM cycles. Every fragment going through the back peel does a mod with what I call a bogus value against what was previously in the back peeled layer, to get the color values back between 0 and 1. Our strategy is to add a number (bogus_value) that increases with each pass, so that GL_MAX will pick it up every time! Then we do an extract (again with mod), but that extraction happens only once, at the end of all computations, in the inevitable back and front blend, so we don’t mind it.
At this point it worked fine, but it was way slower, and I needed to do something more about it. So I changed the GL_RGB32F format to the smaller GL_RGB16F. Now, in color attachment sizes, we were equal to the initial version, but we were still slower, exactly because of the mod overhead and the extraction computation. I went further and decided to use GL_R11F_G11F_B10F, which fits perfectly in a dword, meaning we could actually save on color buffer sizes. Now we have 6 attachments. There was a slight issue, however: the precision of these 11-bit floats. I initially chose my bogus_value to be 10.0, pretty quickly exceeded the exact precision range, and got wild artifacts. When I read about the 11-bit float format I saw that it can represent values exactly, but they had to be below 32. That was the clue to change the bogus_value to just 2.0. That was the final little optimization trick I could think of, but this new implementation was still slower, even though we got 8–10% better utilization of the CROP unit thanks to having 6 instead of 7 color attachments.

#define BOGUS_VALUE 2.0 //NOTE(enev): The smallest possible increment for R11G11B10
#define ONE_OVER_BOGUS_VALUE (1.0 / BOGUS_VALUE)

void OutputToPeeledTextures(...)
{
    DepthStorage.xy = vec2(-MAX_DEPTH);
    if(FragDepth == NearestDepth)
    {
        float AlphaMultiplier = 1.0 - FrontAlpha;
        FrontStorage += TexelColor * AlphaMultiplier;
    }
    else
    {
        //NOTE(enev): The blending on the back can't happen unless we add a
        //bogus value that is incremented in each subsequent blend of the same
        //fragment; otherwise our value won't be picked, because our blending
        //equation is currently MAX blend.

        //NOTE(enev): mod(BackTempStorage, BOGUS_VALUE), written out by hand.
        float Div = BackTempStorage.r * ONE_OVER_BOGUS_VALUE;
        //NOTE(enev): We can use direct truncation instead of floor(Div), since
        //we know our value never exceeds the int max or min and is always positive.
        float Whole = BOGUS_VALUE * int(Div);
        vec3 DistributedWhole = vec3(Whole, Whole, Whole);
        vec3 Remainder = BackTempStorage.rgb - DistributedWhole;

        //NOTE(enev): Add the bogus value after the computation, so that we can
        //always pick what we blended, even with MAX blend!
        DistributedWhole += vec3(BOGUS_VALUE, BOGUS_VALUE, BOGUS_VALUE);
        BackTempStorage.rgb = Remainder * (1.0 - TexelColor.a) + TexelColor.rgb + DistributedWhole;
    }
}
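The encode/extract round trip in the shader above can be mirrored on the CPU to convince yourself it survives GL_MAX. This is my own sketch with hypothetical names, single channel, premultiplied alpha:

```cpp
#include <cassert>

static const float BOGUS_VALUE = 2.0f;

// Strip the previous pass's accumulated offset, blend the new fragment OVER
// the recovered color, then add one more BOGUS_VALUE step so that GL_MAX
// always picks the freshly written value over the stored one.
float EncodeBackBlend(float Stored, float SrcPremulColor, float SrcAlpha)
{
    float Whole = BOGUS_VALUE * (float)(int)(Stored / BOGUS_VALUE); // truncation is fine: always positive
    float Remainder = Stored - Whole;                               // previous blended color in [0, 1)
    float Blended = SrcPremulColor + (1.0f - SrcAlpha) * Remainder; // OVER, premultiplied alpha
    return Blended + Whole + BOGUS_VALUE;                           // new value > Stored, so MAX keeps it
}

// One extraction at the very end recovers the plain blended color.
float DecodeBackBlend(float Stored)
{
    return Stored - BOGUS_VALUE * (float)(int)(Stored / BOGUS_VALUE);
}
```

Because each encoded value adds one more BOGUS_VALUE step on top of the recovered offset, it is strictly greater than whatever was stored before, so MAX blending always keeps the newest blend.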
failure at optimizing
“optimized”

Why was it slower? It turned out that what made the first implementation faster, despite its 7 attachments and extra blending, was that the work happened simultaneously. At the end of each loop iteration the current back peel is blended into the back blender, but during that time the previous peel is not occupied: it can already be cleared and made ready for use by the next iteration's shader program as the third draw buffer.

for(u32 Pass = 1; Pass < MAX_DUAL_DEPTH_PEEL_ITERATIONS; Pass++)
{
   u32 CurrentPassID = Pass % 2;
   u32 PrevID = 1 - CurrentPassID;
   u32 RenderBufferID = CurrentPassID * 3;
   //...Do all the clears of 0,1,2 or 3,4,5 - 0/3 with vec2(MIN_DEPTH)

   //Set the draw buffers and the equation.
   glDrawBuffers(3, &OpenGLAllColorAttachments[RenderBufferID]);
   glBlendEquation(GL_MAX);

   //NOTE(enev): Render target 0/3: 32F DEPTH COMPONENT,
   //Render target 1/4: RGBA MAX BLEND - front colors,
   //Render target 2/5: RGB MAX BLEND - back colors.
   //You can run however many shader programs you want sequentially; mine were 6.
   BeginDepthPeel(Active);
   BindTexture(1, Peel->DepthAttachments[PrevID], Active->Common.Peel.DepthTexLoc);
   BindTexture(2, Peel->FrontBlenderAttachments[PrevID], Active->Common.Peel.FrontBlenderTexLoc);
   //=========================================
   //NOTE(enev): My attempted optimization was forwarding and blending the back
   //peels on the fly while eliminating the back blend phase below.
   BindTexture(3, Blend->BackPeelAttachments[PrevID], Active->Common.Peel.BackBlenderTexLoc);
   //===========================================
   DrawActiveEntities(Active, Stack->IndicesPerQuad);
   EndDepthPeel();

   //5 more shader program peels....

   //NOTE(enev): Do the back blend. Notice that it uses the peel attachment with
   //CurrentPassID; on the next iteration the one with PrevID becomes the
   //CurrentPassID, and since there is no contention between the two during the
   //blend of the back peel, the new iteration can start asynchronously,
   //allowing for faster execution.
   //NOTE(enev): Set the back blender as the render buffer.
   glDrawBuffer(OpenGLAllColorAttachments[6]);
   glBlendEquation(GL_FUNC_ADD);
   glBlendFuncSeparate(GL_ONE, GL_ONE_MINUS_SRC_ALPHA, GL_ONE, GL_ONE_MINUS_SRC_ALPHA);
   BeginDualBlend(Blend);
   //NOTE(enev): Jumping between attachments 2 and 5 respectively.
   BindTexture(1, Blend->BackPeelAttachments[CurrentPassID], Blend->TempBackTexLoc);
   DrawScreen();
   EndDualBlend();
}

Nevertheless, it was worth the try! I learned a ton of stuff, and I hope this post was at least a little bit interesting, showing how spreading a bigger workload asynchronously can be much better than shrinking the workload while keeping the flow of execution in order on the GPU!
