Programmable Blending on Mobile and Desktop GPUs (OpenGL)

Sunrise...

In the latest iteration of iOS (iOS 6), Apple has exposed a new OpenGL ES extension: GL_APPLE_shader_framebuffer_fetch. Even though it is limited to mobile platforms, this extension is interesting because it brings programmable blending. In all current OpenGL implementations, blending is configurable (glBlendFunc) but not programmable.

In short, GL_APPLE_shader_framebuffer_fetch allows a fragment shader to read the current value of the framebuffer. This value can then be combined (that's the programmable blending part) with the color computed by the fragment shader to update the framebuffer.


Update: NVIDIA already exposes the same functionality on its Tegra platform via GL_NV_shader_framebuffer_fetch. According to the spec, GL_NV_shader_framebuffer_fetch has existed since 2006. More info can be found HERE.


Here is the overview of GL_APPLE_shader_framebuffer_fetch from the specification:

Conventional OpenGL blending provides a configurable series of operations
that can be used to combine the output values from a fragment shader with
the values already in the framebuffer. While these operations are
suitable for basic image compositing, other compositing operations or
operations that treat fragment output as something other than a color
(normals, for instance) may not be expressible without multiple passes or
render-to-texture operations.

This extension provides a mechanism whereby a fragment shader may read
existing framebuffer data as input. This can be used to implement
compositing operations that would have been inconvenient or impossible with
fixed-function blending. It can also be used to apply a function to the
framebuffer color, by writing a shader which uses the existing framebuffer
color as its only input.

GL_APPLE_shader_framebuffer_fetch introduces a new built-in variable in GLSL: gl_LastFragData. gl_LastFragData is actually an array:

#extension GL_APPLE_shader_framebuffer_fetch : enable
vec4 gl_LastFragData[gl_MaxDrawBuffers];

For example, additive blending can be achieved with something like this:

#extension GL_APPLE_shader_framebuffer_fetch : require
void main()
{
  vec3 c = get_some_kool_color();
  gl_FragColor.rgb = c + gl_LastFragData[0].rgb;
  gl_FragColor.a = 1.0;
}
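
As a further example, here is a hedged sketch of a compositing operation that cannot be expressed with fixed-function blending: a Photoshop-style "overlay" blend, which chooses per channel between multiply and screen depending on the destination value (u_src_color is a hypothetical uniform standing in for the incoming fragment color):

#extension GL_APPLE_shader_framebuffer_fetch : require
precision mediump float;

uniform vec3 u_src_color; // hypothetical source color

void main()
{
  vec3 dst = gl_LastFragData[0].rgb; // current framebuffer color
  // Per channel: multiply where dst < 0.5, screen where dst >= 0.5.
  vec3 blended = mix(2.0 * u_src_color * dst,
                     1.0 - 2.0 * (1.0 - u_src_color) * (1.0 - dst),
                     step(0.5, dst));
  gl_FragColor = vec4(blended, 1.0);
}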

Now the important question: will we see framebuffer fetching on current desktop graphics hardware in the near future? The very short answer is: no. And the short explanation is: the design of the graphics hardware on mobile platforms makes it possible to access the contents of the framebuffer from the fragment shader, while this is not possible on current desktop GPUs because it would involve significant architectural changes. Maybe in a few years (NVIDIA Maxwell or AMD Pirate Islands GPUs?)…

For the curious reader, here’s a more detailed explanation that @grahamsellers gave me about the difference between mobile and desktop GPUs:


The reason that this works (or is possible) on the mobile cores is that they are tile based deferred renderers and shade all the geometry that touches a tile in huge batches. When the GPU is about to shade a tile, the contents of the tile are brought into on chip registers (not even memory, real registers) and stored there for the life of the tile. As registers, the GPU has extremely low latency access to them and so can provide their current values to the shader core.

On a desktop core, although rendering is still tile-based, it is not deferred and we have more of an immediate mode architecture where primitives are rendered as they are received by the core and are not batched up. The tile based blending hardware (the render backend) still keeps data on-chip, but using more of a traditional cache architecture. The shader core has no direct access to the contents of the cache. Also, because geometry is not sorted prior to blending, there is dedicated hardware in the backend to re-sort all the fragments to keep blending order correct. Primitives may be shaded out of order, so while the shader is running the most recent data isn’t even in the cache yet, which means having access to it really wouldn’t help.





20 thoughts on “Programmable Blending on Mobile and Desktop GPUs (OpenGL)”

  1. Daniel Rakos

    Once again, Apple was not the first to expose this extension. It is available on at least all Tegra-powered Android devices in the form of the NV_shader_framebuffer_fetch extension. Considering that the Apple and NVIDIA extensions are functionally equivalent, I guess we are talking about the same extension, just rebranded by Apple.

  2. JeGX Post Author

    Thanks Daniel. I just read the NVIDIA spec, and GL_NV_shader_framebuffer_fetch and GL_APPLE_shader_framebuffer_fetch are the same extension. The same built-in, gl_LastFragData, is present in both extensions.

  3. Christophe riccio

    The Tegra GPU is not a tile-based GPU, so I assume that this is implemented with round-trip graphics memory accesses; maybe not too inefficient performance-wise, but certainly power inefficient.

    I am absolutely convinced that this could be better implemented on desktop GPUs, in a more programmable way, if a part of the LDS (and maybe the GDS) could be used and exposed in the fragment shader.

  4. Christophe riccio

    *That would leave a lot of research work to do to implement OIT efficiently, for example, but in case of success we could be much faster. The issue is how to generate the parameter lists for each tile, which, with the rise of the compute shader stage, doesn't sound impossible.

    => More compute
    => Less bandwidth
    => More power efficient
    => More memory efficient

  5. Przemysław Lib

    Why didn't Apple use NV_shader_framebuffer_fetch then?

    Oh, I know 😀 It would be an NVIDIA extension on PowerVR hardware under Apple's OpenGL stack 😀

  6. Przemysław Lib

    Oh. Apple went a third way!

    They expose GL_EXT_shader_framebuffer_fetch (on iPod + iOS 6 at least).

  7. jK

    The extensions are missing depth & stencil data 😡

    And I still don't understand why desktop GPUs aren't able to implement programmable blending via an additional shader stage (even though having it in the fragment shader is nicer). Sure, the highly optimized fixed-function blenders are much faster than a programmable one, but the same was likely true when fragment shaders were added.
    All the CopyToTexture & PingPong solutions used right now aren't fast either.

  8. Daniel Rakos

    @jK
    The reason desktop GPUs don't implement programmable blending is that they have sophisticated caches and in-order pipes that allow extremely fast blending and framebuffer writes; even multiple triangles that touch the same pixel can theoretically be processed in parallel. This is not true for mobile GPUs afaik, but they don't perform that fast either.

    Also, having a separate shader stage for blending would be overkill; even if it existed, it should be part of the fragment shader, as otherwise you would have to submit twice as many threads to the GPU cores.

    I'm not saying that programmable blending is impossible to implement on desktop GPUs, but it doesn't help much, as blending is still not order independent. You have to use linked-list buffers to do efficient order-independent transparency anyway. Sorting objects on the GPU is just not efficient.

    Also, you don't have to do any copy-to-texture or ping-pong on desktop. First, you can attach a texture to an FBO, so there is no need for copy-to-texture. Second, you can bind the same texture for texturing that you use as a framebuffer attachment and use NV_texture_barrier to ensure consistency; with this you can more or less implement programmable blending, you just have to call glTextureBarrierNV between draw calls to ensure that framebuffer writes are flushed. [A sketch of this approach appears after the comments.]

  9. Martin

    Well, in my understanding, NV_texture_barrier makes it pretty clear that you don’t even need to call glTextureBarrierNV as long as you read and write to the same pixel (“Specifically, the values of rendered fragments are undefined if any shader stage fetches texels and the same texels are written via fragment shader outputs, even if the reads and writes are not in the same Draw call, unless any of the following exceptions apply: […] – There is only a single read and write of each texel, and the read is in the fragment shader invocation that writes the same texel (e.g. using “texelFetch2D(sampler, ivec2(gl_FragCoord.xy), 0);”).”)

    Thus, any hardware that supports NV_texture_barrier could probably just as well support EXT_shader_framebuffer_fetch. Or am I missing something?

  10. Daniel Rakos

    Yes, you are missing something. When you render a scene, objects overlap at least partially, so you have to call glTextureBarrierNV between draw calls for sure. Also, if you have concave polyhedra, the same overlap scenario might occur. Even with convex polyhedra, if you don't have backface culling enabled (which is likely for transparent surfaces), you can have overlaps.

    But it isn't even worth talking about this: you should not forget that blending is not commutative (at least most blending functions aren't), so programmable blending doesn't solve the problem of rendering transparent surfaces. Order-independent transparency is what solves it.

    Of course, if what you want to achieve is not rendering generic transparent objects, but instead e.g. decals or UI elements (which usually don’t overlap with themselves), then you can use NV_texture_barrier to implement programmable blending without the overhead of OIT.

  11. Martin

    @Daniel: I would assume that GL_EXT_shader_framebuffer_fetch also has the problems with order-independent rendering that you are describing, don’t you think so?

  12. jK

    glTextureBarrierNV was only an extension until GL 4.2, so you cannot/couldn't expect it to exist.
    Nor does it work, afaik, when polygons overlap in a single draw call.
    Nevertheless, programmable blending shaders are not just useful for OIT, but also for smooth particles, volume fog, and more complex blending (color inversion with float textures, contrast-aware blending, …).

  13. Daniel Rakos

    > Nor does it work, afaik, when polygons overlap in a single draw call.
    That's what I said too.

    > Nevertheless, programmable blending shaders are not just useful for OIT, but also for smooth particles, volume fog, and more complex blending (color inversion with float textures, contrast-aware blending, …).

    Exactly, but all that can be done using shader image load/store. Also, OIT cannot be done with EXT_shader_framebuffer_fetch, while with image load/store it can. [Sketches of the image load/store approach and of a per-pixel linked list appear after the comments.]

  14. Martin

    You lost me. I was replying to this comment by Daniel:

    > Also, you don't have to do any copy-to-texture or ping-pong on desktop. First, you can attach a texture to an FBO, so there is no need for copy-to-texture. Second, you can bind the same texture for texturing that you use as a framebuffer attachment and use NV_texture_barrier to ensure consistency; with this you can more or less implement programmable blending, you just have to call glTextureBarrierNV between draw calls to ensure that framebuffer writes are flushed.

    My point being that if you want to use NV_texture_barrier to implement programmable blending, you don’t have to call glTextureBarrierNV. I think this is pretty clear in the specification of the extension.

  15. Radek Mackowiak

    Sadly this extension doesn’t work for me. 🙁

    F-Shader:
    #extension GL_APPLE_shader_framebuffer_fetch :require

    varying lowp vec4 colorVarying;

    void main(void) {
      gl_FragColor = gl_LastFragData[0] + vec4(colorVarying.x, colorVarying.y, colorVarying.z, 1.0);
    }

    Debug output:
    extension ‘GL_APPLE_shader_framebuffer_fetch’ is not supported

    Tried to run it on the iOS 6.0 iPad Simulator and on an actual iPad running iOS 6.0.
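
For the curious reader again, here is a minimal GLSL sketch of the NV_texture_barrier trick that Daniel describes in comment 8 and Martin discusses in comments 9 and 14. It assumes the texture bound to u_framebuffer is also the color attachment of the current FBO (the single read/write-per-texel exception quoted from the spec above); the names are hypothetical and the code is untested:

#version 330 core

uniform sampler2D u_framebuffer; // also bound as the FBO color attachment
uniform vec4 u_src_color;        // hypothetical incoming fragment color

out vec4 frag_color;

void main()
{
  // Read back what is already in the framebuffer at this pixel.
  vec4 dst = texelFetch(u_framebuffer, ivec2(gl_FragCoord.xy), 0);

  // Any custom blend can go here; simple additive blending as in the article.
  frag_color = vec4(u_src_color.rgb + dst.rgb, 1.0);
}

As Daniel points out, glTextureBarrierNV() still has to be called between draw calls whenever geometry from different draws overlaps.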
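
Along the same lines, here is a minimal sketch of programmable blending through shader image load/store, as Daniel suggests in comment 13. The names are hypothetical, and ordering between overlapping fragments is not guaranteed here:

#version 420 core

layout(binding = 0, rgba8) coherent uniform image2D u_color_buffer;

uniform vec4 u_src_color; // hypothetical incoming fragment color

void main()
{
  ivec2 coord = ivec2(gl_FragCoord.xy);

  // Read-modify-write of the color buffer from the fragment shader.
  vec4 dst = imageLoad(u_color_buffer, coord);
  imageStore(u_color_buffer, coord, vec4(u_src_color.rgb + dst.rgb, 1.0));
}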
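
Finally, since comments 10 and 13 mention linked-list buffers for order-independent transparency, here is a sketch of the build pass of a per-pixel linked list using image load/store and an atomic counter. A separate resolve pass would then sort and blend each pixel's list; the names are hypothetical and node-buffer overflow is not handled:

#version 420 core

layout(early_fragment_tests) in;

// One head pointer per pixel, cleared to 0xFFFFFFFFu every frame.
layout(binding = 0, r32ui) coherent uniform uimage2D u_head_pointers;
// Node storage: packed color, depth bits, next index, unused.
layout(binding = 1, rgba32ui) coherent uniform uimageBuffer u_node_buffer;
layout(binding = 0, offset = 0) uniform atomic_uint u_node_counter;

in vec4 v_color; // hypothetical interpolated surface color

void main()
{
  // Allocate a node and push it at the head of this pixel's list.
  uint node = atomicCounterIncrement(u_node_counter);
  uint prev = imageAtomicExchange(u_head_pointers, ivec2(gl_FragCoord.xy), node);

  imageStore(u_node_buffer, int(node),
             uvec4(packUnorm4x8(v_color), floatBitsToUint(gl_FragCoord.z), prev, 0u));
}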
