Quick Introduction to Mesh Shaders (OpenGL and Vulkan)

I recently added basic mesh shader support to GeeXLab. Here is an overview of mesh shaders for GL/VK developers, based on articles published on the GeeXLab blog.

Mesh shaders are a feature introduced with NVIDIA Turing GPUs. The mesh shading pipeline replaces the regular VTG pipeline (VTG = Vertex / Tessellation / Geometry).

This illustration from NVIDIA shows the mesh shading pipeline versus regular VTG pipeline:

Mesh Shading Pipeline versus regular Vertex / Tessellation / Geometry pipeline

In a few words, a mesh shader program can be seen as the combination of a compute-like shader and a fragment shader. Why compute-like? Because, as with compute shaders, you set the number of threads per work group and can use synchronization functions like barrier(). Mesh shaders are compute-like shaders specialized for graphics tasks.

The mesh shading pipeline adds two new shader stages: the task shader and the mesh shader. I haven’t played with the task shader yet, only with the mesh shader. But basically, a task shader generates work for mesh shaders while a mesh shader generates primitives (points, lines or triangles). A mesh shading pipeline is made of either a task shader, a mesh shader and a pixel shader, or just a mesh shader and a pixel shader (the task shader is an optional stage).

Mesh shaders are available in OpenGL, Vulkan and DirectX 12 Ultimate (DirectX 12 won’t be covered in this article).

The nice thing is that the mesh shader has no defined inputs. You can, for example, generate primitives ex nihilo: in that case there is no input and the primitives (points, lines or triangles) are generated procedurally. But if you want the mesh shader to work on existing data, GPU buffers (uniform, storage) or textures are the way to pass data to the mesh shader.

Adding basic mesh shader support to an existing engine is simple. All you need is to add the new shader stages (OpenGL: GL_MESH_SHADER_NV and GL_TASK_SHADER_NV used with glCreateShader; Vulkan: VK_SHADER_STAGE_MESH_BIT_NV and VK_SHADER_STAGE_TASK_BIT_NV used in VkPipelineShaderStageCreateInfo) and a function to launch the execution of mesh shaders (glDrawMeshTasksNV in OpenGL, vkCmdDrawMeshTasksNV in Vulkan).

Currently, mesh shaders are only supported by NVIDIA Turing GPUs (GeForce RTX 20 Series, GeForce GTX 16 Series). According to some news, AMD RDNA2 GPUs will support mesh shaders too.

You can check the support of mesh shaders by looking at the presence of the GL_NV_mesh_shader extension in OpenGL or the VK_NV_mesh_shader device extension in Vulkan.

Let’s see a very simple GPU program, made up of a mesh shader and a pixel shader, that takes no input and generates an RGB triangle. The following mesh shader comes from the RGB Triangle sample available here:
Triangle Mesh Shader in OpenGL
Triangle Mesh Shader in Vulkan

Mesh Shaders - Triangle in Vulkan

The main objective of the mesh shader is to fill the following built-in output variables:
gl_MeshVerticesNV: vertices array. A triangle has 3 vertices.
gl_PrimitiveIndicesNV: indices array. A triangle has 3 indices, one per vertex.
gl_PrimitiveCountNV: number of primitives. A triangle is one primitive made up of three vertices.

#version 450

#extension GL_NV_mesh_shader : require

layout(local_size_x = 1) in;
layout(triangles, max_vertices = 3, max_primitives = 1) out;

// Custom vertex output block
layout (location = 0) out PerVertexData
{
  vec4 color;
} v_out[];  // [max_vertices]

const vec3 vertices[3] = {vec3(-1,-1,0), vec3(0,1,0), vec3(1,-1,0)};
const vec3 colors[3] = {vec3(1.0,0.0,0.0), vec3(0.0,1.0,0.0), vec3(0.0,0.0,1.0)};

void main()
{
  // Vertices position
  gl_MeshVerticesNV[0].gl_Position = vec4(vertices[0], 1.0);
  gl_MeshVerticesNV[1].gl_Position = vec4(vertices[1], 1.0);
  gl_MeshVerticesNV[2].gl_Position = vec4(vertices[2], 1.0);

  // Vertices color
  v_out[0].color = vec4(colors[0], 1.0);
  v_out[1].color = vec4(colors[1], 1.0);
  v_out[2].color = vec4(colors[2], 1.0);

  // Triangle indices
  gl_PrimitiveIndicesNV[0] = 0;
  gl_PrimitiveIndicesNV[1] = 1;
  gl_PrimitiveIndicesNV[2] = 2;

  // Number of triangles
  gl_PrimitiveCountNV = 1;
}

The pixel shader:

#version 450

layout(location = 0) out vec4 FragColor;

// Input block: same location as the mesh shader output block
layout(location = 0) in PerVertexData
{
  vec4 color;
} fragIn;

void main()
{
  FragColor = fragIn.color;
}

A mesh shader is limited in the number of vertices and primitives it can generate. These two hardware limits are the maximum number of output vertices and the maximum number of output primitives. In OpenGL, you can read them by querying GL_MAX_MESH_OUTPUT_VERTICES_NV and GL_MAX_MESH_OUTPUT_PRIMITIVES_NV with glGetIntegerv.

In Vulkan, you have to read the following members of the VkPhysicalDeviceMeshShaderPropertiesNV structure:
– maxMeshOutputVertices
– maxMeshOutputPrimitives

Here is the dump of the entire VkPhysicalDeviceMeshShaderPropertiesNV structure on my GeForce RTX 2070 with the latest R445.98 driver:

– maxDrawMeshTasksCount => 65535
– maxTaskWorkGroupInvocations => 32
– maxTaskWorkGroupSize => [32;1;1]
– maxTaskTotalMemorySize => 16384
– maxTaskOutputCount => 65535
– maxMeshWorkGroupInvocations => 32
– maxMeshWorkGroupSize => [32;1;1]
– maxMeshTotalMemorySize => 16384
– maxMeshOutputVertices => 256
– maxMeshOutputPrimitives => 512
– maxMeshMultiviewViewCount => 4
– meshOutputPerVertexGranularity => 32
– meshOutputPerPrimitiveGranularity => 32

To launch the previous GPU program, just call glDrawMeshTasksNV in OpenGL or vkCmdDrawMeshTasksNV in Vulkan. In OpenGL, the GPU program must be bound beforehand; in Vulkan, a pipeline built with the GPU program must be bound beforehand.


OpenGL:

unsigned int num_workgroups = 1;
glDrawMeshTasksNV(0, num_workgroups); // (first, count)

Vulkan:

vkCmdBindPipeline(cmdbuf, VK_PIPELINE_BIND_POINT_GRAPHICS, mesh_pipeline);
uint32_t num_workgroups = 1;
vkCmdDrawMeshTasksNV(cmdbuf, num_workgroups, 0); // (taskCount, firstTask)

Too bad that the parameters are not in the same order in OpenGL and Vulkan!

In the previous mesh shader, one thread per work group has been set:

layout(local_size_x = 1) in;

We can set more threads per work group, up to a maximum of 32 (32 is the size of a warp on NVIDIA GPUs; more about warps can be found in this article). This value is the first component of maxMeshWorkGroupSize (Vulkan) or GL_MAX_MESH_WORK_GROUP_SIZE_NV (OpenGL).

For the triangle, we can set the number of threads to 3 (one thread per vertex). In that case the mesh shader can be re-written in a more compact way:

#version 450
#extension GL_NV_mesh_shader : require
layout(local_size_x=3) in; 
layout(max_vertices=3, max_primitives=1) out;
layout(triangles) out;
layout(location = 0) out PerVertexData
{
  vec4 color;
} v_out[];
const vec3 vertices[3] = {vec3(-1,-1,0), vec3(0,1,0), vec3(1,-1,0)};
const vec3 colors[3] = {vec3(1.0,0.0,0.0), vec3(0.0,1.0,0.0), vec3(0.0,0.0,1.0)};
void main()
{
  // Each of the 3 threads writes one vertex, one index and one color.
  uint thread_id = gl_LocalInvocationID.x;
  gl_MeshVerticesNV[thread_id].gl_Position = vec4(vertices[thread_id], 1.0);
  gl_PrimitiveIndicesNV[thread_id] = thread_id;
  v_out[thread_id].color = vec4(colors[thread_id], 1.0);
  gl_PrimitiveCountNV = 1;
}

Let’s quickly talk about meshlets.

As said previously, a mesh shader can output (send to the rasterizer) only a limited number of primitives (points, lines or triangles). For example, on a GeForce RTX 2070, the mesh shader can output a maximum of 256 vertices and 512 primitives.

When the primitive mode is set to triangles, the output of a mesh shader is always a small mesh, called a meshlet. A single triangle is the smallest meshlet.

With these limitations (max output vertices and max output primitives), how can we process/render an existing big mesh with a mesh shader?

A way to render an existing mesh with a mesh shader is to decompose the mesh into multiple meshlets, each meshlet being processed by a work group.

Here is a more detailed definition of a meshlet (source):

What exactly is a Meshlet?
A meshlet is a subset of a mesh created through an intentional partition of the geometry. Meshlets should be somewhere in the range of 32 to around 200 vertices, depending on the number of attributes, and will have as many shared vertices as possible to allow for vertex re-use during rendering. This partitioning will be pre-computed and stored with the geometry to avoid computation at runtime, unlike the current Input Assembler which must attempt to dynamically identify vertex reuse every time a mesh is drawn. Titles can convert meshlets into regular index buffers for vertex shader fallback if a device does not support Mesh Shaders.

Here is a simple technique for creating meshlets from a single mesh (source):

So, a quite viable strategy for creating meshlets is: just scan the index buffer linearly, accumulating the set of vertices used, until you hit either 64 vertices or 126 triangles; reset and repeat until you’ve gone through the whole mesh. This could be done at art build time, or it’s simple enough that you could even do it in the engine at level load time.

A possible structure for a meshlet can be:

struct Meshlet
{
  uint32_t vertices[64];
  uint32_t indices[378]; // 126 triangles => 378 indices
  uint32_t vertex_count;
  uint32_t index_count;
};
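
The linear-scan strategy quoted above can be sketched on the CPU with a structure like this one. This is only a minimal sketch, not the code used in GeeXLab: the names build_meshlets, find_local and flatten are hypothetical, the input is assumed to be a triangle-list index buffer, and the limits are the 64 vertices / 126 triangles mentioned in the quote. The flatten function shows the reverse operation (meshlets back to a regular index buffer, as for a vertex shader fallback).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kMaxVertices = 64;
constexpr uint32_t kMaxIndices = 378;  // 126 triangles

struct Meshlet {
  uint32_t vertices[kMaxVertices] = {};  // indices into the mesh vertex buffer
  uint32_t indices[kMaxIndices] = {};    // meshlet-local indices, 3 per triangle
  uint32_t vertex_count = 0;
  uint32_t index_count = 0;
};

// Returns the local slot of mesh vertex v in m, or m.vertex_count if absent.
static uint32_t find_local(const Meshlet& m, uint32_t v) {
  for (uint32_t i = 0; i < m.vertex_count; ++i)
    if (m.vertices[i] == v) return i;
  return m.vertex_count;
}

// Scans a triangle-list index buffer linearly and cuts it into meshlets.
std::vector<Meshlet> build_meshlets(const std::vector<uint32_t>& mesh_indices) {
  assert(mesh_indices.size() % 3 == 0);
  std::vector<Meshlet> out;
  Meshlet cur;
  for (std::size_t t = 0; t < mesh_indices.size(); t += 3) {
    const uint32_t tri[3] = {mesh_indices[t], mesh_indices[t + 1],
                             mesh_indices[t + 2]};
    // Count the vertices of this triangle that the current meshlet lacks.
    uint32_t missing = 0;
    for (int k = 0; k < 3; ++k) {
      bool seen = find_local(cur, tri[k]) < cur.vertex_count;
      for (int j = 0; j < k && !seen; ++j) seen = (tri[j] == tri[k]);
      if (!seen) ++missing;
    }
    // Start a new meshlet when the triangle would exceed either limit.
    if (cur.vertex_count + missing > kMaxVertices ||
        cur.index_count + 3 > kMaxIndices) {
      out.push_back(cur);
      cur = Meshlet{};
    }
    // Append the triangle, adding unseen vertices on the fly.
    for (int k = 0; k < 3; ++k) {
      const uint32_t slot = find_local(cur, tri[k]);
      if (slot == cur.vertex_count) cur.vertices[cur.vertex_count++] = tri[k];
      cur.indices[cur.index_count++] = slot;
    }
  }
  if (cur.index_count > 0) out.push_back(cur);
  return out;
}

// Expands meshlets back into a regular index buffer.
std::vector<uint32_t> flatten(const std::vector<Meshlet>& meshlets) {
  std::vector<uint32_t> out;
  for (const Meshlet& m : meshlets)
    for (uint32_t i = 0; i < m.index_count; ++i)
      out.push_back(m.vertices[m.indices[i]]);
  return out;
}
```

Note that the indices stored in a meshlet are local (0 to vertex_count-1) and go through the vertices array to reach the mesh vertex buffer. The linear vertex search is fine for 64 entries but is not meant to be fast.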

In this article, NVIDIA recommends a maximum of 64 vertices and 126 primitives (or 3*126 = 378 indices):

We recommend using up to 64 vertices and 126 primitives. The ‘6’ in 126 is not a typo. The first generation hardware allocates primitive indices in 128 byte granularity and needs to reserve 4 bytes for the primitive count. Therefore 3 * 126 + 4 maximizes the fit into a 3 * 128 = 384 bytes block. Going beyond 126 triangles would allocate the next 128 bytes. 84 and 40 are other maxima that work well for triangles.
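
The arithmetic in this quote can be checked with a few lines of code. This is a sketch of the allocation model as I read it from the quote, assuming one byte per primitive index, 4 reserved bytes and 128-byte blocks; the function names are hypothetical and the real hardware behaviour may differ.

```cpp
#include <cstdint>

// Allocation granularity stated in the quote.
constexpr uint32_t kBlockBytes = 128;

// Bytes needed for the indices of `triangles` triangles (assumed 1 byte per
// index) plus the 4 bytes reserved for the primitive count.
constexpr uint32_t index_bytes(uint32_t triangles) {
  return 3 * triangles + 4;
}

// Number of 128-byte blocks allocated for that many triangles.
constexpr uint32_t blocks_allocated(uint32_t triangles) {
  return (index_bytes(triangles) + kBlockBytes - 1) / kBlockBytes;
}

// 126 triangles need 382 bytes, which still fits in 3 blocks (384 bytes).
static_assert(index_bytes(126) == 382, "3 * 126 + 4");
static_assert(blocks_allocated(126) == 3, "fits in 3 blocks");
// One more triangle needs 385 bytes and spills into a 4th block.
static_assert(blocks_allocated(127) == 4, "spills into the next block");
```

With the same model, 84 triangles need exactly 256 bytes (2 blocks) and 85 triangles already need a third block.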

Here is a mesh shader that handles meshlets:

#version 450

#extension GL_NV_mesh_shader : require

layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(triangles, max_vertices = 64, max_primitives = 126) out;

// transform_ub: Uniform buffer for transformations
layout (std140, binding = 0) uniform uniforms_t
{
  mat4 ViewProjectionMatrix;
  mat4 ModelMatrix;
} transform_ub;

// vb: storage buffer for vertices.
struct s_vertex
{
  vec4 position;
  vec4 color;
};

layout (std430, binding = 1) buffer _vertices
{
  s_vertex vertices[];
} vb;

// mbuf: storage buffer for meshlets.
struct s_meshlet
{
  uint vertices[64];
  uint indices[378]; // up to 126 triangles
  uint vertex_count;
  uint index_count;
};

layout (std430, binding = 2) buffer _meshlets
{
  s_meshlet meshlets[];
} mbuf;

// Mesh shader output block.
layout (location = 0) out PerVertexData
{
  vec4 color;
} v_out[];   // [max_vertices]

// Color table for drawing each meshlet with a different color.
#define MAX_COLORS 10
vec3 meshletcolors[MAX_COLORS] = {
  // ... (the MAX_COLORS color entries are missing from this listing)
};

void main()
{
  uint mi = gl_WorkGroupID.x;
  uint thread_id = gl_LocalInvocationID.x;

  uint vertex_count = mbuf.meshlets[mi].vertex_count;
  for (uint i = 0; i < vertex_count; ++i)
  {
    uint vi = mbuf.meshlets[mi].vertices[i];

    vec4 Pw = transform_ub.ModelMatrix * vb.vertices[vi].position;
    vec4 P = transform_ub.ViewProjectionMatrix * Pw;

    // GL->VK conventions...
    P.y = -P.y; P.z = (P.z + P.w) / 2.0;

    gl_MeshVerticesNV[i].gl_Position = P;

    v_out[i].color = vb.vertices[vi].color * vec4(meshletcolors[mi%MAX_COLORS], 1.0);
  }

  uint index_count = mbuf.meshlets[mi].index_count;
  gl_PrimitiveCountNV = uint(index_count) / 3;

  for (uint i = 0; i < index_count; ++i)
  {
    gl_PrimitiveIndicesNV[i] = uint(mbuf.meshlets[mi].indices[i]);
  }
}
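
A word about the "GL->VK conventions" line in the shader above. In OpenGL clip space, the visible depth range is -w <= z <= w, while Vulkan expects 0 <= z <= w, and the Y axis points the other way. The small CPU-side sketch below applies the same two operations to a point; the Vec4 type and the gl_to_vk_clip name are hypothetical and only there for illustration.

```cpp
// Minimal 4-component vector, just for this illustration.
struct Vec4 {
  float x, y, z, w;
};

// Same two operations as in the mesh shader:
// flip Y, then remap z from [-w, w] to [0, w].
Vec4 gl_to_vk_clip(Vec4 p) {
  p.y = -p.y;
  p.z = (p.z + p.w) / 2.0f;
  return p;
}
```

With w = 2, a point on the OpenGL near plane (z = -2) ends up at z = 0 and a point on the far plane (z = 2) stays at z = 2, which is the Vulkan range [0, w].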

Each meshlet is rendered by one work group. So if you have num_meshlets meshlets to render, the draw call could be:

vkCmdBindPipeline(cmdbuf, VK_PIPELINE_BIND_POINT_GRAPHICS, mesh_pipeline);
uint32_t num_workgroups = num_meshlets;
vkCmdDrawMeshTasksNV(cmdbuf, num_workgroups, 0);

The previous mesh shader comes from this Vulkan demo:

Mesh Shaders - Meshlets demo in Vulkan

Turing GPUs are limited to 65535 mesh tasks per draw call (OpenGL: GL_MAX_DRAW_MESH_TASKS_COUNT_NV; Vulkan: maxDrawMeshTasksCount). In his Framework 4, Humus works around this limit to render more than 65535 meshlets:

// MaxDrawMeshTasksCount is currently set very low in NVIDIA drivers,
// only 65535, so we may have to issue multiple calls if count is larger than that.
const uint max_count = device->MaxDrawMeshTasksCount;
while (count > max_count)
{
  vkCmdDrawMeshTasks(commandBuffer, max_count, start);
  start += max_count;
  count -= max_count;
}
vkCmdDrawMeshTasks(commandBuffer, count, start);
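
To see which draw calls this loop issues, here is the same logic written as a plain function that returns the (count, first) pairs instead of recording Vulkan commands. It is only an illustration: DrawRange and split_draws are hypothetical names, not part of Humus's framework or of Vulkan.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One draw call: `count` mesh tasks starting at task index `first`.
struct DrawRange {
  uint32_t count;
  uint32_t first;
};

// Splits `count` mesh tasks into draws of at most `max_count` tasks each,
// following the same loop as above.
std::vector<DrawRange> split_draws(uint32_t count, uint32_t max_count) {
  assert(max_count > 0);
  std::vector<DrawRange> ranges;
  uint32_t first = 0;
  while (count > max_count) {
    ranges.push_back({max_count, first});
    first += max_count;
    count -= max_count;
  }
  // Remaining tasks (this last range may be smaller than max_count).
  ranges.push_back({count, first});
  return ranges;
}
```

For example, 150000 meshlets with a limit of 65535 give three draws: 65535 tasks from 0, 65535 tasks from 65535, and the remaining 18930 tasks from 131070.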

I will update this article with a simple example of a task shader as soon as possible...



If you have other links on mesh shaders, post them in comments, I will update this list of references.