This year has been an incremental one for the Metal framework. The Metal Shading Language (MSL) is now at version 2.4; however, the GPU families and feature set tables are unchanged since last year, with the Apple7 GPU family including the M1. Let’s get to all the exciting additions at WWDC21, starting with the new ray tracing features.
Normally, you have a compute pass after each render pass, and intermediate render attachments are saved to system memory after each pass so the passes can share data. Apple GPUs use tile memory to hold the data being processed by the current pass. At the end of the pass, tile memory is copied to system memory, and before the next pass begins, tile memory is loaded again from system memory:

However, this is not the most performant way to intersperse render and compute work because of the repeated copies to and from system memory. With this year’s additions to the API, you can now integrate your ray tracing code directly into your render pass and perform only one write to system memory at the end:

Setting up the render pipeline for ray tracing is similar to setting up a compute pipeline. With the new API, you can create an intersection function table from the render pipeline state and then use the linked functions by calling their handles from the table:
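Here is a minimal Swift sketch of that flow. The variables (device, vertexFunction, fragmentFunction, intersectionFunction) are assumed to already exist; they are illustrative, not from the article:

```swift
import Metal

// Link the intersection function into the render pipeline.
let linkedFunctions = MTLLinkedFunctions()
linkedFunctions.functions = [intersectionFunction]

let pipelineDescriptor = MTLRenderPipelineDescriptor()
pipelineDescriptor.vertexFunction = vertexFunction
pipelineDescriptor.fragmentFunction = fragmentFunction
pipelineDescriptor.fragmentLinkedFunctions = linkedFunctions
let pipeline = try device.makeRenderPipelineState(descriptor: pipelineDescriptor)

// Create an intersection function table from the *render* pipeline state
// and fill it with handles to the linked functions.
let tableDescriptor = MTLIntersectionFunctionTableDescriptor()
tableDescriptor.functionCount = 1
let table = pipeline.makeIntersectionFunctionTable(descriptor: tableDescriptor,
                                                   stage: .fragment)!
let handle = pipeline.functionHandle(function: intersectionFunction,
                                     stage: .fragment)!
table.setFunction(handle, index: 0)
```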

Also new this year is the way intersections are processed. In the old API, you used an intersector to traverse the acceleration structure and find the closest intersection:
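In MSL this looks roughly as follows; the kernel arguments (r, accelerationStructure, functionTable) are assumed to be in scope:

```cpp
#include <metal_stdlib>
using namespace metal;
using namespace raytracing;

// Inside a compute kernel: classic intersector-based traversal.
intersector<triangle_data, instancing> i;
intersection_result<triangle_data, instancing> result =
    i.intersect(r, accelerationStructure, functionTable);

if (result.type == intersection_type::triangle) {
    // Closest hit found: result.distance, result.primitive_id,
    // result.instance_id, result.triangle_barycentric_coord, ...
}
```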

However, with this method you had to call an intersection function any time a custom intersection test was required, which meant creating a new intersection function and linking it to the pipeline. With the new API, you can instead use the intersection query method, which lets you write the test in-line without creating an intersection function:
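A sketch of the in-line style, using the intersection_query interface from MSL 2.4 (the alpha test here is a hypothetical placeholder for whatever in-line test your scene needs):

```cpp
// Traversal control stays in the calling function.
intersection_params params;
intersection_query<triangle_data, instancing> query(r, accelerationStructure, params);

while (query.next()) {
    // A candidate hit was found; evaluate it in-line instead of in a
    // separately linked intersection function.
    if (query.get_candidate_intersection_type() == intersection_type::triangle) {
        if (alphaTestPassed(query.get_candidate_primitive_id())) { // placeholder
            query.commit_triangle_intersection();
        }
    }
}

if (query.get_committed_intersection_type() == intersection_type::triangle) {
    float distance = query.get_committed_distance();
}
```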

The greatest advantage, however, is that the two approaches are not exclusive: you can still pick whichever fits each specific case, depending on the combination of factors your code is exposed to:

Also new this year are user-defined instance IDs, which differ from the system-provided instance IDs in that they are customizable: you can store any 32-bit value per instance (a packed per-instance color, for example). This user ID is a new property of the instance descriptor. Similarly, instance transforms are also new this year: you can now access an instance’s transformation matrices directly in the shader, as an instance property.
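On the shader side, both are exposed through the intersection result; accessing the transforms requires the world_space_data tag. A sketch:

```cpp
// Request world-space transform data along with instancing.
intersector<instancing, world_space_data> i;
intersection_result<instancing, world_space_data> result =
    i.intersect(r, accelerationStructure);

if (result.type != intersection_type::none) {
    uint userID = result.user_instance_id;                     // user-defined ID
    float4x3 objectToWorld = result.object_to_world_transform;
    float4x3 worldToObject = result.world_to_object_transform;
}
```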
In terms of production rendering, this year the API introduces extended limits, which you opt in to by adding the extended_limits tag to the intersector object:
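For example, a one-line change on the shader side:

```cpp
// Opt in to larger scene limits (more primitives, instances, and
// motion keyframes) at a small traversal cost.
intersector<triangle_data, instancing, extended_limits> i;
```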

Motion blur is perhaps the most visually pleasing feature introduced this year. As you know, you can have both primitive motion and instance motion; the former is more expensive. For instance motion, you can store separate transformation matrices in the acceleration structure for moving instances as well as for static ones. In this case, the sphere has two matrices, for the start and end of the bouncing animation:

The previous instance descriptor only allowed one transformation matrix per instance. With the new motion instance descriptor, you can now connect the instance descriptor to a transform buffer and thus allow multiple transformations per instance:
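A rough Swift sketch of the host-side setup; the buffers and counts are illustrative:

```swift
import Metal

// One motion instance descriptor per instance; the bouncing sphere
// references two keyframe transforms in the transform buffer.
var sphere = MTLAccelerationStructureMotionInstanceDescriptor()
sphere.accelerationStructureIndex = 0
sphere.transformsStartIndex = 0   // first keyframe in the transform buffer
sphere.transformsCount = 2        // start and end of the bounce

let descriptor = MTLInstanceAccelerationStructureDescriptor()
descriptor.instanceDescriptorType = .motion
descriptor.instanceCount = 1
descriptor.instanceDescriptorBuffer = instanceBuffer // contains `sphere`
descriptor.motionTransformBuffer = transformBuffer   // packed keyframe matrices
descriptor.motionTransformCount = 2
```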

For primitive motion you get more granular control over the animation, but at a higher cost. In the shader, you specify the primitive_motion tag when animating primitives and the instance_motion tag when animating instances.
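For example (a sketch; the time value comes from the ray’s sample time):

```cpp
// The motion tags tell the intersector which kinds of motion to expect;
// the time parameter selects where along the keyframes to interpolate.
intersector<triangle_data, instancing, instance_motion> i;
intersection_result<triangle_data, instancing, instance_motion> result =
    i.intersect(r, accelerationStructure, time);
```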
Rasterized shadows usually take the form of shadow maps, the most common technique. You first render from the light’s perspective, for all the lights in the scene. Then, for each light, you process all pixels in the scene to determine whether the current light affects the current pixel. This method is not optimal: you process the scene multiple times, you get aliasing because the shadow map has a predetermined resolution, and you are missing information about the rest of the scene (pixels that didn’t make it into the shadow map):

For ray-traced shadows you start by rendering from the camera toward a light source and checking whether any object is in the path, building the depth map with the information about objects found along the way. If no object is in the path, you retain this light source for the later shading calculation. Next, you pass the depth map and acceleration structure to the compute shader. Here, you calculate the pixel position, then trace a ray in each light direction and determine whether the pixel is in shadow. The shadow texture produced in this step can then be combined with the one from the render pass. This method produces more natural shadows and also provides information about pixels that are outside the light’s view or the camera’s view.
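A condensed MSL sketch of the compute step; worldPositionFromDepth is a hypothetical helper that reconstructs the pixel’s world position from the depth map:

```cpp
kernel void shadowKernel(uint2 tid [[thread_position_in_grid]],
                         depth2d<float> depth [[texture(0)]],
                         texture2d<float, access::write> shadow [[texture(1)]],
                         instance_acceleration_structure accel [[buffer(0)]],
                         constant float3 &lightDir [[buffer(1)]])
{
    // Reconstruct the pixel's world-space position from the depth map.
    float3 worldPos = worldPositionFromDepth(depth, tid); // hypothetical helper

    // Trace a ray from the surface toward the light.
    ray r;
    r.origin = worldPos + 1e-3f * lightDir; // offset to avoid self-intersection
    r.direction = lightDir;
    r.max_distance = INFINITY;

    intersector<instancing> i;
    i.accept_any_intersection(true); // any hit at all means "in shadow"
    auto result = i.intersect(r, accel);

    float lit = (result.type == intersection_type::none) ? 1.0f : 0.0f;
    shadow.write(float4(lit), tid);
}
```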

Rasterized ambient occlusion samples the depth map and surface normals of nearby pixels to determine whether there are objects that might occlude the ambient light arriving at the current pixel. An attenuation factor is calculated based on the number of occluding objects found nearby, and this information is stored in a texture that can be combined into the resulting image. As with rasterized shadows, being a screen-space technique, rasterized ambient occlusion also suffers from missing information about pixels outside the image.

Ray-traced ambient occlusion does not rely on screen-space information but on the actual geometry in the scene. As with ray-traced shadows, you pass an acceleration structure to the compute shader, along with the depth and normal information available in the G-Buffer. Depth and normals are needed to generate random rays inside a hemisphere centered at the current pixel. You then trace these rays to find nearby occluders that contribute to the attenuation factor.
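A sketch of the core loop; sampleCosineHemisphere is a hypothetical sampler, and any hemisphere distribution would do:

```cpp
// Trace a few short rays inside the hemisphere around the normal;
// every hit is a nearby occluder that darkens this pixel.
float occlusion = 0.0f;
intersector<instancing> i;
i.accept_any_intersection(true);

for (uint s = 0; s < sampleCount; ++s) {
    ray r;
    r.origin = worldPos + 1e-3f * normal;
    r.direction = sampleCosineHemisphere(normal, rng); // hypothetical helper
    r.max_distance = aoRadius; // only nearby geometry occludes

    if (i.intersect(r, accel).type != intersection_type::none) {
        occlusion += 1.0f;
    }
}
float ambientFactor = 1.0f - occlusion / float(sampleCount);
```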

Rasterized reflections have always been notoriously difficult. One method involves using reflection probes pointed in six directions to build a cube map. However, this technique suffers from limited resolution, requires pre-filtering to accurately reflect irradiance, and struggles with dynamic scenes:

Another method is screen-space reflection, which overcomes some of these drawbacks by using information from pixels already in the framebuffer. However, this technique carries the limitations of screen-space techniques mentioned earlier, and the ray marching involved can be computationally expensive:

Ray-traced reflections can overcome all these problems by using the scene geometry information from the acceleration structure, as well as the depth and normals already saved in the G-Buffer. Using all this information in the compute shader, you calculate the view vector from the camera to the current point, reflect it about the normal, trace the reflected ray from the point, check for intersections, and shade the pixel accordingly:
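In the compute shader that boils down to a few lines (shading of the hit point is elided):

```cpp
// Reflect the view ray about the surface normal and trace it.
float3 viewDir = normalize(worldPos - cameraPos);
float3 reflected = reflect(viewDir, normal);

ray r;
r.origin = worldPos + 1e-3f * normal;
r.direction = reflected;
r.max_distance = INFINITY;

intersector<triangle_data, instancing> i;
auto result = i.intersect(r, accel);

if (result.type != intersection_type::none) {
    // Shade the hit point and write it into the reflection texture.
}
```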

Bindless is a modern binding model that gives shaders access to all of a scene’s resources on the GPU. By aggregating and linking the resources together, it becomes possible to bind a single buffer to the pipeline and reach whatever resources are needed at any given time by navigating that buffer. The bindless model is implemented in Metal with the argument buffers tier 2 construct, which is available on the Apple6 (devices with the A13 chip or newer) and Mac2 (Macs newer than 2015) GPU families.
A complex scene can be stored in an argument buffer containing all instances of an object, as well as the information about materials and meshes for each instance. You can pass the argument buffer and treat it as an array, or you can pass a pointer to it in another buffer. As an extra option, you can present the buffer as an MTLHeap if you need to sub-allocate resources from heaps:
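The shader-side view of such a scene might look like this; the struct layout is illustrative, not the article’s exact scheme:

```cpp
// Everything the shaders need, reachable from a single bound buffer.
// Tier 2 argument buffers allow pointers and textures inside structs.
struct Mesh {
    device packed_float3 *positions;
    device packed_float3 *normals;
    device uint          *indices;
};

struct Material {
    texture2d<float> baseColor;
    texture2d<float> normalMap;
    float            roughness;
};

struct Instance {
    device Mesh     *mesh;
    device Material *material;
    float4x4         transform;
};

struct Scene {
    device Instance *instances; // indexed by instance ID
    uint             instanceCount;
};
```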

Once you have your argument buffer configured for bindless rendering, you can proceed to creating your argument encoder. One way to do it is via reflection: if the argument buffer is passed to the shader function, you can ask the MTLFunction object to create the encoder. This method does not fit cases where buffers are referenced indirectly, where the resource buffers are created separately from the pipeline state, or simply where the function expects an array as an argument:
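A sketch of the reflection path (buffer index 0 is illustrative):

```swift
// Reflection-based: ask the function that consumes the argument
// buffer to create the matching encoder.
let encoder = function.makeArgumentEncoder(bufferIndex: 0)
let argumentBuffer = device.makeBuffer(length: encoder.encodedLength,
                                       options: .storageModeShared)!
encoder.setArgumentBuffer(argumentBuffer, offset: 0)
```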

The second way to create an argument encoder is by using a MTLArgumentDescriptor, which lets you describe the struct members and create the encoder without a MTLFunction. First, you create a descriptor for each struct member and specify its data type and binding index:
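For example (the member types are illustrative and must match the shader-side struct):

```swift
// One descriptor per struct member: data type plus binding index.
let instancesArg = MTLArgumentDescriptor()
instancesArg.index = 0
instancesArg.dataType = .pointer   // device Instance *instances

let countArg = MTLArgumentDescriptor()
countArg.index = 1
countArg.dataType = .uint          // uint instanceCount
```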

Next, you pass the descriptors directly to the MTLDevice to create the encoder:
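Continuing the sketch:

```swift
// The device creates the encoder straight from the descriptors;
// no MTLFunction is involved.
let encoder = device.makeArgumentEncoder(arguments: [instancesArg, countArg])!
let argumentBuffer = device.makeBuffer(length: encoder.encodedLength,
                                       options: .storageModeShared)!
encoder.setArgumentBuffer(argumentBuffer, offset: 0)
```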

Finally, the encoder is passed back:

Once you have the encoder created, it’s easy to encode your scene data into the argument buffer by offsetting into it for each instance i:
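A sketch of the per-instance loop, assuming the Scene/Instance layout sketched earlier:

```swift
// Encode each instance at its slot in the argument buffer array.
for i in 0..<instanceCount {
    encoder.setArgumentBuffer(argumentBuffer, startOffset: 0, arrayElement: i)
    encoder.setBuffer(meshBuffers[i], offset: 0, index: 0) // mesh data
    encoder.setTexture(baseColorTextures[i], index: 1)     // material
    // ... remaining mesh and material properties per instance
}
```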

Finally, since the bindless scene is now encoded, you can navigate it in the shaders. You first look for intersections with the scene; if one is found, you can query the instance_id, the geometry_id, or even the primitive_id inside your kernel:
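For example, picking up the Scene struct from before:

```cpp
// Navigate the bindless scene from an intersection result.
auto result = i.intersect(r, accel);
if (result.type != intersection_type::none) {
    device Instance &inst = scene.instances[result.instance_id];
    device Material &mat  = *inst.material;
    device Mesh     &mesh = *inst.mesh;
    // result.geometry_id and result.primitive_id can then index into
    // the mesh to fetch the hit triangle's vertex data.
}
```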

Bindless rendering is not only useful for ray tracing but also for a more efficient rasterization of physically-based rendering (PBR) models with various and complex material properties.
Also new this year is the ray tracing debugger. You can dig into your passes and select an acceleration structure, for example, to open the acceleration structure viewer. At the top left of the viewer you can navigate all the way down to the primitive level:

You can also see the intersection functions used for the selected geometry:

At the bottom right side of the viewer, you can see the view modes:

You can also view and change the scene traversal settings:

For profiling, this year brings us the new GPU Timeline in the Metal debugger, which combines the timeline from the Metal System Trace with the real-time information provided by the GPU Counters:

Instruments gains a new track in the Metal System Trace this year, called GPU Performance State, which lets you induce various workloads and test whether your GPU can sustain them. You can then use the saved trace for further profiling in the Metal debugger. This also lets you test the performance of your profiled code on other devices: go to the Window menu in Xcode, click Devices and Simulators, choose a device from the list, and select the desired performance state:

In terms of new debugging features, shader validation is extended this year to indirect command buffers, dynamic libraries, function pointers, and function tables. Also new are the precise capture controls, which let you capture any number of frames between 1 and 5, as well as capture buffers and other resources, including devices:

Another useful feature, the new pipeline state workflows, provides more granular control over your pipeline states, letting you see information about their resources. The new separate debug information feature allows you to generate separate .metallibsym files that contain the debugging and profiling information, without the need to embed it in the .metallib libraries or keep two versions of the libraries:

Finally, selective shader debugging allows you to debug large shaders by narrowing the debug scope down to the particular functions you want to pay attention to.
Moving on to GPU texture compression: the TextureConverter compression pipeline has been completely redesigned into a multi-stage, fully configurable pipeline, and the conversion tool is available for both macOS and Windows. Visual data, such as the colors produced by geometry and lights, should be encoded in non-linear gamma space (like sRGB), while non-visual data, such as normals, should be encoded in linear space:

The linear space contains three main stages, each with its own substages. The transform stage contains two substages:

The next stage, mipmap generation, requires that you provide a maximum mipmap count and a filter to use. Finally, the alpha handling stage also has two substages:

After processing in linear space ends, the final stage is the gamma space, where compression takes place; it has two substages:

The texture compression families are as follows:

Finally, this year we get a new compilation workflow that makes it easier to balance compilation times with shader performance. Also new is support for dynamic libraries for render and tile pipelines, in addition to the support for compute pipelines introduced last year. Function pointer support likewise extends to render and tile pipelines.

Binary archives now also support storing visible and intersection functions, as well as caching function pointers:
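A hedged Swift sketch of caching a function into an archive; the function name is illustrative:

```swift
// Add a visible/intersection function to a binary archive so its
// compiled binary can be reused across runs.
let archive = try device.makeBinaryArchive(descriptor: MTLBinaryArchiveDescriptor())

let functionDescriptor = MTLFunctionDescriptor()
functionDescriptor.name = "sphereIntersection" // illustrative name
try archive.addFunction(descriptor: functionDescriptor, library: library)
```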

This year, Metal linked functions get support for private functions too:
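That is a one-property change on MTLLinkedFunctions (helperFunction is illustrative):

```swift
// Private functions are linked into this pipeline only and are not
// exported, which can reduce binary size and compile time.
let linkedFunctions = MTLLinkedFunctions()
linkedFunctions.functions = [visibleFunction]
linkedFunctions.privateFunctions = [helperFunction] // illustrative
```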

And lastly, this year we get support for function stitching, which allows you to dynamically generate functions at runtime, for cases like reacting to user input. It is available on all macOS and iOS GPU families. This tool is much more performant than generating Metal source strings, which was the only other way to achieve this before.
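A sketch of stitching a small graph together; the node function names ("add", "multiply") are illustrative and would be [[stitchable]] functions in the library:

```swift
// Stitch output = add(multiply(a, b), c) from existing functions.
let a = MTLFunctionStitchingInputNode(argumentIndex: 0)
let b = MTLFunctionStitchingInputNode(argumentIndex: 1)
let c = MTLFunctionStitchingInputNode(argumentIndex: 2)
let mul = MTLFunctionStitchingFunctionNode(name: "multiply",
                                           arguments: [a, b],
                                           controlDependencies: [])
let add = MTLFunctionStitchingFunctionNode(name: "add",
                                           arguments: [mul, c],
                                           controlDependencies: [])
let graph = MTLFunctionStitchingGraph(functionName: "stitchedFunction",
                                      nodes: [mul],
                                      outputNode: add,
                                      attributes: [])

let descriptor = MTLStitchedLibraryDescriptor()
descriptor.functions = [multiplyFn, addFn] // MTLFunctions from the library
descriptor.functionGraphs = [graph]
let stitchedLibrary = try device.makeLibrary(stitchedDescriptor: descriptor)
```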

That’s a wrap! All images in this article belong to Apple. For more information about what’s new in the Metal API, check out Apple’s WWDC21 sessions and documentation.
Until next time!