This year’s WWDC was probably the most important one ever, at least as far as we - the Metal developers - are concerned. I can wholeheartedly say it was the best week of my life!
Let’s get to the Games and Graphics news. The most unexpected trophy goes to the renaming of Metal to Metal 2. It has the most significant additions and enhancements since Metal was first announced in 2014, true, but let’s admit it: no one saw this one coming. The most anticipated trophy goes to the new ARKit framework. Only a few weeks after the keynote, there are already numerous bold and funny Augmented Reality projects out there. ARKit integrates with Metal easily. Finally, the most influential trophy goes to VR. It is because of Virtual Reality that we are now able to achieve lower latency, higher framerates, as well as more powerful internal and now also external GPUs.
New features were also added to the SceneKit frameworks. Other interesting additions are the Vision frameworks used for machine learning. This article focuses only on what’s new in Metal.
1). MPS - the Metal Performance Shaders are now also available on macOS. The new additions include:
- four new image processing primitives (including Element-wise Arithmetic Operations).
- new linear algebra objects such as MPSTemporaryMatrix, as well as BLAS-style matrix-matrix and matrix-vector multiplication and LAPACK-style triangular matrix factorization and linear solvers.
- a dozen new
Transpose convolutions were added to the already existing
- a new Neural Network Graph API was added, which is useful for describing neural networks using filter and image nodes.
- Recurrent Neural Networks now help overcome the CNNs' one-to-one limitation and implement one-to-many and many-to-many relationships.
2). Argument Buffers - likely the most important addition to the framework this year. In the traditional argument model, for each object we would call the various functions to set buffers, textures and samplers one after another, and then at the end we would issue the draw call for that object.
As you can imagine, the number of calls increases drastically once the calls per object are multiplied by the total number of objects and by the number of frames in which all these objects need to be drawn. As a consequence, this limits the number of objects that can eventually appear on the screen.
Argument Buffers introduce an efficient new way of configuring how to use resources by adopting the indirect behavior that the constants have, and applying it to textures, samplers, states, pointers to other buffers, and so on. With argument buffers there are now only 2 API calls per object: set the argument buffer and then draw. With this approach many more objects can be drawn.
Using argument buffers is as easy as matching the shader data with the host data.
On the CPU, the argument buffers are created and used by an MTLArgumentEncoder object, and they can also be copied with blit operations.
But it can get even better using the dynamic indexing feature. A great use case is rendering crowds. An array of argument buffers can pack the data together for all instances (characters). Then, instead of having two calls per object, we can now have only 2 API calls per frame: one to set the buffer and one to draw indexed primitives for a large instance count!
The GPU will process per-instance geometry and color. The shader will now take an array of argument buffers as input, dynamically pick the character for any instance index, and return the geometry for that object.
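To put rough numbers on the three models described above, here is a minimal bookkeeping sketch in plain Swift. It does not touch Metal at all; the function names, the crowd size and the number of resources per object are assumptions picked only for illustration.

```swift
// Hypothetical model: it only counts API calls, no Metal is involved.

/// Traditional model: one set call per resource, then one draw call, for every object.
func traditionalCalls(objects: Int, resourcesPerObject: Int) -> Int {
    return objects * (resourcesPerObject + 1)
}

/// Argument buffers: set the argument buffer, then draw, for every object.
func argumentBufferCalls(objects: Int) -> Int {
    return objects * 2
}

/// Array of argument buffers plus instancing: one set call and one instanced draw per frame.
func instancedArgumentBufferCalls() -> Int {
    return 2
}

let objects = 1_000          // assumed crowd size
let resourcesPerObject = 5   // assumed: e.g. 2 buffers, 2 textures, 1 sampler

print(traditionalCalls(objects: objects, resourcesPerObject: resourcesPerObject)) // 6000
print(argumentBufferCalls(objects: objects))                                       // 2000
print(instancedArgumentBufferCalls())                                              // 2
```

These are per-frame totals, so the gap widens further once they are multiplied by the framerate.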
Another use case for argument buffers is running particle simulations. For this we have the resource setting on the GPU feature, which refers to having an array of argument buffers, one buffer for each particle (thread). All the particle properties (position, material, and so on) are created and stored in argument buffers on the GPU, so when a particle needs a specific property, such as a material, it will copy it from the argument buffers instead of getting it from the CPU, thus avoiding expensive copies between them.
A copying kernel is straightforward: it lets you assign constant values and do partial or complete copies between a source and a destination object.
Finally, we also have the use case of referencing other argument buffers (multiple indirections). Imagine a structure to represent an instance (character) that will have a pointer to the Material structure such that many instances can point to the same material. Likewise, imagine another structure to represent a tree of nodes where each Node would have a pointer to the Instance structure which will act as an array of instances in the node.
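The relationships described above can be pictured with a host-side analogy in plain Swift. This is only a sketch of the data layout idea, not shader code: in Metal these would be argument buffer structures holding pointers, and every type and property name below is hypothetical.

```swift
// Host-side analogy only. Class references stand in for the pointers
// that argument buffers would hold on the GPU.

final class Material {
    var name: String
    init(name: String) { self.name = name }
}

struct Instance {
    let material: Material   // a reference: many instances may share one Material
    var x: Float
    var y: Float
}

final class Node {
    var instances: [Instance] = []   // the instances that live in this node
    var children: [Node] = []        // child nodes form the tree
}

let cloth = Material(name: "cloth")
let root = Node()
root.instances = [
    Instance(material: cloth, x: 0, y: 0),
    Instance(material: cloth, x: 1, y: 0),
]
root.children.append(Node())

// Both instances reach the same material through one level of indirection,
// so a single change is visible from each of them.
cloth.name = "armor"
print(root.instances[0].material.name, root.instances[1].material.name)
```

Sharing the material by reference rather than by copy is the point: the material is stored once and every instance just points at it.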
Note: for now, only Tier 2 devices support all these argument buffer features. GPU devices are now classified as either Tier 1 (integrated) or Tier 2.
3). Raster Order Groups - a new fragment shader synchronization primitive that allows more granular control of the order in which fragment shaders access memory. As an example, when working with custom blending, most graphics APIs guarantee that blending happens in draw call order. However, GPU thread parallelism needs a way to prevent race conditions. Raster Order Groups do that by providing implicit synchronization.
In traditional blending mode race conditions are created.
All that is needed is adding the Raster Order Groups attribute to the texture (or resource) with conflicting accesses.
4). ProMotion - only for iPad Pro displays currently. Without ProMotion the typical framerate is 60 FPS (16.6 ms/frame), while with ProMotion the framerate goes up to 120 FPS (8.3 ms/frame), which is really useful for user input such as touch gestures or pencil input.
ProMotion also gives us flexibility in when to refresh the display image, so we do not need to have a fixed framerate. Without ProMotion there is inconsistency in image refreshing, which does not bode well for the user experience. Developers usually trade away their peak framerate and constrain all frames to 30 FPS rather than the targeted 48 FPS (20.83 ms/frame), to achieve consistency.
With ProMotion we now have a refresh point every 4 ms rather than every 16 ms (the vertical white lines).
ProMotion also helps in cases of dropped frames. Without ProMotion we could have a frame that missed the deadline by taking too long to display.
ProMotion fixes this too by extending the frame by only 4 more ms instead of a whole frame (16.6 ms).
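The arithmetic behind both cases (the 48 FPS target and the late frame) can be checked with a few lines of plain Swift. This is a simplified sketch, assuming a frame can only be swapped at the next refresh point and that "every 4 ms" means 1/240 of a second; the function and constant names are made up for the example.

```swift
/// Time a frame stays on screen when it can only be swapped at refresh points.
/// Times are integer microseconds to avoid floating-point rounding.
func displayedDuration(renderTime: Int, refreshInterval: Int) -> Int {
    let refreshPoints = (renderTime + refreshInterval - 1) / refreshInterval  // round up
    return refreshPoints * refreshInterval
}

let fixedInterval = 16_667      // 60 Hz display: a refresh point about every 16.6 ms
let proMotionInterval = 4_167   // assumption: 1/240 s, the "every 4 ms" from the text

// A frame paced for 48 FPS (20.83 ms).
let frame48 = 20_833
print(displayedDuration(renderTime: frame48, refreshInterval: fixedInterval))      // 33334, about 30 FPS
print(displayedDuration(renderTime: frame48, refreshInterval: proMotionInterval))  // 20835, about 48 FPS

// A 60 FPS frame that just missed its 16.6 ms deadline.
let lateFrame = 17_000
print(displayedDuration(renderTime: lateFrame, refreshInterval: fixedInterval))     // 33334, a whole extra frame
print(displayedDuration(renderTime: lateFrame, refreshInterval: proMotionInterval)) // 20835, about 4 ms extra
```

On the fixed display the 48 FPS frame is rounded up to two refresh intervals, which is why developers settle for 30 FPS there; with the finer refresh points it is shown for almost exactly its intended 20.83 ms.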
UIKit animations use ProMotion automatically, but to use it with Metal views you need to opt in by disabling the minimum frame duration in the project’s Info.plist file. Then you can use one of the 3 presentation APIs. The traditional present(drawable:) will present the image immediately after the GPU has finished rendering the frame (16.6 ms on fixed framerate displays and 4 ms on ProMotion displays). The second API is present(drawable, afterMinimumDuration:) and provides maximum consistency from frame to frame on fixed framerate displays. The third API is present(drawable, atTime:) and is useful when building custom animation loops or when trying to sync the display image with other outputs such as audio. Here is an example of how to implement it:
First, set a time when you want to display the drawable, then render the scene into a command buffer, then wait for the next frame(s), and finally examine the delay so you can adjust the next frame time.
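The four steps above can be sketched without any Metal code as a small feedback loop. This is not the original listing: `FramePacer` and its members are hypothetical names, the adjustment policy (schedule the next frame relative to the actual present time) is just one possible choice, and times are in milliseconds for readability, whereas the real present call takes a time in seconds.

```swift
/// Hypothetical helper that mirrors the four steps without touching Metal.
/// All times are in milliseconds.
struct FramePacer {
    let frameInterval: Double   // desired time between presented frames
    var targetTime: Double      // step 1: when we want the drawable to be shown

    /// Steps 3 and 4: call this once the frame has actually been shown.
    /// Returns the delay and schedules the next target from the real present time.
    mutating func frameWasPresented(at presentedTime: Double) -> Double {
        let delay = presentedTime - targetTime
        targetTime = presentedTime + frameInterval
        return delay
    }
}

var pacer = FramePacer(frameInterval: 20, targetTime: 40)

// Step 2 would render the scene into a command buffer and hand
// pacer.targetTime to the present(drawable, atTime:) call.

let delay = pacer.frameWasPresented(at: 44)   // the frame appeared 4 ms late
print(delay, pacer.targetTime)                // 4.0 64.0
```

A real loop would probably smooth the delay over several frames before reacting to it, so that one late frame does not shift the whole schedule.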
5). Direct to Display - is the new way to send content from the renderer directly to external displays (e.g. head mounted devices used in VR) with the least amount of latency. There are two paths an image can take after the GPU has finished rendering it and before it ends up on the display. The first one is the typical UI scenario, where the system composites it with other views and layers into a final image.
When building a full screen application that does not require blending, scaling or other views/layers, the second path gives the display direct access to the memory we rendered to, thus saving a lot of system resources and avoiding a lot of overhead.
However, this only happens when certain conditions are met:
- the layer is opaque
- there is no masking or rounded corners
- full screen, or with opaque black bars and background
- the rendered size is at most as large as the display size
- the color space and pixel format are compatible with the display
The color space requirement makes it easier to know when Direct to Display mode will work. For example, it is easy to detect if you are using a P3 display and disable the P3 mode when trying to use the Direct to Display mode.
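The conditions listed above can be read as a simple checklist. The sketch below encodes them as a plain Swift predicate, which can be handy for asserting your own setup in debug builds. It is only an illustration: the system makes the actual decision, and the `LayerSetup` type and all of its fields are hypothetical names, not system API.

```swift
/// Hypothetical description of how a layer is configured; not a system type.
struct LayerSetup {
    var isOpaque: Bool
    var hasMaskOrRoundedCorners: Bool
    var isFullScreenOrHasOpaqueBlackBars: Bool
    var renderedWidth: Int
    var renderedHeight: Int
    var displayWidth: Int
    var displayHeight: Int
    var colorSpaceAndPixelFormatMatchDisplay: Bool
}

/// Returns true only when every condition from the list holds.
func meetsDirectToDisplayConditions(_ setup: LayerSetup) -> Bool {
    return setup.isOpaque
        && !setup.hasMaskOrRoundedCorners
        && setup.isFullScreenOrHasOpaqueBlackBars
        && setup.renderedWidth <= setup.displayWidth
        && setup.renderedHeight <= setup.displayHeight
        && setup.colorSpaceAndPixelFormatMatchDisplay
}

var setup = LayerSetup(isOpaque: true,
                       hasMaskOrRoundedCorners: false,
                       isFullScreenOrHasOpaqueBlackBars: true,
                       renderedWidth: 1920, renderedHeight: 1080,
                       displayWidth: 2560, displayHeight: 1440,
                       colorSpaceAndPixelFormatMatchDisplay: true)
print(meetsDirectToDisplayConditions(setup))   // true

setup.hasMaskOrRoundedCorners = true           // e.g. rounded corners were added
print(meetsDirectToDisplayConditions(setup))   // false
```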
6). Other Features - include but are not limited to:
- memory usage queries - there are now new APIs to query memory use per allocation, as well as total GPU memory allocated by the device.
- SIMDGroup scoped functions - allow data sharing between SIMD groups directly in the registers by avoiding load/store operations.
- non-uniform threadgroup sizes - help us not waste GPU cycles and avoid working on edge/bound cases.
- Viewport Arrays on macOS now support up to 16 viewports for the vertex function to choose from when rendering, which is useful for VR when combined with instancing.
- Multisample Pattern Control - allows selecting where within a pixel the MSAA sample patterns are located, which is useful for custom anti-aliasing.
- Resource Heaps are now also available on macOS. They allow controlling the time of memory allocation, fast reallocation, aliasing of resources, and grouping related resources for faster binding.
- other features include:
  - Create textures from a
  - Specialize bytecodes to change the binding index for shader arguments.
  - Add some 1-/2-component vertex formats and a
  - Additional blending modes with two source parameters.
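To illustrate the non-uniform threadgroup sizes item from the list above, here is a small calculation in plain Swift of how many threads a uniform dispatch launches when the grid has to be a whole number of threadgroups. The image size and the 16x16 threadgroup are assumptions chosen for the example.

```swift
/// Threads launched when each dimension is rounded up to a whole number of threadgroups.
func uniformThreadCount(width: Int, height: Int, groupWidth: Int, groupHeight: Int) -> Int {
    let groupsX = (width + groupWidth - 1) / groupWidth      // round up
    let groupsY = (height + groupHeight - 1) / groupHeight   // round up
    return groupsX * groupWidth * groupsY * groupHeight
}

let width = 1000, height = 600   // assumed image size
let launched = uniformThreadCount(width: width, height: height, groupWidth: 16, groupHeight: 16)
let needed = width * height

print(launched, needed, launched - needed)   // 612864 600000 12864
```

Those extra 12864 threads fall outside the image, so the kernel must bounds-check and return early for them. With non-uniform threadgroup sizes the dispatch (`dispatchThreads(_:threadsPerThreadgroup:)` on the compute encoder) covers exactly the grid, and the edge threadgroups are simply smaller.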
I made a table with the most important new features, which states whether the feature is new in the latest version of the operating system or not.
Finally, here are a few lines I wrote to test the differences between my integrated and discrete
All images were taken from WWDC presentations and the source code is posted on GitHub as usual.
Until next time!