Using ARKit with Metal

Augmented Reality provides a way of overlaying virtual content on top of real-world views, usually obtained from a mobile device camera. Last month at WWDC 2017 we were all thrilled to see Apple’s new ARKit framework, a high-level API that works on devices powered by an A9 chip or newer and running iOS 11. Some of the ARKit experiments we’ve already seen are outstanding.

There are three distinct layers in an ARKit application:

  • Tracking – no external setup is necessary to do world tracking using visual inertial odometry.
  • Scene Understanding – the ability of detecting scene attributes using plane detection, hit-testing and light estimation.
  • Rendering – can be easily integrated because of the template AR views provided by SpriteKit and SceneKit but it can also be customized for Metal. All the pre-render processing is done by ARKit which is also responsible for image capturing using AVFoundation and CoreMotion.

In this first part of the series we will look mostly at Rendering in Metal and leave the other two stages for the next part. In an AR application, Tracking and Scene Understanding are handled entirely by the ARKit framework, while Rendering can be handled by SpriteKit, SceneKit or Metal.

To get started, we need an ARSession instance that is set up by an ARSessionConfiguration object. Then, we call the run() function on the session, passing in this configuration. The session also has AVCaptureSession and CMMotionManager objects running at the same time to get image and motion data for tracking. Finally, the session outputs the current frame as an ARFrame object.

The ARSessionConfiguration object contains information about the type of tracking the session will have. The ARSessionConfiguration base configuration class provides 3 degrees of freedom tracking (the device orientation) while its subclass, ARWorldTrackingSessionConfiguration, provides 6 degrees of freedom tracking (the device position and orientation).

When a device does not support world tracking, it falls back to the base configuration:

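if ARWorldTrackingSessionConfiguration.isSupported {
    // 6 degrees of freedom: device position and orientation
    configuration = ARWorldTrackingSessionConfiguration()
} else {
    // 3 degrees of freedom: device orientation only
    configuration = ARSessionConfiguration()
}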

An ARFrame contains the captured image, tracking information, as well as scene information via ARAnchor objects, which hold a real-world position and orientation and can easily be added to, updated in or removed from a session. Tracking is the ability to determine the physical location in real time. World tracking, however, determines both position and orientation; it works with physical distances, is relative to the starting position and provides 3D feature points.

The last component of an ARFrame is the ARCamera object, which facilitates transforms (translation, rotation, scaling) and carries the tracking state and camera intrinsics. The quality of tracking relies heavily on uninterrupted sensor data and static scenes, and it is more accurate when the environment is textured with plenty of complexity. Tracking state has three values: Not Available (camera only has the identity matrix), Limited (scene has insufficient features or is not static enough) and Normal (camera is populated with data). Session interruptions are caused by camera input not being available or by tracking being stopped:

func session(_ session: ARSession, cameraDidChangeTrackingState camera: ARCamera) {
    if case .limited(let reason) = camera.trackingState {
        // Notify user of limited tracking state
    }
}
func sessionWasInterrupted(_ session: ARSession) {
}
func sessionInterruptionEnded(_ session: ARSession) {
    // Optionally restart experience
}

Rendering can be done in SceneKit using the ARSCNView’s delegate to add, update or remove nodes. Similarly, rendering can be done in SpriteKit using the ARSKView delegate, which maps SKNodes to ARAnchor objects. Since SpriteKit is 2D, it cannot use the real-world camera position, so it projects the anchor positions into the ARSKView and then renders the sprite as a billboard (plane) at this projected location, so the sprite will always be facing the camera. For Metal, there is no customized AR view, so that responsibility falls to the programmer. To render each frame we need to:

  • draw background camera image (generate a texture from the pixel buffer)
  • update the virtual camera
  • update the lighting
  • update the transforms for geometry

All this information is in the ARFrame object. To access the frame, there are two options: polling or using a delegate. We are going to describe the latter. I took the ARKit template for Metal and stripped it down to a minimum so I can better understand how it works. The first thing I did was to remove all the C dependencies, so bridging is not necessary anymore. It would be useful in the future to have bridging in place so types and enum constants can be shared between API code and shaders, but for the purpose of this article it is not needed.

Next, on to ViewController which will act as both our MTKView and ARSession delegates. We create a Renderer instance that will work with the delegates for real time updates to the application:

var session: ARSession!
var renderer: Renderer!
override func viewDidLoad() {
    super.viewDidLoad()
    session = ARSession()
    session.delegate = self
    if let view = self.view as? MTKView {
        view.device = MTLCreateSystemDefaultDevice()
        view.delegate = self
        renderer = Renderer(session: session, metalDevice: view.device!, renderDestination: view)
        renderer.drawRectResized(size: view.bounds.size)
    }
    let tapGesture = UITapGestureRecognizer(target: self, action: #selector(self.handleTap(gestureRecognize:)))
    // Attach the recognizer so taps on the view reach handleTap
    view.addGestureRecognizer(tapGesture)
}

As you can see, we also added a gesture recognizer which we will use to add virtual content to our view. We first get the session’s current frame, then create a translation to put our object in front of the camera (0.3 meters in this case) and finally add a new anchor to our session using this transform:

@objc func handleTap(gestureRecognize: UITapGestureRecognizer) {
    if let currentFrame = session.currentFrame {
        // Move 0.3 meters along the camera's negative z axis, i.e. in front of it
        var translation = matrix_identity_float4x4
        translation.columns.3.z = -0.3
        let transform = simd_mul(currentFrame.camera.transform, translation)
        let anchor = ARAnchor(transform: transform)
        session.add(anchor: anchor)
    }
}
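
To see why a translation of -0.3 along z ends up in front of the camera, here is a minimal sketch in plain Swift that runs without ARKit. The Matrix4 alias, the multiply() helper and the camera pose are made up for illustration; in the app, simd_mul and the frame's camera transform play these roles.

```swift
// Column-major 4x4 matrix stored as four columns, mirroring simd_float4x4.
typealias Matrix4 = [SIMD4<Float>]

let identity: Matrix4 = [
    SIMD4<Float>(1, 0, 0, 0),
    SIMD4<Float>(0, 1, 0, 0),
    SIMD4<Float>(0, 0, 1, 0),
    SIMD4<Float>(0, 0, 0, 1)
]

// Matrix product a * b: each result column is a applied to the matching column of b.
func multiply(_ a: Matrix4, _ b: Matrix4) -> Matrix4 {
    return b.map { (c: SIMD4<Float>) -> SIMD4<Float> in
        var result = a[0] * c.x
        result += a[1] * c.y
        result += a[2] * c.z
        result += a[3] * c.w
        return result
    }
}

// Hypothetical camera pose: rotated 90 degrees about the y axis, located at (1, 0, 2).
let cameraTransform: Matrix4 = [
    SIMD4<Float>(0, 0, -1, 0),
    SIMD4<Float>(0, 1, 0, 0),
    SIMD4<Float>(1, 0, 0, 0),
    SIMD4<Float>(1, 0, 2, 1)
]

// Same recipe as handleTap: identity with z translation of -0.3, applied in camera space.
var translation = identity
translation[3].z = -0.3

let anchorTransform = multiply(cameraTransform, translation)
let anchorPosition = anchorTransform[3]   // last column holds the world position
```

With this pose the camera looks down the world's negative x axis, so the anchor lands at roughly (0.7, 0, 2), which is 0.3 meters ahead of the camera. Multiplying in the other order would translate along the world z axis instead and ignore where the camera is pointing.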

We use the viewWillAppear() and viewWillDisappear() methods to start and pause the session:

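override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)
    let configuration = ARWorldTrackingSessionConfiguration()
    // Start (or resume) the session with world tracking
    session.run(configuration)
}
override func viewWillDisappear(_ animated: Bool) {
    super.viewWillDisappear(animated)
    session.pause()
}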

All that is left are the delegate methods, which we need in order to react to view updates or to session errors and interruptions:

func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize) {
    renderer.drawRectResized(size: size)
}
func draw(in view: MTKView) {
    // Called once per frame by the MTKView
    renderer.update()
}
func session(_ session: ARSession, didFailWithError error: Error) {}
func sessionWasInterrupted(_ session: ARSession) {}
func sessionInterruptionEnded(_ session: ARSession) {}

Let’s move to the Renderer.swift file now. The first thing to notice is the use of a very handy protocol that will give us access to all the MTKView properties we need for the draw call later:

protocol RenderDestinationProvider {
    var currentRenderPassDescriptor: MTLRenderPassDescriptor? { get }
    var currentDrawable: CAMetalDrawable? { get }
    var colorPixelFormat: MTLPixelFormat { get set }
    var depthStencilPixelFormat: MTLPixelFormat { get set }
    var sampleCount: Int { get set }
}

Now you can simply extend the MTKView class (in ViewController) so it conforms to this protocol:

extension MTKView : RenderDestinationProvider {}
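
The empty extension works because the class already declares every property the protocol asks for. A minimal sketch of the same pattern in plain Swift, where FrameworkView and SampleCountProvider are hypothetical stand-ins for MTKView and RenderDestinationProvider:

```swift
// Stand-in for a framework class we do not own (playing the role of MTKView).
final class FrameworkView {
    var sampleCount: Int = 1
    var label: String = "main view"
}

// The small surface the renderer actually needs to see.
protocol SampleCountProvider {
    var sampleCount: Int { get set }
}

// FrameworkView already has a matching property, so the extension body stays empty.
extension FrameworkView: SampleCountProvider {}

// Code written against the protocol never mentions the concrete class.
func describe(_ destination: SampleCountProvider) -> String {
    return "sampleCount = \(destination.sampleCount)"
}

let description = describe(FrameworkView())
```

The benefit is that the renderer depends only on the protocol, so any other type exposing the same properties could serve as a render destination.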

To have a high level view of the Renderer class, here is the pseudocode:

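init() { setupPipeline(); setupAssets() }
func update() {
    updateBufferStates(); updateSharedUniforms(); updateAnchors(); updateCapturedImageTextures(); updateImagePlane(); drawCapturedImage(); drawAnchorGeometry()
}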

As always, we first set up the pipeline, here with the setupPipeline() function. Then, in setupAssets() we create our model which will be loaded every time we use our tap gesture recognizer. The MTKView delegate will call the update() function for the needed updates and draw calls. Let’s look at each of them in detail. First we have updateBufferStates() which updates the locations we write to in our buffers for the current frame (we use a ring buffer with 3 slots in this case):

func updateBufferStates() {
    // Advance to the next slot of the ring buffer
    uniformBufferIndex = (uniformBufferIndex + 1) % maxBuffersInFlight
    sharedUniformBufferOffset = alignedSharedUniformSize * uniformBufferIndex
    anchorUniformBufferOffset = alignedInstanceUniformSize * uniformBufferIndex
    sharedUniformBufferAddress = sharedUniformBuffer.contents().advanced(by: sharedUniformBufferOffset)
    anchorUniformBufferAddress = anchorUniformBuffer.contents().advanced(by: anchorUniformBufferOffset)
}
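
To make the ring buffer arithmetic concrete, here is a small sketch in plain Swift. The struct size of 176 bytes and the alignedTo256() helper are assumptions of mine for illustration; the real sizes depend on the uniform structs, and I am assuming a 256-byte alignment for each slot.

```swift
let maxBuffersInFlight = 3

// Hypothetical size in bytes of one uniforms struct (the real value would
// come from MemoryLayout<SharedUniforms>.size).
let sharedUniformSize = 176

// Round a size up to the next multiple of 256 bytes (assumed slot alignment).
func alignedTo256(_ size: Int) -> Int {
    return (size + 255) & ~255
}

let alignedSharedUniformSize = alignedTo256(sharedUniformSize)

// Simulate five consecutive frames and record the offset each one writes to.
var uniformBufferIndex = 0
var offsets: [Int] = []
for _ in 0..<5 {
    uniformBufferIndex = (uniformBufferIndex + 1) % maxBuffersInFlight
    offsets.append(alignedSharedUniformSize * uniformBufferIndex)
}
// offsets is now [256, 512, 0, 256, 512]
```

Each frame writes to a different slot than the previous two, so the CPU can fill the next set of uniforms while the GPU may still be reading the earlier ones.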

Next, in updateSharedUniforms() we update the shared uniforms of the frame and set up lighting for the scene:

func updateSharedUniforms(frame: ARFrame) {
    let uniforms = sharedUniformBufferAddress.assumingMemoryBound(to: SharedUniforms.self)
    // The view matrix is the inverse of the camera's world transform
    uniforms.pointee.viewMatrix = simd_inverse(frame.camera.transform)
    uniforms.pointee.projectionMatrix = frame.camera.projectionMatrix(withViewportSize: viewportSize, orientation: .landscapeRight, zNear: 0.001, zFar: 1000)
    // Scale the lights by ARKit's estimate, where 1000 means neutral lighting
    var ambientIntensity: Float = 1.0
    if let lightEstimate = frame.lightEstimate {
        ambientIntensity = Float(lightEstimate.ambientIntensity) / 1000.0
    }
    let ambientLightColor: vector_float3 = vector3(0.5, 0.5, 0.5)
    uniforms.pointee.ambientLightColor = ambientLightColor * ambientIntensity
    var directionalLightDirection : vector_float3 = vector3(0.0, 0.0, -1.0)
    directionalLightDirection = simd_normalize(directionalLightDirection)
    uniforms.pointee.directionalLightDirection = directionalLightDirection
    let directionalLightColor: vector_float3 = vector3(0.6, 0.6, 0.6)
    uniforms.pointee.directionalLightColor = directionalLightColor * ambientIntensity
    uniforms.pointee.materialShininess = 30
}

Next, in updateAnchors() we update the anchor uniform buffer with transforms of the current frame’s anchors:

func updateAnchors(frame: ARFrame) {
    anchorInstanceCount = min(frame.anchors.count, maxAnchorInstanceCount)
    // When there are more anchors than instances, skip the first ones
    var anchorOffset: Int = 0
    if anchorInstanceCount == maxAnchorInstanceCount {
        anchorOffset = max(frame.anchors.count - maxAnchorInstanceCount, 0)
    }
    for index in 0..<anchorInstanceCount {
        let anchor = frame.anchors[index + anchorOffset]
        // Flip the z axis of the geometry before applying the anchor transform
        var coordinateSpaceTransform = matrix_identity_float4x4
        coordinateSpaceTransform.columns.2.z = -1.0
        let modelMatrix = simd_mul(anchor.transform, coordinateSpaceTransform)
        let anchorUniforms = anchorUniformBufferAddress.assumingMemoryBound(to: InstanceUniforms.self).advanced(by: index)
        anchorUniforms.pointee.modelMatrix = modelMatrix
    }
}
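
The offset logic is easier to follow with numbers. Below is a sketch in plain Swift of just the index selection; visibleAnchorRange() is a hypothetical helper, and the claim that the skipped anchors are the oldest ones assumes frame.anchors keeps them in insertion order, which I have not verified.

```swift
// Returns the index range of anchors that would be drawn: at most `capacity`
// of them, taken from the end of the anchor list.
func visibleAnchorRange(anchorCount: Int, capacity: Int) -> Range<Int> {
    let instanceCount = min(anchorCount, capacity)
    var offset = 0
    if instanceCount == capacity {
        offset = max(anchorCount - capacity, 0)
    }
    return offset..<(offset + instanceCount)
}

let fewAnchors = visibleAnchorRange(anchorCount: 3, capacity: 64)    // 0..<3
let exactFit = visibleAnchorRange(anchorCount: 64, capacity: 64)     // 0..<64
let tooMany = visibleAnchorRange(anchorCount: 70, capacity: 64)      // 6..<70
```

So once the instance buffer is full, the window slides forward and the first anchors in the list simply stop being drawn.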

Next, in updateCapturedImageTextures() we create two textures from the provided frame’s captured image:

func updateCapturedImageTextures(frame: ARFrame) {
    let pixelBuffer = frame.capturedImage
    // We expect two planes: plane 0 holds luma (Y), plane 1 holds chroma (CbCr)
    if (CVPixelBufferGetPlaneCount(pixelBuffer) < 2) { return }
    capturedImageTextureY = createTexture(fromPixelBuffer: pixelBuffer, pixelFormat:.r8Unorm, planeIndex:0)!
    capturedImageTextureCbCr = createTexture(fromPixelBuffer: pixelBuffer, pixelFormat:.rg8Unorm, planeIndex:1)!
}

Next, in updateImagePlane() we update the texture coordinates of our image plane to aspect fill the viewport:

func updateImagePlane(frame: ARFrame) {
    let displayToCameraTransform = frame.displayTransform(withViewportSize: viewportSize, orientation: .landscapeRight).inverted()
    let vertexData = imagePlaneVertexBuffer.contents().assumingMemoryBound(to: Float.self)
    for index in 0...3 {
        // Each vertex is 4 floats; the texture coordinate is the last 2
        let textureCoordIndex = 4 * index + 2
        let textureCoord = CGPoint(x: CGFloat(planeVertexData[textureCoordIndex]), y: CGFloat(planeVertexData[textureCoordIndex + 1]))
        let transformedCoord = textureCoord.applying(displayToCameraTransform)
        vertexData[textureCoordIndex] = Float(transformedCoord.x)
        vertexData[textureCoordIndex + 1] = Float(transformedCoord.y)
    }
}
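
The index math relies on the vertex layout of the image plane. Here is a sketch in plain Swift, assuming four vertices interleaved as x, y, u, v; the vertex values and the mirrorU() function, which stands in for the display-to-camera transform, are made up for illustration.

```swift
// Four vertices of a full-screen quad, interleaved as x, y, u, v (assumed layout).
let planeVertexData: [Float] = [
    -1, -1,  0, 1,
     1, -1,  1, 1,
    -1,  1,  0, 0,
     1,  1,  1, 0
]

// Stand-in for the display-to-camera transform: mirror u horizontally.
func mirrorU(u: Float, v: Float) -> (u: Float, v: Float) {
    return (u: 1 - u, v: v)
}

var vertexData = planeVertexData
for index in 0...3 {
    // Skip the x and y of this vertex to reach its texture coordinate.
    let textureCoordIndex = 4 * index + 2
    let transformed = mirrorU(u: planeVertexData[textureCoordIndex],
                              v: planeVertexData[textureCoordIndex + 1])
    vertexData[textureCoordIndex] = transformed.u
    vertexData[textureCoordIndex + 1] = transformed.v
}
```

Only the u and v entries change; the positions stay put, which is why the quad always covers the screen while the camera image is cropped or rotated inside it.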

Next, in drawCapturedImage() we draw the camera feed in the scene:

func drawCapturedImage(renderEncoder: MTLRenderCommandEncoder) {
    // Nothing to draw until both planes of the camera image are available
    guard capturedImageTextureY != nil && capturedImageTextureCbCr != nil else { return }
    renderEncoder.setVertexBuffer(imagePlaneVertexBuffer, offset: 0, index: 0)
    renderEncoder.setFragmentTexture(capturedImageTextureY, index: 1)
    renderEncoder.setFragmentTexture(capturedImageTextureCbCr, index: 2)
    renderEncoder.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: 4)
}

Finally, in drawAnchorGeometry() we draw the anchors for the virtual content we created:

func drawAnchorGeometry(renderEncoder: MTLRenderCommandEncoder) {
    guard anchorInstanceCount > 0 else { return }
    renderEncoder.setVertexBuffer(anchorUniformBuffer, offset: anchorUniformBufferOffset, index: 2)
    renderEncoder.setVertexBuffer(sharedUniformBuffer, offset: sharedUniformBufferOffset, index: 3)
    renderEncoder.setFragmentBuffer(sharedUniformBuffer, offset: sharedUniformBufferOffset, index: 3)
    for bufferIndex in 0..<mesh.vertexBuffers.count {
        let vertexBuffer = mesh.vertexBuffers[bufferIndex]
        renderEncoder.setVertexBuffer(vertexBuffer.buffer, offset: vertexBuffer.offset, index:bufferIndex)
    }
    // One instanced draw call per submesh covers all the anchors
    for submesh in mesh.submeshes {
        renderEncoder.drawIndexedPrimitives(type: submesh.primitiveType, indexCount: submesh.indexCount, indexType: submesh.indexType, indexBuffer: submesh.indexBuffer.buffer, indexBufferOffset: submesh.indexBuffer.offset, instanceCount: anchorInstanceCount)
    }
}

Back to the setupPipeline() function which we briefly mentioned earlier. We create two render pipeline state objects, one for the captured image (the camera feed) and one for the anchors we create when placing virtual objects in the scene. As expected, each state object has its own pair of vertex and fragment functions, which brings us to the last file we need to look at: the Shaders.metal file. In the first pair of shaders for the captured image, we pass through the image vertex’s position and texture coordinate in the vertex shader:

vertex ImageColorInOut capturedImageVertexTransform(ImageVertex in [[stage_in]]) {
    ImageColorInOut out;
    out.position = float4(in.position, 0.0, 1.0);
    out.texCoord = in.texCoord;
    return out;
}

In the fragment shader we sample the two textures to get the color at the given texture coordinate after which we return the converted RGB color:

fragment float4 capturedImageFragmentShader(ImageColorInOut in [[stage_in]],
                                            texture2d<float, access::sample> textureY [[ texture(1) ]],
                                            texture2d<float, access::sample> textureCbCr [[ texture(2) ]]) {
    constexpr sampler colorSampler(mip_filter::linear, mag_filter::linear, min_filter::linear);
    // Each float4 below is one column of the conversion matrix
    const float4x4 ycbcrToRGBTransform = float4x4(float4(+1.0000f, +1.0000f, +1.0000f, +0.0000f),
                                                  float4(+0.0000f, -0.3441f, +1.7720f, +0.0000f),
                                                  float4(+1.4020f, -0.7141f, +0.0000f, +0.0000f),
                                                  float4(-0.7010f, +0.5291f, -0.8860f, +1.0000f));
    float4 ycbcr = float4(textureY.sample(colorSampler, in.texCoord).r, textureCbCr.sample(colorSampler, in.texCoord).rg, 1.0);
    return ycbcrToRGBTransform * ycbcr;
}
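
Since the matrix is built from columns, the product expands to one short formula per color channel. Here is the same conversion as a CPU-side sketch in plain Swift, using the constants from the shader; the ycbcrToRGB() function is my own illustration and not part of the project.

```swift
// Expands ycbcrToRGBTransform * float4(y, cb, cr, 1) into per-channel formulas.
// All inputs are normalized to the 0...1 range, as sampled from the textures.
func ycbcrToRGB(y: Float, cb: Float, cr: Float) -> (r: Float, g: Float, b: Float) {
    let r = y + 1.4020 * cr - 0.7010
    let g = y - 0.3441 * cb - 0.7141 * cr + 0.5291
    let b = y + 1.7720 * cb - 0.8860
    return (r, g, b)
}

// Neutral chroma (cb = cr = 0.5) leaves only luma, so we get shades of gray.
let white = ycbcrToRGB(y: 1.0, cb: 0.5, cr: 0.5)
let black = ycbcrToRGB(y: 0.0, cb: 0.5, cr: 0.5)

// Maximum cr with neutral cb pushes the color toward red.
let reddish = ycbcrToRGB(y: 0.5, cb: 0.5, cr: 1.0)
```

The constant terms in the last column are what recenters the chroma values around 0.5, which is why a pixel with neutral chroma comes out as pure gray.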

In the second pair of shaders for the anchor geometry, in the vertex shader we calculate the position of our vertex in clip space and output it for clipping and rasterization, then color each face a different color, then calculate the position of our vertex in eye space and finally rotate our normals to world coordinates:

vertex ColorInOut anchorGeometryVertexTransform(Vertex in [[stage_in]],
                                                constant SharedUniforms &sharedUniforms [[ buffer(3) ]],
                                                constant InstanceUniforms *instanceUniforms [[ buffer(2) ]],
                                                ushort vid [[vertex_id]],
                                                ushort iid [[instance_id]]) {
    ColorInOut out;
    float4 position = float4(in.position, 1.0);
    float4x4 modelMatrix = instanceUniforms[iid].modelMatrix;
    float4x4 modelViewMatrix = sharedUniforms.viewMatrix * modelMatrix;
    out.position = sharedUniforms.projectionMatrix * modelViewMatrix * position;
    // Every 4 consecutive vertices share one face color
    ushort colorID = vid / 4 % 6;
    out.color = colorID == 0 ? float4(0.0, 1.0, 0.0, 1.0)  // Right face
              : colorID == 1 ? float4(1.0, 0.0, 0.0, 1.0)  // Left face
              : colorID == 2 ? float4(0.0, 0.0, 1.0, 1.0)  // Top face
              : colorID == 3 ? float4(1.0, 0.5, 0.0, 1.0)  // Bottom face
              : colorID == 4 ? float4(1.0, 1.0, 0.0, 1.0)  // Back face
              :                float4(1.0, 1.0, 1.0, 1.0); // Front face
    out.eyePosition = half3((modelViewMatrix * position).xyz);
    float4 normal = modelMatrix * float4(in.normal.x, in.normal.y, in.normal.z, 0.0f);
    out.normal = normalize(half3(normal.xyz));
    return out;
}
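
The expression vid / 4 % 6 is what assigns one color per face. A quick sketch in plain Swift, assuming the cube mesh has 24 vertices laid out as 4 consecutive vertices per face (the faceIndex() helper is mine, for illustration):

```swift
// Same arithmetic as the shader: integer division groups vertices in fours,
// the modulo keeps the result within the six faces.
func faceIndex(forVertexID vid: Int) -> Int {
    return vid / 4 % 6
}

// Face index for every vertex of a 24-vertex cube.
let faces = (0..<24).map(faceIndex(forVertexID:))
// faces is [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]
```

If the mesh used a different vertex order, or shared vertices between faces, the colors would no longer line up with whole faces.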

In the fragment shader, we calculate the contribution of the directional light as a sum of diffuse and specular terms, then we compute the final color by multiplying the sample from the color maps by the fragment’s lighting value and finally use the color we just computed and the alpha channel of the color map for this fragment’s alpha value:

fragment float4 anchorGeometryFragmentLighting(ColorInOut in [[stage_in]],
                                               constant SharedUniforms &uniforms [[ buffer(3) ]]) {
    float3 normal = float3(in.normal);
    float3 directionalContribution = float3(0);
    {
        // Diffuse term: how directly the surface faces the light
        float nDotL = saturate(dot(normal, -uniforms.directionalLightDirection));
        float3 diffuseTerm = uniforms.directionalLightColor * nDotL;
        // Specular term: highlight based on the halfway vector
        float3 halfwayVector = normalize(-uniforms.directionalLightDirection - float3(in.eyePosition));
        float reflectionAngle = saturate(dot(normal, halfwayVector));
        float specularIntensity = saturate(powr(reflectionAngle, uniforms.materialShininess));
        float3 specularTerm = uniforms.directionalLightColor * specularIntensity;
        directionalContribution = diffuseTerm + specularTerm;
    }
    float3 ambientContribution = uniforms.ambientLightColor;
    float3 lightContributions = ambientContribution + directionalContribution;
    float3 color = in.color.rgb * lightContributions;
    return float4(color, in.color.w);
}
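
To check the lighting math by hand, here is a CPU-side sketch of the directional term in plain Swift. It mirrors the shader line by line; the helper names (dotProduct, unitVector, clamp01, directionalContribution) and the sample inputs are mine, for illustration only.

```swift
import Foundation

func dotProduct(_ a: SIMD3<Float>, _ b: SIMD3<Float>) -> Float {
    return (a * b).sum()
}

func unitVector(_ v: SIMD3<Float>) -> SIMD3<Float> {
    return v / dotProduct(v, v).squareRoot()
}

// Equivalent of saturate() in the shader: clamp to the 0...1 range.
func clamp01(_ x: Float) -> Float {
    return min(max(x, 0), 1)
}

// Sum of the diffuse and specular terms, following the fragment shader.
func directionalContribution(normal: SIMD3<Float>,
                             eyePosition: SIMD3<Float>,
                             lightDirection: SIMD3<Float>,
                             lightColor: SIMD3<Float>,
                             shininess: Float) -> SIMD3<Float> {
    let nDotL = clamp01(dotProduct(normal, -lightDirection))
    let diffuseTerm = lightColor * nDotL
    let halfwayVector = unitVector(-lightDirection - eyePosition)
    let reflectionAngle = clamp01(dotProduct(normal, halfwayVector))
    let specularIntensity = clamp01(pow(reflectionAngle, shininess))
    let specularTerm = lightColor * specularIntensity
    return diffuseTerm + specularTerm
}

let lightDirection = SIMD3<Float>(0, 0, -1)
let lightColor = SIMD3<Float>(0.6, 0.6, 0.6)
let eyePosition = SIMD3<Float>(0, 0, -1)

// A surface facing the light head-on gets full diffuse and full specular.
let facingLight = directionalContribution(normal: SIMD3<Float>(0, 0, 1), eyePosition: eyePosition,
                                          lightDirection: lightDirection, lightColor: lightColor, shininess: 30)

// A surface facing away gets nothing from the directional light.
let facingAway = directionalContribution(normal: SIMD3<Float>(0, 0, -1), eyePosition: eyePosition,
                                         lightDirection: lightDirection, lightColor: lightColor, shininess: 30)
```

In the first case both terms reach their maximum, so the contribution is twice the light color, about 1.2 per channel. The second case is why faces turned away from the light are lit by the ambient term alone.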

If you run the app, you should be able to tap on the screen to add cubes on top of your live camera view, and move away from, closer to or around the cubes to see the different colors on each face.

In the next part of the series we will look more into Tracking and Scene Understanding and see how plane detection, hit-testing, collisions and physics can make our experience even greater. The source code is posted on GitHub as usual.

Until next time!
