Augmented Reality provides a way of overlaying virtual content on top of real-world views, usually obtained from a mobile device's camera. Last month at WWDC 2017 we were all thrilled to see Apple's new ARKit framework, a high-level API that works with A9-powered devices or newer running iOS 11. Some of the ARKit experiments we've already seen are outstanding, such as this one below:
There are three distinct layers in an ARKit application:
Tracking - world tracking is done using visual-inertial odometry, with no external setup required.
Scene Understanding - the ability to detect scene attributes using plane detection, hit-testing and light estimation.
Rendering - easily integrated thanks to the template AR views provided by SpriteKit and SceneKit, but it can also be customized with Metal. All the pre-render processing is done by ARKit, which is also responsible for image capture using AVFoundation and CoreMotion.
In this first part of the series we will look mostly at Rendering with Metal, and we will cover the other two stages in the next part. In an AR application, Tracking and Scene Understanding are handled entirely by the ARKit framework, while Rendering can be handled by SpriteKit, SceneKit or Metal:
To get started, we need an ARSession instance, which is set up with an ARSessionConfiguration object. We then pass this configuration to the session's run() function. The session also keeps AVCaptureSession and CMMotionManager objects running at the same time to get the image and motion data used for tracking. Finally, the session outputs the current frame as an ARFrame object:
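A minimal sketch of that setup (class names follow the iOS 11 beta SDK this article is based on, where the configuration types were still called ARSessionConfiguration and ARWorldTrackingSessionConfiguration):

```swift
import ARKit

let session = ARSession()
let configuration = ARWorldTrackingSessionConfiguration()
// Starting the session spins up AVCaptureSession and CMMotionManager internally.
session.run(configuration)

// The session's most recent output can be polled at any time:
if let frame = session.currentFrame {
    print(frame.camera.transform)   // the camera pose for this frame
}
```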
The ARSessionConfiguration object contains information about the type of tracking the session will have. The ARSessionConfiguration base configuration class provides 3 degrees of freedom tracking (the device orientation) while its subclass, ARWorldTrackingSessionConfiguration, provides 6 degrees of freedom tracking (the device position and orientation).
When a device does not support world tracking, it falls back to the base configuration:
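The fallback can be expressed with the configuration's isSupported class property, something like this:

```swift
// Fall back to 3DOF orientation-only tracking on devices
// that cannot do world tracking.
let configuration: ARSessionConfiguration
if ARWorldTrackingSessionConfiguration.isSupported {
    configuration = ARWorldTrackingSessionConfiguration()  // 6DOF: position + orientation
} else {
    configuration = ARSessionConfiguration()               // 3DOF: orientation only
}
session.run(configuration)
```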
An ARFrame contains the captured image and tracking information, as well as scene information via ARAnchor objects, which carry real-world position and orientation and can easily be added to, updated in or removed from the session. Tracking is the ability to determine physical location in real time. World tracking, however, determines both position and orientation; it works with physical distances, is relative to the starting position and provides 3D feature points.
The last component of an ARFrame is the ARCamera object, which facilitates transforms (translation, rotation, scaling) and carries the tracking state and camera intrinsics. The quality of tracking relies heavily on uninterrupted sensor data and static scenes, and it is more accurate when the environment is textured and has plenty of complexity. The tracking state has three values: Not Available (the camera only has the identity matrix), Limited (the scene has insufficient features or is not static enough) and Normal (the camera is populated with data). Session interruptions occur when camera input is not available or when tracking is stopped:
Rendering can be done in SceneKit using the ARSCNView's delegate to add, update or remove nodes. Similarly, rendering can be done in SpriteKit using the ARSKView delegate, which maps SKNodes to ARAnchor objects. Since SpriteKit is 2D, it cannot use the real-world camera position, so it projects the anchor positions into the ARSKView and then renders the sprite as a billboard (plane) at this projected location, so the sprite always faces the camera. For Metal, there is no customized AR view, so that responsibility falls into the programmer's hands. To process and render each frame we need to:
draw background camera image (generate a texture from the pixel buffer)
update the virtual camera
update the lighting
update the transforms for geometry
All this information is in the ARFrame object. To access the frame there are two options: polling, or using a delegate; we will use the latter. I took the ARKit template for Metal and stripped it down to a minimum so I could better understand how it works. The first thing I did was remove all the C dependencies, so bridging is no longer necessary. Having a bridging header in place will be useful in the future, so types and enum constants can be shared between API code and shaders, but for the purpose of this article it is not needed.
Next, on to ViewController which will act as both our MTKView and ARSession delegates. We create a Renderer instance that will work with the delegates for real time updates to the application:
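The setup looks roughly like this (the Renderer initializer signature and drawRectResized(size:) follow the template; the rest is standard UIKit/MetalKit plumbing):

```swift
import UIKit
import Metal
import MetalKit
import ARKit

class ViewController: UIViewController, MTKViewDelegate, ARSessionDelegate {
    var session: ARSession!
    var renderer: Renderer!

    override func viewDidLoad() {
        super.viewDidLoad()
        session = ARSession()
        session.delegate = self
        if let view = self.view as? MTKView {
            view.device = MTLCreateSystemDefaultDevice()
            view.delegate = self
            // The Renderer drives all drawing; the MTKView is its render destination.
            renderer = Renderer(session: session,
                                metalDevice: view.device!,
                                renderDestination: view)
            renderer.drawRectResized(size: view.bounds.size)
        }
        let tapGesture = UITapGestureRecognizer(target: self,
            action: #selector(handleTap(gestureRecognize:)))
        view.addGestureRecognizer(tapGesture)
    }
}
```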
As you can see, we also added a gesture recognizer which we will use to add virtual content to our view. We first get the session’s current frame, then create a translation to put our object in front of the camera (0.3 meters in this case) and finally add a new anchor to our session using this transform:
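The gesture handler, adapted from the template (it lives in ViewController, so it has access to the session property):

```swift
@objc func handleTap(gestureRecognize: UITapGestureRecognizer) {
    guard let currentFrame = session.currentFrame else { return }
    // In camera space, -z points out of the screen, so translating by
    // -0.3 along z places the object 0.3 meters in front of the camera.
    var translation = matrix_identity_float4x4
    translation.columns.3.z = -0.3
    let transform = simd_mul(currentFrame.camera.transform, translation)
    // The renderer picks this anchor up on the next frame in updateAnchors().
    session.add(anchor: ARAnchor(transform: transform))
}
```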
We use the viewWillAppear() and viewWillDisappear() methods to start and pause the session:
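A sketch of those two overrides (again using the beta-era ARWorldTrackingSessionConfiguration name):

```swift
override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)
    // Start (or restart) tracking when the view becomes visible.
    let configuration = ARWorldTrackingSessionConfiguration()
    session.run(configuration)
}

override func viewWillDisappear(_ animated: Bool) {
    super.viewWillDisappear(animated)
    // Stop tracking and the camera feed while the view is hidden.
    session.pause()
}
```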
What’s left is only the delegate methods which we need to react to view updates or session errors and interruptions:
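In outline, these delegate methods simply forward to the renderer or react to session events (the error and interruption bodies are left as stubs here):

```swift
// MARK: - MTKViewDelegate
func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize) {
    renderer.drawRectResized(size: size)
}

func draw(in view: MTKView) {
    renderer.update()
}

// MARK: - ARSessionDelegate
func session(_ session: ARSession, didFailWithError error: Error) {
    // Inform the user and offer to restart the session.
}

func sessionWasInterrupted(_ session: ARSession) {
    // The camera feed is unavailable, e.g. the app moved to the background.
}

func sessionInterruptionEnded(_ session: ARSession) {
    // Tracking may have drifted; reset tracking and/or remove anchors if needed.
}
```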
Let’s move to the Renderer.swift file now. The first thing to notice is the use of a very handy protocol that will give us access to all the MTKView properties we need for the draw call later:
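The protocol, as it appears in the template:

```swift
protocol RenderDestinationProvider {
    var currentRenderPassDescriptor: MTLRenderPassDescriptor? { get }
    var currentDrawable: CAMetalDrawable? { get }
    var colorPixelFormat: MTLPixelFormat { get set }
    var depthStencilPixelFormat: MTLPixelFormat { get set }
    var sampleCount: Int { get set }
}
```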
Now you can simply extend the MTKView class (in ViewController) so it conforms to this protocol:
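Since MTKView already provides every one of these properties, the conformance is an empty extension:

```swift
extension MTKView: RenderDestinationProvider {}
```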
To have a high level view of the Renderer class, here is the pseudocode:
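Condensed from the template, the class boils down to this (the commented-out calls are the real methods we walk through below):

```swift
class Renderer {
    init(session: ARSession, metalDevice device: MTLDevice,
         renderDestination: RenderDestinationProvider) {
        // setupPipeline()
        // setupAssets()
    }

    func update() {
        // updateBufferStates()
        // updateSharedUniforms() / updateAnchors()
        // updateCapturedImageTextures() / updateImagePlane()
        // drawCapturedImage()
        // drawAnchorGeometry()
    }
}
```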
As always, we first set up the pipeline, here with the setupPipeline() function. Then, in setupAssets() we create our model, which will be loaded every time we use our tap gesture recognizer. The MTKView delegate will call the update() function for the needed updates and draw calls. Let's look at each of them in detail. First we have updateBufferStates(), which updates the locations we write to in our buffers for the current frame (we use a ring buffer with 3 slots in this case):
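A sketch of it, with the constant and property names taken from the template:

```swift
func updateBufferStates() {
    // Cycle through the ring buffer so the CPU writes into a slot
    // the GPU is not currently reading from.
    uniformBufferIndex = (uniformBufferIndex + 1) % kMaxBuffersInFlight  // 3 slots
    sharedUniformBufferOffset = kAlignedSharedUniformsSize * uniformBufferIndex
    anchorUniformBufferOffset = kAlignedInstanceUniformsSize * uniformBufferIndex
    sharedUniformBufferAddress = sharedUniformBuffer.contents()
        .advanced(by: sharedUniformBufferOffset)
    anchorUniformBufferAddress = anchorUniformBuffer.contents()
        .advanced(by: anchorUniformBufferOffset)
}
```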
Next, in updateSharedUniforms() we update the shared uniforms of the frame and set up lighting for the scene:
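Roughly as in the template (SharedUniforms is now a plain Swift struct since we removed the C bridging; note that projectionMatrix(withViewportSize:orientation:zNear:zFar:) is the beta-era spelling of this ARCamera API, and the light constants are the template's defaults):

```swift
func updateSharedUniforms(frame: ARFrame) {
    let uniforms = sharedUniformBufferAddress
        .assumingMemoryBound(to: SharedUniforms.self)
    // The view matrix is the inverse of the camera's world transform.
    uniforms.pointee.viewMatrix = simd_inverse(frame.camera.transform)
    uniforms.pointee.projectionMatrix = frame.camera.projectionMatrix(
        withViewportSize: viewportSize, orientation: .landscapeRight,
        zNear: 0.001, zFar: 1000)
    // Scale scene lighting by ARKit's estimate of the ambient light.
    var ambientIntensity: Float = 1.0
    if let lightEstimate = frame.lightEstimate {
        ambientIntensity = Float(lightEstimate.ambientIntensity) / 1000.0
    }
    uniforms.pointee.ambientLightColor = vector3(0.5, 0.5, 0.5) * ambientIntensity
    uniforms.pointee.directionalLightDirection = simd_normalize(vector3(0.0, 0.0, -1.0))
    uniforms.pointee.directionalLightColor = vector3(0.6, 0.6, 0.6) * ambientIntensity
}
```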
Next, in updateAnchors() we update the anchor uniform buffer with transforms of the current frame’s anchors:
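As in the template (InstanceUniforms is the per-anchor Swift struct holding a model matrix; kMaxAnchorInstanceCount caps how many anchors we render):

```swift
func updateAnchors(frame: ARFrame) {
    anchorInstanceCount = min(frame.anchors.count, kMaxAnchorInstanceCount)
    // If we have too many anchors, keep only the most recent ones.
    var anchorOffset = 0
    if anchorInstanceCount == kMaxAnchorInstanceCount {
        anchorOffset = max(frame.anchors.count - kMaxAnchorInstanceCount, 0)
    }
    for index in 0..<anchorInstanceCount {
        let anchor = frame.anchors[index + anchorOffset]
        // Flip the Z axis to convert from ARKit's right-handed
        // coordinate system to our renderer's coordinate system.
        var coordinateSpaceTransform = matrix_identity_float4x4
        coordinateSpaceTransform.columns.2.z = -1.0
        let modelMatrix = simd_mul(anchor.transform, coordinateSpaceTransform)
        let anchorUniforms = anchorUniformBufferAddress
            .assumingMemoryBound(to: InstanceUniforms.self)
            .advanced(by: index)
        anchorUniforms.pointee.modelMatrix = modelMatrix
    }
}
```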
Next, in updateCapturedImageTextures() we create two textures from the provided frame’s captured image:
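The camera delivers the image in YCbCr, so we build one texture per plane (createTexture is the template's helper around a CVMetalTextureCache):

```swift
func updateCapturedImageTextures(frame: ARFrame) {
    // Plane 0 is full-resolution luma (Y); plane 1 is half-resolution chroma (CbCr).
    let pixelBuffer = frame.capturedImage
    guard CVPixelBufferGetPlaneCount(pixelBuffer) >= 2 else { return }
    capturedImageTextureY = createTexture(fromPixelBuffer: pixelBuffer,
                                          pixelFormat: .r8Unorm, planeIndex: 0)
    capturedImageTextureCbCr = createTexture(fromPixelBuffer: pixelBuffer,
                                             pixelFormat: .rg8Unorm, planeIndex: 1)
}
```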
Next, in updateImagePlane() we update the texture coordinates of our image plane to aspect fill the viewport:
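Along the lines of the template (kImagePlaneVertexData is the template's interleaved position/texCoord array for a full-screen quad, and displayTransform(withViewportSize:orientation:) is the beta-era ARFrame API):

```swift
func updateImagePlane(frame: ARFrame) {
    // Transform that maps camera-image coordinates so the image
    // aspect-fills the current viewport.
    let displayToCameraTransform = frame.displayTransform(
        withViewportSize: viewportSize, orientation: .landscapeRight).inverted()
    let vertexData = imagePlaneVertexBuffer.contents()
        .assumingMemoryBound(to: Float.self)
    for index in 0...3 {
        // Each vertex is [x, y, u, v]; rewrite only the texture coordinates.
        let texCoordIndex = 4 * index + 2
        let texCoord = CGPoint(x: CGFloat(kImagePlaneVertexData[texCoordIndex]),
                               y: CGFloat(kImagePlaneVertexData[texCoordIndex + 1]))
        let transformed = texCoord.applying(displayToCameraTransform)
        vertexData[texCoordIndex] = Float(transformed.x)
        vertexData[texCoordIndex + 1] = Float(transformed.y)
    }
}
```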
Next, in drawCapturedImage() we draw the camera feed in the scene:
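As in the template: we draw the full-screen quad with the two YCbCr textures bound, and a depth state that doesn't write depth, so virtual content always renders on top of the feed:

```swift
func drawCapturedImage(renderEncoder: MTLRenderCommandEncoder) {
    guard let textureY = capturedImageTextureY,
          let textureCbCr = capturedImageTextureCbCr else { return }
    renderEncoder.setCullMode(.none)
    renderEncoder.setRenderPipelineState(capturedImagePipelineState)
    renderEncoder.setDepthStencilState(capturedImageDepthState)
    renderEncoder.setVertexBuffer(imagePlaneVertexBuffer, offset: 0, index: 0)
    renderEncoder.setFragmentTexture(textureY, index: 1)
    renderEncoder.setFragmentTexture(textureCbCr, index: 2)
    renderEncoder.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: 4)
}
```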
Finally, in drawAnchorGeometry() we draw the anchors for the virtual content we created:
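Sketched after the template, with cubeMesh standing in for the MTKMesh we built in setupAssets(); all anchors are drawn in one instanced call:

```swift
func drawAnchorGeometry(renderEncoder: MTLRenderCommandEncoder) {
    guard anchorInstanceCount > 0 else { return }
    renderEncoder.setCullMode(.back)
    renderEncoder.setRenderPipelineState(anchorPipelineState)
    renderEncoder.setDepthStencilState(anchorDepthState)
    // Buffer indices 2 and 3 match the [[buffer(n)]] bindings in the shaders.
    renderEncoder.setVertexBuffer(anchorUniformBuffer,
        offset: anchorUniformBufferOffset, index: 2)
    renderEncoder.setVertexBuffer(sharedUniformBuffer,
        offset: sharedUniformBufferOffset, index: 3)
    renderEncoder.setFragmentBuffer(sharedUniformBuffer,
        offset: sharedUniformBufferOffset, index: 3)
    for (index, vertexBuffer) in cubeMesh.vertexBuffers.enumerated() {
        renderEncoder.setVertexBuffer(vertexBuffer.buffer,
            offset: vertexBuffer.offset, index: index)
    }
    for submesh in cubeMesh.submeshes {
        renderEncoder.drawIndexedPrimitives(type: submesh.primitiveType,
            indexCount: submesh.indexCount, indexType: submesh.indexType,
            indexBuffer: submesh.indexBuffer.buffer,
            indexBufferOffset: submesh.indexBuffer.offset,
            instanceCount: anchorInstanceCount)
    }
}
```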
Back to the setupPipeline() function, which we briefly mentioned earlier. We create two render pipeline state objects: one for the captured image (the camera feed) and one for the anchors we create when placing virtual objects in the scene. As expected, each state object has its own pair of vertex and fragment functions, which brings us to the last file we need to look at - the Shaders.metal file. In the first pair of shaders, for the captured image, the vertex shader simply passes through the image vertex's position and texture coordinate:
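It looks like this (with literal attribute indices, since we removed the shared C constants):

```metal
#include <metal_stdlib>
using namespace metal;

typedef struct {
    float2 position [[attribute(0)]];
    float2 texCoord [[attribute(1)]];
} ImageVertex;

typedef struct {
    float4 position [[position]];
    float2 texCoord;
} ImageColorInOut;

// Pass-through: the quad is already in clip space.
vertex ImageColorInOut capturedImageVertexTransform(ImageVertex in [[stage_in]]) {
    ImageColorInOut out;
    out.position = float4(in.position, 0.0, 1.0);
    out.texCoord = in.texCoord;
    return out;
}
```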
In the fragment shader we sample the two textures to get the color at the given texture coordinate after which we return the converted RGB color:
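The YCbCr-to-RGB conversion matrix below is the one from the template:

```metal
fragment float4 capturedImageFragmentShader(
        ImageColorInOut in [[stage_in]],
        texture2d<float, access::sample> textureY    [[texture(1)]],
        texture2d<float, access::sample> textureCbCr [[texture(2)]]) {
    constexpr sampler colorSampler(mip_filter::linear,
                                   mag_filter::linear,
                                   min_filter::linear);
    const float4x4 ycbcrToRGBTransform = float4x4(
        float4(+1.0000f, +1.0000f, +1.0000f, +0.0000f),
        float4(+0.0000f, -0.3441f, +1.7720f, +0.0000f),
        float4(+1.4020f, -0.7141f, +0.0000f, +0.0000f),
        float4(-0.7010f, +0.5291f, -0.8860f, +1.0000f)
    );
    // Sample Y and CbCr at the same coordinate, then convert to RGB.
    float4 ycbcr = float4(textureY.sample(colorSampler, in.texCoord).r,
                          textureCbCr.sample(colorSampler, in.texCoord).rg, 1.0);
    return ycbcrToRGBTransform * ycbcr;
}
```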
In the second pair of shaders, for the anchor geometry, the vertex shader calculates the position of our vertex in clip space and outputs it for clipping and rasterization, then colors each face differently, then calculates the position of our vertex in eye space, and finally rotates our normals into world coordinates:
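A sketch of it; the per-face coloring is this article's addition to the template, and the vid / 4 % 6 trick assumes the cube mesh stores four consecutive vertices per face (as an MDLMesh box with 24 vertices does):

```metal
typedef struct {
    float3 position [[attribute(0)]];
    float3 normal   [[attribute(1)]];
} Vertex;

typedef struct {
    float4 position [[position]];
    float4 color;
    half3  eyePosition;
    half3  normal;
} ColorInOut;

vertex ColorInOut anchorGeometryVertexTransform(
        Vertex in [[stage_in]],
        constant SharedUniforms &sharedUniforms [[buffer(3)]],
        constant InstanceUniforms *instanceUniforms [[buffer(2)]],
        ushort vid [[vertex_id]],
        ushort iid [[instance_id]]) {
    ColorInOut out;
    float4 position = float4(in.position, 1.0);
    float4x4 modelMatrix = instanceUniforms[iid].modelMatrix;
    float4x4 modelViewMatrix = sharedUniforms.viewMatrix * modelMatrix;
    // Clip-space position for clipping and rasterization.
    out.position = sharedUniforms.projectionMatrix * modelViewMatrix * position;
    // Color each cube face differently (4 vertices per face).
    ushort colorID = vid / 4 % 6;
    out.color = colorID == 0 ? float4(0, 1, 0, 1)
              : colorID == 1 ? float4(1, 0, 0, 1)
              : colorID == 2 ? float4(0, 0, 1, 1)
              : colorID == 3 ? float4(1, 0.5, 0, 1)
              : colorID == 4 ? float4(1, 1, 0, 1)
              : float4(1, 1, 1, 1);
    // Eye-space position, used by the fragment shader for lighting.
    out.eyePosition = half3((modelViewMatrix * position).xyz);
    // Rotate the normal into world coordinates (w = 0: direction, not point).
    float4 normal = modelMatrix * float4(in.normal, 0.0);
    out.normal = normalize(half3(normal.xyz));
    return out;
}
```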
In the fragment shader, we calculate the contribution of the directional light as a sum of diffuse and specular terms, then we compute the final color by multiplying the sample from the color maps by the fragment’s lighting value and finally use the color we just computed and the alpha channel of the color map for this fragment’s alpha value:
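Following the template's lighting code (the specular term uses a Blinn-Phong halfway vector, and materialShininess is a field of SharedUniforms):

```metal
fragment float4 anchorGeometryFragmentLighting(
        ColorInOut in [[stage_in]],
        constant SharedUniforms &uniforms [[buffer(3)]]) {
    float3 normal = float3(in.normal);
    // Diffuse: Lambertian reflection of the directional light.
    float nDotL = saturate(dot(normal, -uniforms.directionalLightDirection));
    float3 diffuseTerm = uniforms.directionalLightColor * nDotL;
    // Specular: Blinn-Phong halfway-vector highlight.
    float3 halfwayVector = normalize(-uniforms.directionalLightDirection
                                     - float3(in.eyePosition));
    float reflectionAngle = saturate(dot(normal, halfwayVector));
    float3 specularTerm = uniforms.directionalLightColor
                        * pow(reflectionAngle, uniforms.materialShininess);
    // Combine ambient and directional contributions, modulate the face color.
    float3 lightContributions = uniforms.ambientLightColor
                              + diffuseTerm + specularTerm;
    float3 color = in.color.rgb * lightContributions;
    // Keep the color map's alpha as this fragment's alpha.
    return float4(color, in.color.a);
}
```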
If you run the app, you should be able to tap on the screen to add cubes on top of your live camera view, and to move closer to, away from or around the cubes to see the different colors of their faces, like this:
In the next part of the series we will look more into Tracking and Scene Understanding and see how plane detection, hit-testing, collisions and physics can make our experience even greater. The source code is posted on GitHub as usual.