Will Driscoll talks about the volumetric video tools and services his company Wild Capture develops to make web content more relevant for users and, ultimately, populate the metaverse.
Web 3.0 is fast approaching, bringing with it many expectations. People are looking ahead to semantic content delivered by search engines that use AI to gain their own understanding of keywords and results, and to decentralised applications and services supplied from the edge in a distributed approach, without a central authority. Certainly, they are looking for a clearer idea of the metaverse, still at a stage where it is defined differently by each person you talk to.
Across all of these expectations is the goal of making web content more relevant for users. For example, a website using AI should be able to filter through and rapidly supply the data it thinks a specific user will find appropriate. As Web 3.0 continues to emerge, content creators like Will Driscoll, CEO at Wild Capture, are excited about the promise of creating high-quality messaging – communication, marketing, experiences – in virtual spaces using digital humans.
However, digital human performances, and lifelike digital human experiences, have so far been held back by various technological issues. The important question for many creators is how to actually get the content into the immersive space to view and enjoy. Furthermore, as the scale of 3D applications and technologies increases, 3D digital humans will be needed that can blend directly into the many varied pipelines that utilise spatial media.
If 3D web applications are to mature as expected, Will believes that volumetric video is the way forward for digital humans.
Volumetric Video Production
Wild Capture develops spatial media products used to produce digital humans using volumetric techniques and industry-standard practices. Volumetric video production uses multiple cameras and sensors to record a subject, capturing enough information to create a three-dimensional (full-volume) data set of the subject rather than a flat image. On set, the cameras need to be rigged uniformly around the subject in order to capture it in motion from all angles. In post, each camera in the array is defined and calibrated, and the combined output is synced and processed to produce a 3D mesh of the subject.
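As a rough illustration of the reconstruction step – not Wild Capture's actual pipeline – the sketch below fuses synced, calibrated depth views into a single world-space point cloud for one frame. The camera model, array shapes and the simple concatenation at the end are assumptions made for illustration; a production pipeline would go on to reconstruct a watertight, textured mesh from this data.

```python
# A minimal sketch (not Wild Capture's pipeline): fusing synced, calibrated
# depth cameras into one world-space point cloud for a single frame.
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift a depth map (H, W) into world space using intrinsics K (3x3)
    and a 4x4 camera-to-world extrinsic matrix."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0                                   # skip pixels with no depth
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)[:, valid]
    return (cam_to_world @ pts_cam)[:3].T           # (N, 3) world-space points

def fuse_frame(views):
    """views: list of (depth_map, K, cam_to_world) tuples captured at the
    same timecode. Returns one combined point cloud for the frame."""
    return np.concatenate([backproject(d, K, M) for d, K, M in views], axis=0)
```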
It’s an interesting but fairly complex process. Therefore, what happens next – adding the 3D mesh to different environments, game engines and virtual worlds – is what interests Wild Capture the most, and what they want their customers to be able to focus on. Ultimately, their goal is to help users produce inspiring, engaging content and realistic digital humans for use in media production, web-based applications and software development.
Their team carries out performance capture, real-time machine learning workflows and production services for each stage of the volumetric process. Earlier this year, they introduced their Digital Human Platform, which makes video production tools and consulting services available for volumetric digital humans and crowds – essentially virtual world-building for the metaverse and games, VR and XR experiences, specialised performance capture and other applications.
The Wild Capture team working on a volumetric video stage.
Bounded Motion
In capturing realistic, immersive 3D performances, the team has found that a major challenge facing their artists when creating digital humans in 3D environments is boundaries. Volumetric video begins with a series of 3D shots, captured over time, with no simple way to connect the shots to each other or to translate the data to the real world. In its raw form, the model exists as a moving human shape that other elements cannot interact with. The texture colour on the 3D object frequently changes as well, with little consistency.
Will said, “Without the physical elements that most 3D animated characters possess, the result is not so compelling. A traditional 3D character has consistency throughout the animation from the first frame to the last, and deforms recognisably over time, with keyframes to identify the mesh in place and time. Subsequent frames use this keyframe mesh to compare their own space in time and place to identify how to deform that mesh.
“With volumetric video, the keyframe mesh changes every 5 to 10 frames depending on how much the character topology changes. These ongoing keyframe changes help keep the model accurate, and meanwhile the texture changes to match the actual captured footage.”
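To make that idea concrete, here is a hedged sketch of how a solver might decide where to cut new keyframe meshes. The 5-to-10-frame span comes from Will's description; the face-count change metric, threshold values and mesh attributes are illustrative assumptions, not Wild Capture's method.

```python
# Illustrative only: one heuristic for promoting frames to keyframes in a
# volumetric sequence, based on how far the topology drifts from the last
# keyframe. Thresholds and the change metric are assumptions.
def choose_keyframes(frames, min_span=5, max_span=10, change_threshold=0.15):
    """frames: list of mesh objects exposing a .num_faces attribute.
    Returns indices of frames promoted to keyframes."""
    keyframes = [0]
    for i in range(1, len(frames)):
        ref = frames[keyframes[-1]]
        span = i - keyframes[-1]
        # Relative change in face count as a crude proxy for topology change.
        change = abs(frames[i].num_faces - ref.num_faces) / max(ref.num_faces, 1)
        if span >= max_span or (span >= min_span and change > change_threshold):
            keyframes.append(i)
    return keyframes
```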
Animation Layers
In order to produce creative projects from volumetric performances, artists need tools that manipulate the real-life footage with an animation layer on top of the physical performance. “The animation layer is essentially an invisible layer that exists just above the surface of the character. It functions as a boundary layer with a push volume attached to that character,” Will said.
“The actual performance from the data set that we generate, which includes markerless motion capture, contains all the features of the traditional character starting with realistic velocity to interact, without any CG effects or objects. We use animation layers to add deformation to the original performance capture in a controlled way, for instance when characters need adjustment to interact in an environment or with other performances. The rest position is defined as the pose of the volumetric performance in the current frame.”
As a volume-based technique, the process allows the character’s entire shape to deform accurately in space, managing the same data set throughout the entire performance, whether it’s a 10-second clip or a 5-minute music performance. A character could also respond to dynamic interaction, similar to a ragdoll or soft-body grain simulation.
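The sketch below illustrates the push-volume idea in its simplest form: vertices of the current frame's rest pose are displaced out of a colliding sphere. The sphere collider and the NumPy representation are assumptions for illustration, not Wild Capture's implementation, but they show how a deformation layer can sit on top of the captured performance without altering the underlying data.

```python
# A minimal "push volume" sketch: any vertex of the current volumetric frame
# (the rest pose) that falls inside a collider sphere is pushed to its surface.
import numpy as np

def push_out_of_sphere(rest_positions, centre, radius):
    """rest_positions: (N, 3) vertex positions of the current frame.
    centre: (3,) sphere centre. Returns deformed positions."""
    offsets = rest_positions - centre
    dist = np.maximum(np.linalg.norm(offsets, axis=1, keepdims=True), 1e-8)
    on_surface = centre + offsets / dist * radius    # nearest surface point
    inside = dist < radius
    return np.where(inside, on_surface, rest_positions)
```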
Mesh, Texture and a Uniform Solve
“We develop ways to create a uniform solve of the mesh topology to the structure of the character in a smart way that allows us to edit the texture. That texture output and that uniform mesh are optimised across the entire 3D asset. We can do a lot more with this sorted data than with the original raw data,” Will noted. “The original data is agnostic volumetric 3D mesh-and-texture sequences, still in common formats – OBJ, ABC and PLY files – for the 3D mesh.”
By separating the texture layer from the mesh, the artist can adjust each independently and blend the asset into the environment, or deliver it at a resolution better suited to the required pipeline. Wild Capture calls it a smart character because they have begun using the sorted ‘uniform’ data to generate interactive AI characters triggered through states.
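A rough sketch of what ‘uniform’ buys you, under the assumption that every frame of an OBJ sequence shares the same vertex count and face list: the whole performance can then be packed into a single animated position array while UVs and textures are handled separately. The file naming and the simple OBJ reader are illustrative, not Wild Capture's solver.

```python
# Illustrative: pack a uniform-topology OBJ sequence into one animated array,
# so geometry and texture can be edited and delivered independently.
import glob
import numpy as np

def read_obj_positions(path):
    """Return (N, 3) vertex positions from a simple OBJ file."""
    verts = []
    with open(path) as f:
        for line in f:
            if line.startswith("v "):
                verts.append([float(x) for x in line.split()[1:4]])
    return np.array(verts)

def pack_uniform_sequence(pattern="actor_frame_*.obj"):
    frames = [read_obj_positions(p) for p in sorted(glob.glob(pattern))]
    counts = {f.shape[0] for f in frames}
    if len(counts) != 1:
        raise ValueError(f"Topology is not uniform: vertex counts {counts}")
    # (num_frames, num_vertices, 3): one mesh, animated over time.
    return np.stack(frames)
```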
CEO Will Driscoll, Wild Capture
He said, “The importance of how all of this comes together is that the process of recording volumetric video, and then using our uniform solve, automates animation of the body and face. Our markerless motion capture process attaches a bone structure that allows rigged characters to be re-animated if necessary. But with our process, the character is automatically animated upon delivery.”
Sorted and Synched
The sorted uniform mesh data allows for proper collisions, baking of asset levels of detail, and export and playback. Much more can be done with additional datasets if they can be synced with other software and tools. Once all of the performance data has been lined up and works together, new possibilities become available. For example, bone structures can inform other parts of the body, and higher-frequency details can be obtained with non-RGB cameras and then used for more accurate results.
Also, as data pipelines have widened, Wild Capture’s use of volumetric video with a USD pipeline helps to manage the synchronisation of mesh and texture across a mix of variant versions, and the timing for randomisation. USD also manages the entire dataset per actor or character, allowing a single USD file to drive all assets for each actor.
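As a hedged sketch of that idea using Pixar's USD Python API (the usd-core package), the example below writes one USD file per actor with a time-sampled texture attribute and a ‘look’ variant set. The prim paths, attribute names and texture naming scheme are assumptions for illustration, not Wild Capture's actual schema.

```python
# Sketch: one USD layer per actor, keeping mesh and texture in sync over time
# and exposing look changes as variants rather than duplicated data.
from pxr import Usd, UsdGeom, Sdf

stage = Usd.Stage.CreateNew("actor_A.usda")
stage.SetStartTimeCode(1)
stage.SetEndTimeCode(300)

actor = UsdGeom.Xform.Define(stage, "/ActorA")
body = UsdGeom.Mesh.Define(stage, "/ActorA/Body")

# Time-sampled attribute keeping the texture sequence locked to the mesh frames.
tex = body.GetPrim().CreateAttribute("userProperties:textureFrame",
                                     Sdf.ValueTypeNames.Asset)
for frame in range(1, 301):
    tex.Set(Sdf.AssetPath(f"textures/actorA.{frame:04d}.jpg"), time=frame)

# Variants allow look changes without copying the underlying capture data.
looks = actor.GetPrim().GetVariantSets().AddVariantSet("look")
for name in ["streetwear", "formal"]:
    looks.AddVariant(name)
    looks.SetVariantSelection(name)
    with looks.GetVariantEditContext():
        body.GetPrim().CreateAttribute("userProperties:lookName",
                                       Sdf.ValueTypeNames.String).Set(name)
looks.SetVariantSelection("streetwear")
stage.Save()
```

The resulting actor_A.usda can then be referenced by any number of shots or crowd members, with each consumer choosing its own variant selection.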
Digital Human Platform
Returning to Wild Capture’s new Digital Human Platform, the software on the Platform includes Cohort, a set of crowd creation and digital fashion tools, and the Universal Volumetric Player (UVol), a free open-source codec developed in partnership with virtual world-building companies HyperConstruct and XR Foundation, and used for playback.
“Realistic digital humans in 3D environments that can be navigated by hundreds, even thousands of users simultaneously are the next phase of entertainment and media messaging,” Will said. “Volumetric video adds a depth of immersion to video that is difficult to produce by other means.”
According to Griffin Rowel from HyperConstruct and XR Foundation, people have attempted to create virtual worlds for many years but few of them have gained a following. He believes the problem is due to the content, which he describes as “volumetric digital beings portrayed as entities without a home, and virtual worlds as places with nothing to do and no one to see.”
Seeing Yourself
Wild Capture aims to generate content that engages and is full of potential – in other words, better able to meet viewers’ expectations of what a metaverse can be. Through HyperConstruct’s partnership with Wild Capture, digital humans can feel tangible in the created spaces.
“A person can see themselves existing as a live model with different digital clothing, and see how the cloth moves and how it wears on the body. At the same time, opportunities open up for brands to fit digital humans into everyday workflows,” Griffin said.
The key factor for creators is making the audience feel that they are engaging with a true representation of their own human experience. Realism and authenticity in recreated human experiences are essential when forming connections with a user base, and fortunately, the assets artists are creating today are highly believable in looks and movement, and the bandwidths accessible to consumers are becoming large enough to support them.
Clothing Interaction
The Cohort Crowd Kit customises volumetric human performance libraries. The performances can then be duplicated, customised and randomised, and used to populate virtual worlds. Within the Crowd Kit, the Digital Fashion Toolkit supplies fashion designers – either real or virtual – with a pipeline for lifelike CG cloth based on real-world physics.
The pipeline is realised with its own boundary layer that is rigged to fit the body with lifelike physics. Cohort fills this layer with digital cloth capable of customising itself to the performance and body shape. It can also be used as a VFX layer within a 3D world. It solves the key interactions with volumetric characters by adjusting cloth dimensions according to each unique character model. All digital fabric is created from traditional CG cloth designs, then applied and adjusted for collisions and real-world physics in the virtual world.
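Purely as an illustration of adjusting cloth dimensions to each unique character model – not Cohort's actual solver – the sketch below rescales a garment from a reference body's measurements to a target body's, then offsets it slightly outside the skin in the spirit of the boundary layer described above. The measurement names, scaling rule and offset value are assumptions.

```python
# Illustrative garment fit: scale a garment designed for a reference body to a
# target body's measurements, then push it just outside the skin.
import numpy as np

def fit_garment(garment_verts, ref_measurements, target_measurements,
                skin_offset=0.005):
    """garment_verts: (N, 3) garment vertices, character standing at the
    origin with Y up. Measurements are dicts in metres, e.g.
    {'chest': 0.96, 'waist': 0.82, 'height': 1.78}."""
    girth_scale = np.mean([target_measurements[k] / ref_measurements[k]
                           for k in ("chest", "waist")])
    height_scale = target_measurements["height"] / ref_measurements["height"]
    fitted = garment_verts * np.array([girth_scale, height_scale, girth_scale])
    # Offset vertices outward from the vertical axis so the cloth sits just
    # above the skin, acting as a simple boundary layer.
    radial = fitted[:, [0, 2]]
    norm = np.maximum(np.linalg.norm(radial, axis=1, keepdims=True), 1e-8)
    fitted[:, [0, 2]] += radial / norm * skin_offset
    return fitted
```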
These processes can, in turn, be integrated with digital human assets and libraries that use data pipelines for new media production, app and software development, and web applications. The Cohort Crowd Kit app and plug-in also interacts with world engines, making it possible to pre-render a crowd’s capabilities, complete with volumetric fashions.
In partnership with Savitude, a California clothing design company, Wild Capture has demonstrated the Digital Cloth Tool's cloth render capabilities for Savitude's digital fashion designs, producing volumetric content for a diverse range of digital humans.
Wild Capture was also a part of London Fashion Week, 21 to 28 September, featuring in a metaverse fashion experience from The Immersive KIND, intended to reflect the future of the fashion industry. The event brought together designers, artists and musicians, and Wild Capture produced custom digital human performance models tuned for various fashion designers in an authentic Web 3.0 style. Wild Capture believes their Cohort and Digital Cloth Tool can play a major role in marketing and delivering commercial messages to audiences in a specific, personal way.
USD Interoperability and Variation
Given the commercial world that Wild Capture is targeting, practical considerations emerge continuously and are a priority. The methods Will and Wild Capture have developed for streaming web assets built from volumetric video with USD are fairly recent.
Will said, “The main effect that USD has had on our work, other than the infinite variable options on how to change an asset, is the ability to render out thousands of characters from the same dataset without needing to create the data 1,000 times over.
“Essentially, it streamlines our render process by allowing non-destructive data to create the variants instead of having to render each one from scratch. For a company that works with data sets as large as the ones we render, this ability changes the game.”
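Continuing the hypothetical actor file from the earlier sketch, the example below shows that non-destructive pattern in USD terms: each crowd member only references the shared actor layer and overrides a variant selection and a transform, so the capture data is never duplicated. File names, variant names and the scene layout are assumptions, not Wild Capture's setup.

```python
# Sketch: a crowd layer that references one actor file 1,000 times with only
# lightweight, non-destructive overrides per member.
import random
from pxr import Usd, UsdGeom, Gf

stage = Usd.Stage.CreateNew("crowd.usda")
looks = ["streetwear", "formal"]

for i in range(1000):
    xform = UsdGeom.Xform.Define(stage, f"/Crowd/Actor_{i:04d}")
    prim = xform.GetPrim()
    # Reference the shared capture data; actor_A.usda itself is never edited.
    prim.GetReferences().AddReference("actor_A.usda", "/ActorA")
    # Per-member overrides: look variant and placement.
    prim.GetVariantSets().GetVariantSet("look").SetVariantSelection(
        random.choice(looks))
    UsdGeom.XformCommonAPI(prim).SetTranslate(
        Gf.Vec3d(random.uniform(-50, 50), 0, random.uniform(-50, 50)))

stage.Save()
```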
USD is now core to their Cohort product’s interoperability. They believe that USD will continue to change the path of computer processing and usability for some time, and will be investing in the ongoing development of the language as it relates to volumetric video.
“Our focus now is on developing new ways to translate the nuances of humans to their digital doubles and building familiar workflows for artists to speed up their processes of populating crowds or treating the look of a character,” said Will. “In combination with our Cohort Crowd tools and our Uniform Solve system, we are also working on ways to allow our assets to work more effectively in various render engines.
“Interaction with environments is already achieved pretty well, so we’re focussing on the look side of the integration, by introducing dramatic lighting layouts that usually haven’t worked well with this content.”
Universal Volumetric Player
Wild Capture has also been working at the distribution end of the volumetric pipeline. The Universal Volumetric Player (UVol) is a free, open-source codec and volumetric streaming web player, available for interactive playback of digital humans and other volumetric data for the web, game engines such as Unity and Unreal, and native platforms. The format, following two years of development between Wild Capture, XR Foundation and HyperConstruct, achieves this by optimising mesh compression and combining it with streaming MP4 texture sequences.
A stable, responsive player, it supports long clip times and includes camera control. The current version uses the Corto compression format from the CNR-ISTI Visual Computing Group in Europe, which has fast compression and especially fast decompression characteristics. Users can view captured 3D performances, move the digital camera anywhere in real time, and then, based on high-quality solved assets instead of regular 2D video, compare the takes and make final decisions. This capability gives creators the chance to create complete metaverse environments, for example, and output them on mobile or VR devices.
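The snippet below is not the UVol API; it is only a sketch of the scheduling problem such a player solves, keeping a compressed mesh sequence in step with an MP4 texture timeline by decoding a short distance ahead of the playhead. The frame rate, buffer size and decoder callable are assumptions for illustration.

```python
# Sketch of mesh/texture synchronisation for streamed volumetric playback:
# map video playback time to a geometry frame and keep a small decode buffer.
from collections import OrderedDict

class MeshTextureSync:
    def __init__(self, decode_mesh, geometry_fps=30.0, buffer_size=60):
        self.decode_mesh = decode_mesh      # e.g. a Corto-style decoder callable
        self.geometry_fps = geometry_fps
        self.buffer_size = buffer_size
        self.buffer = OrderedDict()         # frame index -> decoded mesh

    def frame_for_time(self, video_time_seconds):
        """Map the video's current playback time to a mesh frame index."""
        return int(video_time_seconds * self.geometry_fps)

    def mesh_at(self, video_time_seconds):
        idx = self.frame_for_time(video_time_seconds)
        # Decode ahead of the playhead so playback never waits on geometry.
        for i in range(idx, idx + self.buffer_size):
            if i not in self.buffer:
                self.buffer[i] = self.decode_mesh(i)
        # Discard frames already behind the playhead.
        while self.buffer and next(iter(self.buffer)) < idx:
            self.buffer.popitem(last=False)
        return self.buffer[idx]
```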
www.wildcapture.io