Once you can rig the meshes to drive each other, you could wear 'masks' of other people's faces (or of critters).
You might be interested in project HeadOn at TU München:
Justus Thies gave a presentation at our university about a year ago. IIRC they don't use any fancy NN stuff; instead they extract the face geometry with a stereo camera and use interpolation to project the movement onto a target mesh. Applications such as VR meetings with stereo goggles were discussed during the presentation, but the main focus was of course on entertaining the audience with fake videos.
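As a rough illustration of what that interpolation step could look like (this is my own sketch, not their actual pipeline; the shared blendshape basis and all the names here are assumptions), expression transfer can be written as a linear blend of the target mesh's expression offsets, weighted by coefficients estimated from the source actor:

    # Minimal sketch of interpolation-based expression transfer, assuming
    # source and target meshes share a common expression (blendshape) basis.
    import numpy as np

    def transfer_expression(target_neutral, target_blendshapes, source_coeffs):
        # target_neutral:     (V, 3) vertices of the target's neutral face
        # target_blendshapes: (K, V, 3) per-expression vertex offsets
        # source_coeffs:      (K,) expression weights from the source actor
        offsets = np.tensordot(source_coeffs, target_blendshapes, axes=1)
        return target_neutral + offsets  # deformed target mesh, shape (V, 3)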
On a side note: This is probably the closest to a real-life Max Headroom that we have so far. Not sure if that influenced the name.
At some point you could probably reconstruct large portions of scenes in existing movies and change the perspective, especially with good techniques (perhaps AI-based) for filling occlusions in the data.
Can anyone explain why people would use text-to-speech for something like this when they have perfectly good voices themselves?
To give an example, 15-kun recently built on the Pony Preservation Project, using neural nets to voice-clone (among others) My Little Pony voices and offering it as a service: https://fifteen.ai/ People have used it for all sorts of things: https://www.equestriadaily.com/2020/03/pony-voice-event-what... Suppose you want to do, say, an F1 commentary on the Austrian GP 2019 (#4): why do it with your own voice if you can do it with Fluttershy's?
This will be the next evolution of streamers, especially Virtual Youtubers and their ilk.
This depends on how well you can tolerate speech errors. Most listeners will gloss over them, preferring the human voice to a speech synthesizer without really noticing the errors.
Personally, I'll take the human voice unless you literally cannot speak (e.g. due to a disability) or feel uncomfortable doing so.
I like a real voice too, but if you were generating lots of videos in multiple languages, I could see abstracting the voice away and generating the audio dynamically. This is part of the reason video containers are separated into components/layers (video and audio tracks). I don't see why you couldn't read the subtitle data and generate the audio dynamically for the chosen language. Some of this probably already happens somewhere by some group. Just an idea that I found interesting, similar to composing documents with LaTeX: think of the audio as a "presentation" layer, the way many visual frameworks separate content from presentation. It's especially useful for videos where the speaker isn't visible, so syncing audio with lip movements across languages isn't a problem.
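A minimal sketch of that idea, assuming one pre-translated SRT subtitle file per language and using the gTTS library for synthesis (the file names and helper functions are hypothetical, not any existing pipeline):

    # Sketch: treat subtitles as the source of truth and synthesize one
    # audio clip per cue, per language.
    from gtts import gTTS

    def parse_srt(path):
        # Yield (cue number, timing line, text) for each cue in an SRT file.
        with open(path, encoding="utf-8") as f:
            blocks = f.read().strip().split("\n\n")
        for block in blocks:
            lines = block.splitlines()
            if len(lines) >= 3:
                # lines[0] = cue number, lines[1] = "start --> end" timing
                yield lines[0], lines[1], " ".join(lines[2:])

    def synthesize_track(srt_path, lang, out_prefix):
        # Generate one MP3 per subtitle cue in the requested language.
        for index, _timing, text in parse_srt(srt_path):
            gTTS(text=text, lang=lang).save(f"{out_prefix}.{lang}.{index}.mp3")

    # Hypothetical usage: one video, two dynamically generated audio layers.
    synthesize_track("lecture.en.srt", "en", "audio")
    synthesize_track("lecture.de.srt", "de", "audio")

The clips would then still need to be placed at each cue's start time and muxed into the container as separate audio tracks (e.g. with ffmpeg), which is exactly where the track/layer separation pays off.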
Also, it's much simpler to make changes to an already-published video: using your original voice would require re-recording with a high-quality microphone and post-processing to remove background noise.
Although I can get rid of it if I focus.
Is this really happening with a "deep network in the browser"? It looks like the inference happens on a server, and only the 3D result is viewed in the browser.
This effect (the hollow-face illusion) is well-studied and also happens with a real physical mask viewed from the inside: