I enjoy talking with CEVA because they work on such interesting consumer products (among other product lines). My most recent discussion was with Seth Sternberg (Sensors and Audio software at CEVA), on spatial or 3D audio. The first steps toward a somewhat immersive audio experience were stereo and surround sound, which place sound sources around the listener. That is a little better than mono audio, but your brain still interprets the sound as coming from inside your head and fixed to it, because important cues like reverb, reflection, and timing differences at each ear are missing. 3D audio recreates those cues, so the brain perceives the sound source as outside your head. The source is still fixed to your head, however: turn your head to the left and the band moves to the left, turn to the right and the band moves to the right. Connecting head movements to the audio corrects this last problem, fixing the sound source in place. When you move your head you hear a change in the same way you would in the real world. This might seem like a nice-to-have, but it has major implications for user experience and for reducing the fatigue induced by lesser implementations.
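To make the head-tracking idea concrete, here is a minimal sketch of how a renderer might keep a source fixed in the room. It is my own illustration, not CEVA's implementation. It subtracts the head yaw from the source direction, then estimates the timing difference between the ears with the textbook Woodworth spherical-head approximation. The function names, the head radius, and the sign convention (positive angles to the listener's right) are all assumptions made for the example.

```python
import math

HEAD_RADIUS_M = 0.0875       # assumed average head radius (illustrative value)
SPEED_OF_SOUND_M_S = 343.0   # speed of sound in air at roughly 20 C


def wrap_degrees(angle):
    """Wrap an angle in degrees to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Direction of a room-fixed source as seen from the rotated head.

    Assumed convention: positive angles are to the listener's right.
    """
    return wrap_degrees(source_azimuth_deg - head_yaw_deg)


def interaural_time_difference(azimuth_deg):
    """Woodworth estimate of the interaural time difference in seconds.

    Positive means the sound reaches the right ear first. Rear directions
    are mirrored onto the front hemisphere, where the formula applies.
    """
    az = wrap_degrees(azimuth_deg)
    if abs(az) > 90.0:
        az = math.copysign(180.0 - abs(az), az)
    theta = math.radians(az)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))


if __name__ == "__main__":
    # The band is straight ahead; the listener turns 30 degrees to the right.
    az = relative_azimuth(0.0, 30.0)
    print("relative azimuth (deg):", az)
    print("ITD (ms):", interaural_time_difference(az) * 1000.0)
```

With the band straight ahead and the head turned 30 degrees to the right, the relative azimuth becomes -30 degrees: the band now sits to the listener's left, and the estimated timing difference goes negative (left ear first). A head-locked renderer skips the subtraction, which is why the band appears to follow your head.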
Why should we care?
Advances in this domain tap large markets, especially gaming (~$300B), and gaming drives more than just game sales. If you doubt gaming is important, remember that last year gaming led NVIDIA revenues and is still a major contributor. As a further indicator, the headphones/earphones market is already above $34B and is expected to grow to $126B by 2030. Apple and Android 13 provide proprietary spatial audio solutions for music and video services, and these are already attracting significant attention. According to one reviewer there are already thousands of Apple Music songs encoded for 3D. Samsung calls their equivalent 360 Audio, which works with their Galaxy Buds Pro and content encoded for Dolby Atmos (also supported by Apple’s Spatial Audio). Differentiating on the user audio experience is a big deal.
The music option is interesting but I want to pay special attention to gaming. Given an appealing game, the more immersive the experience the more gamers will be drawn to that title. This depends in part on video action of course, but it also depends on audio that is well synchronized with the video, both in time and in player pose. You want to tell the difference between footsteps behind you and footsteps in front. When you turn your head to confirm, you expect the audio to track with your movement. If you look up at a helicopter flying overhead, the audio should track. Anything less will be unsatisfying.
Though you may not notice at first, poor synchronization in timing and pose can also become tiring. Your brain tries to make sense of what should be correlated visual and audible stimuli. If these don’t correspond, it must work harder to make them align. An immersive experience should enhance excitement, not fatigue, and game makers know it. Incidentally, long latencies and position mismatch between visual and audio stimuli are also thought to be a contributing factor in Zoom fatigue. Hearing aid wearers watch a speaker’s lips for clues to reinforce what they are hearing; they also report fatigue after extended conversation.
In other words, 3D audio is not a nice-to-have. Product makers who get this right will crush those who ignore the differentiation it offers.
To encode or not to encode
In the early days of surround sound, audio from multiple microphones was encoded in separate channels, ultimately decoded to separate speakers in your living room. Then “up-mixing” was introduced, using cues from the audio to infer a reasonable assignment of source directions to support 5.1 or 7.1 surround sound. This turns out to be a pretty decent proxy for pre-encoding and is certainly much cheaper than re-recording and encoding original content in multiple channels. If richer source material is available, such as stereo, true 5.1 or 7.1, or ambisonics, 3D audio should start with that. Otherwise up-mixing provides a way for 3D audio to deliver a good facsimile of the real thing.
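As a rough illustration of the up-mixing idea, the sketch below uses classic sum/difference matrixing: content common to both stereo channels is steered to a center feed, and content that differs between them is steered to a surround feed. This is a deliberately simple stand-in I chose for the example. Commercial up-mixers are far more sophisticated, and nothing here describes how CEVA does it. The function name and the output channel names are hypothetical.

```python
def passive_upmix(left, right):
    """Derive center and surround feeds from stereo samples.

    Simple sum/difference matrixing, for illustration only:
    - center carries what the two channels have in common
    - surround carries what differs between them
    """
    if len(left) != len(right):
        raise ValueError("left and right must have the same number of samples")
    center = [(l + r) / 2.0 for l, r in zip(left, right)]
    surround = [(l - r) / 2.0 for l, r in zip(left, right)]
    return {
        "left": list(left),
        "right": list(right),
        "center": center,
        "surround": surround,
    }


if __name__ == "__main__":
    # A voice panned dead center appears identically in both channels.
    voice = [0.5, -0.25, 0.125]
    feeds = passive_upmix(voice, voice)
    print("center:", feeds["center"])
    print("surround:", feeds["surround"])
```

A sound mixed identically into both channels lands entirely in the center feed, while out-of-phase ambience lands in the surround feed. That is the kind of cue an up-mixer uses to guess where a source belongs.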
The second consideration is where to render the audio: on the phone/game station or in the headset. This is relevant to head tracking and latency. Detecting head movements obviously must happen in the headset, but most commonly the audio rendering is handled in the phone/gaming device. Sending head movement information back from the headset to the renderer adds latency on top of rendering. This round trip over Bluetooth can add 200-400 milliseconds, a very noticeable delay between visual and audible streams. Apple has some proprietary tricks to work around this issue, but these are locked into an Apple-exclusive ecosystem.
The ideal and open solution is to do the audio rendering and motion detection in the headset for minimal total latency.
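A back-of-the-envelope calculation shows why that round trip matters. While the head is turning, the rendered source lags behind by the turn rate multiplied by the latency. The sketch below applies that to the 200-400 millisecond range quoted above. The head turn rate of 100 degrees per second and the 20 millisecond comparison figure are my own illustrative assumptions, not measurements from CEVA or anyone else.

```python
def angular_error_deg(head_speed_deg_per_s, latency_ms):
    """Degrees by which the rendered source lags the head during a turn."""
    return head_speed_deg_per_s * latency_ms / 1000.0


if __name__ == "__main__":
    ASSUMED_HEAD_SPEED = 100.0  # deg/s, hypothetical moderate head turn

    # 200 and 400 ms come from the Bluetooth round-trip range in the text;
    # 20 ms is a hypothetical figure for rendering inside the headset.
    for latency_ms in (200.0, 400.0, 20.0):
        error = angular_error_deg(ASSUMED_HEAD_SPEED, latency_ms)
        print(f"{latency_ms:.0f} ms latency -> {error:.0f} degrees behind")
```

Under these assumptions the sound trails the head by 20 to 40 degrees over the Bluetooth round trip, easily enough to put footsteps in the wrong place, while the hypothetical in-headset case trails by about 2 degrees.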
The RealSpace solution
In May of this year, CEVA acquired the VisiSonics spatial audio business. They have integrated this with the CEVA MotionEngine software for dynamic head tracking, providing precisely the solution defined above. They also provide plugins for game developers who want to go all the way to delivering content fully optimized for 3D audio. The product is already integrated in chips from a couple of Chinese semiconductor companies and in a recently released line of hearables in India. Similar announcements are expected in other regions.
Very cool technology. You can read about the acquisition HERE, and learn more about the RealSpace product HERE.