Voice-activated control, search, entertainment and other capabilities are building momentum rapidly. This seems inevitable – short of Elon Musk’s direct brain links, the fastest way to communicate intent to a machine is through methods natural to us humans: speech and gestures. And since speech is the richer channel for most of us, there has been significant progress on voice recognition: witness Amazon Echo, Google Home and voice-based commands in your car.
Of course there’s a lot of significant technology behind that capability and CEVA is an important player in making that happen. As one example, when you say “OK Google” into your Galaxy S7, a CEVA DSP core inside a DSPG chip provides the platform to listen for and process that command. According to CEVA, the reason Samsung chose that solution over an implementation in the Snapdragon 820 ultra-low power island was that the CEVA/DSPG implementation is even lower power, allowing for always-on listening, even when the screen is off.
Always-on listening is one of several important factors in making voice-control ubiquitous. CEVA recently hosted a webinar, jointly with Alango Technologies, to provide insight into their solutions in this important space. In (acoustic) near-field applications such as in a smartphone, or even in smart microphones, ultra-low power is obviously important. CEVA promotes the CEVA-TL410 for ultra-low power in always-on near-field applications, such as the voice-sensing application used in the Galaxy S7.
The primary focus of this webinar was on high performance applications such as smart speakers / assistants. Here far-field performance becomes very important, since the device must contend with long distances (10 meters in one example), ambient noise, echoes, reverberation and potentially multiple voices. For these applications CEVA discussed the CEVA-X2. According to Eran Belaish (Product Marketing for audio, voice and sensing), building the speaker part of such a device is relatively straightforward. The complexity comes in building the smart part, which needs sophisticated processing for acoustic echo cancellation and noise reduction, plus beamforming from an array of microphones, to support intelligent voice processing.
Eran broke the audio and sensing part of this solution into stages. First comes voice activity detection (VAD), like the ultra-low power solution mentioned above. This is followed by PDM (pulse-density modulation) to PCM (pulse-code modulation) conversion, also in hardware, and then the real smarts for far-field support: a range of audio/voice functions running on the DSP. With a CEVA-based VAD, you still start with ultra-low standby power, which you’ll see later is an important advantage.
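To make the first two stages a little more concrete, here is a minimal Python sketch. It is not CEVA's implementation, which runs in hardware and was not detailed in the webinar. It assumes the PDM stream is a list of 0/1 bits, uses a crude block average where real hardware would use proper decimation filters, and uses a simple energy threshold for VAD. The function names, the decimation factor of 64 and the threshold are all illustrative assumptions.

```python
def pdm_to_pcm(bits, decimation=64):
    """Convert a 1-bit PDM stream (0/1 values) to PCM samples in [-1, 1].

    Each block of `decimation` bits is averaged (a crude boxcar low-pass
    filter) and the density of ones is mapped to a signed amplitude.
    Any trailing partial block is dropped.
    """
    pcm = []
    for start in range(0, len(bits) - decimation + 1, decimation):
        block = bits[start:start + decimation]
        density = sum(block) / decimation
        pcm.append(2.0 * density - 1.0)
    return pcm


def frame_is_active(frame, threshold=0.01):
    """Energy-based VAD: True if the frame's mean squared amplitude
    exceeds `threshold` (an assumed, untuned value)."""
    if not frame:
        return False
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold
```

The point of the gate is the one made above: only frames flagged as active need to wake the more power-hungry DSP processing.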
The company has an impressive slate of ecosystem partners to provide this functionality (together with their own software of course). Alango presented their software solution for acoustic echo cancellation, beamforming and noise reduction, in their voice enhancement package (VEP) running on the CEVA-X2 platform. All far-field solutions today use multiple microphones for 360° coverage, from as few as two to as many as 8, and more can be supported, so this is where high quality voice processing must start. VEP manages echo cancellation for each microphone, then beamforms to produce as many beams as required from the set of microphones (perhaps 8 beams from 4 microphones) and then optionally performs noise suppression on each beam. These beams are then passed on to the automatic speech recognition (ASR) or keyword recognition (KWR) software.
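The webinar did not say which beamforming algorithm VEP uses, so the following is only a textbook delay-and-sum sketch in Python, shown to illustrate how a small array can yield more beams than microphones. The function names and the idea of a precomputed table of integer sample delays are assumptions made for illustration; echo cancellation and noise suppression are left out.

```python
def delay_and_sum(channels, delays):
    """Steer a microphone array by delaying each channel and averaging.

    channels: list of equal-length sample lists, one per microphone.
    delays:   non-negative integer sample delays, one per microphone,
              chosen so that sound from the target direction lines up.
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        # Shift this channel right by d samples and accumulate.
        for i in range(d, n):
            out[i] += samples[i - d]
    return [v / len(channels) for v in out]


def form_beams(channels, steering_table):
    """Return one beam per delay set in `steering_table`.

    The table can hold any number of directions, so the number of
    beams is independent of the number of microphones.
    """
    return [delay_and_sum(channels, delays) for delays in steering_table]
```

As a small check of the idea, take two microphones where an impulse reaches the second one sample later than the first. Steering delays of [1, 0] line the two copies up and the beam peaks at 1.0; delays of [0, 0] point elsewhere, and the same impulse is smeared into two samples of 0.5.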
Alango presented impressive results of experiments they ran to show improvements in voice trigger recognition rates (as detected by Sensory voice trigger technology for a trigger like “OK Google”) at varying distances and in the presence of noise as the number of microphones increased. Clearly adding more microphones, together with VEP, greatly improves detection accuracy at distance in noisy environments. That’s why the latest revs of Amazon Echo have 7 microphones.
But there’s a problem for existing implementations. Eran talked in the Q&A about the Amazon assistants. Many of these devices are wired – they must connect to a power outlet. This supports always-listening mode but isn’t friendly to portability. Amazon introduced the Tap to offer portability, but portable means battery powered, which requires low power in standby. That is why you had to tap the device to turn it on before it would start listening. Used this way, the battery would last a few months. But tapping isn’t very convenient either, so Amazon released a software update which eliminated the need for a tap – the device was always listening. Unfortunately battery life dropped to 8 hours!
DSPG (whose ultra-low power solution is based on CEVA, see above), working with another partner (Vesper, for microphones), demonstrated that they could replace the tap detector with the always-on KWR solution described above, running of course off the same device battery. Battery life shot back up to 3 months. This is impressive; in effect, always-on KWR using this technology consumes negligible power compared to power consumption during active use.
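A quick back-of-the-envelope calculation shows how large that gap is. The sketch below uses only the two battery-life figures quoted above (8 hours and 3 months); the 2000 mAh capacity is a hypothetical value picked to make the averages tangible, not a figure from the webinar, and "3 months" is approximated as 90 days.

```python
HOURS_PER_DAY = 24


def average_current_ma(capacity_mah, battery_life_hours):
    """Average current draw implied by a battery capacity and lifetime."""
    return capacity_mah / battery_life_hours


def lifetime_ratio(long_life_hours, short_life_hours):
    """How many times lower the average draw is for the longer-lived case,
    assuming the same battery in both."""
    return long_life_hours / short_life_hours


capacity_mah = 2000.0                     # hypothetical battery capacity
software_listening_hours = 8.0            # always-listening via software update
hardware_kwr_hours = 90 * HOURS_PER_DAY   # "3 months" taken as 90 days

ratio = lifetime_ratio(hardware_kwr_hours, software_listening_hours)
draw_software = average_current_ma(capacity_mah, software_listening_hours)
draw_hardware = average_current_ma(capacity_mah, hardware_kwr_hours)
```

With these numbers the ratio works out to 270, and for the hypothetical 2000 mAh battery the average draw falls from 250 mA to a little under 1 mA. The exact currents depend entirely on the assumed capacity; the ratio does not.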
There’s a lot more you can learn about in the webinar. CEVA have a demo unit, and there was discussion of voice isolation (differentiating between different speakers speaking at the same time but at different locations), voice biometrics / voice-printing and many other topics requiring more AI, natural language recognition, and perhaps more sensor fusion to combine vision recognition with voice/speech recognition to refine inferences. Eran noted that advances here are being worked on, at CEVA and other places, but aren’t commercially available today. Still, all of this points very much to Eran’s opening position about the future of voice in electronics – it’s very bright!