Better Speech Recognition by Reducing Babble
by Bernard Murphy on 11-17-2020 at 6:00 am

I’ve become a bit of a connoisseur of voice-based control, so when Chris Rowen gave a pitch on Babble Labs at Arm Dev Summit last month, I wanted to listen in. Chris was the CEO of Babble Labs, recently acquired by the Cisco Webex group, where he’s now listed as VP Engineering of the Voice Technology Group. Expect to see this technology appearing at some point on a conference call near you. The acquisition also firmly establishes Chris as a serial entrepreneur; it will be interesting to see what he tries next. He’s certainly impressed Cisco with better speech recognition.


AI-based transcription

One type of voice-based control is transcription – converting speech to text. Cloud-hosted transcription services, e.g. from IBM Watson, need input speech with little background noise. I use such a service to transcribe interviews; it does a good enough job for my purposes, with maybe ~80% accuracy. But what if you want to use voice-based control in a noisy environment? First responders and health-care workers can’t depend on a low-noise setting. If you’re in a restaurant, shopping, walking through a city, or in a busy home or office, you can’t control the noise. You don’t need to transcribe long emails or book chapters in these environments, but basic control through voice commands is highly desirable – particularly for emergency responders whose hands are otherwise occupied.
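For illustration, here’s roughly what that transcription workflow looks like – a minimal sketch using the ibm-watson Python SDK. The API key, service URL and file name are placeholders; this reflects my own usage pattern, nothing Babble Labs specific:

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and endpoint - substitute your own.
stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

# Send a recorded interview and print each recognized phrase with
# the service's confidence score (where my ~80% experience shows up).
with open("interview.wav", "rb") as audio:
    result = stt.recognize(
        audio=audio,
        content_type="audio/wav",
        model="en-US_BroadbandModel",
    ).get_result()

for segment in result["results"]:
    best = segment["alternatives"][0]
    print(best["transcript"], best.get("confidence"))
```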

Noise mitigation

Mitigation starts with voice pickup. Audio zoom is becoming popular: detect where the speaker is and use beamforming to zoom in on that speaker while suppressing audio from other sources. Then there’s acoustic echo cancellation – in a room, audio signals echo off walls and furniture, and this technique detects and suppresses the delayed copies. Adaptive noise cancellation is another trick, using inward-facing and outward-facing mics to guide generation of a cancelling waveform. It’s primarily valuable in headphones and earphones, or in a car cabin, and it works well for road noise, train noise, fans and equipment rumble – noises which are relatively low frequency and fairly steady. Chris calls this stationary noise. Stationary noise cancellation doesn’t work so well for a baby crying, a dog barking, somebody typing or background speech, as anyone who has used noise-cancelling headphones will know. He calls this dynamic noise.
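For the curious, the classic algorithm behind that last technique is the LMS adaptive filter. Here’s a minimal numpy sketch; the mic roles, filter order and step size are illustrative, not any vendor’s implementation:

```python
import numpy as np

def lms_cancel(primary, reference, order=32, mu=0.01):
    """Adaptive noise cancellation with the classic LMS filter.

    primary   -- mic signal containing speech + noise (e.g. inward-facing mic)
    reference -- mic signal dominated by the noise (e.g. outward-facing mic)
    Returns the error signal, which converges toward the clean speech.
    mu must be small relative to the reference signal power for stability,
    which is why this works best on steady, "stationary" noise.
    """
    w = np.zeros(order)                       # adaptive filter taps
    out = np.zeros(len(primary))
    for n in range(order, len(primary)):
        x = reference[n - order:n][::-1]      # most recent reference samples
        noise_estimate = w @ x                # filter's guess at the noise
        e = primary[n] - noise_estimate       # subtract it from the mic signal
        w += 2 * mu * e * x                   # LMS tap update
        out[n] = e
    return out
```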

Mitigating babble noise

This kind of noise is common – much more common than those quiet transcription settings – at an accident scene, in an emergency room and so on. And these noises fall squarely in the frequency range of human speech, so if you simply filter them out, you’ll also lose the voice of the person giving commands. Handling them takes a different approach.

Chris mentioned a shift in command recognition, once speech is captured: rather than recognizing phonemes, then words, then command structure, directly recognize full commands. My impression is that this is already fairly widely applied at the edge. The probabilistic nature of deep learning recognition gives a higher probability of recognizing a full command directly than of getting every step right in a chain of separate recognition stages.
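To make the idea concrete, here’s a toy PyTorch sketch of direct command classification – one network mapping a whole utterance to a command label, skipping the phoneme-and-word pipeline. The command set and architecture are hypothetical, not Babble Labs’ model:

```python
import torch
import torch.nn as nn

# Hypothetical fixed command vocabulary for an edge device.
COMMANDS = ["call dispatch", "mute mic", "send location"]

class CommandNet(nn.Module):
    """Classify a whole utterance directly into one of N commands."""
    def __init__(self, n_commands=len(COMMANDS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # pool over the whole utterance
        )
        self.classifier = nn.Linear(32, n_commands)

    def forward(self, spec):                  # spec: (batch, 1, n_mels, frames)
        return self.classifier(self.features(spec).flatten(1))

model = CommandNet()
logits = model(torch.randn(1, 1, 40, 100))    # dummy log-mel spectrogram
print(COMMANDS[logits.argmax().item()])       # one shot, no intermediate steps
```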

More importantly, Babble Labs has put a lot of work into recognizing “babble” – those background noises that interfere with commands. This they also do through learning, and they are able to recognize commands against a much higher background of noise than transcription engines can handle. Chris showed several examples. As a reference, at 20 dB signal to noise – the kind of quiet environment recommended for transcription – IBM Watson recognized a set of commands just as well as Babble Labs. At 4 dB, Watson got only one command right while Babble Labs still performed perfectly. At 0 dB, Watson got nothing right and Babble Labs still got all commands right. Only at -3 dB did Babble Labs start to get some commands wrong.
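For context on what those numbers mean, mixing clean speech with babble at a target SNR is a standard construction; this sketch (assuming equal-length mono signals) shows it. At 0 dB the babble carries as much power as the speech itself, and at -3 dB roughly twice as much:

```python
import numpy as np

def mix_at_snr(speech, babble, snr_db):
    """Return speech + babble scaled to a target signal-to-noise ratio.

    SNR(dB) = 10 * log10(P_speech / P_noise), so 0 dB means the babble
    is as powerful as the speech and -3 dB means about twice as powerful.
    """
    babble = babble[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_babble = np.mean(babble ** 2)
    # Scale the babble so the mixture hits the requested SNR.
    gain = np.sqrt(p_speech / (p_babble * 10 ** (snr_db / 10)))
    return speech + gain * babble

# e.g. the 4 dB test condition from Chris's demo:
# noisy = mix_at_snr(clean_command, cafe_babble, snr_db=4)
```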

Richard Burton of Arm followed with a discussion of running the Babble Labs capability on a variety of Arm platforms. He demonstrated stats from a general-purpose Cortex-M7 all the way up to the Ethos-U55, showing a 160X speedup.
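The talk didn’t go into the deployment flow, but as a hedged illustration: Ethos-U class NPUs execute fully int8-quantized TensorFlow Lite models, so a standard conversion looks something like this (the tiny Keras model and random calibration data here are stand-ins for a real trained network and recorded audio features):

```python
import numpy as np
import tensorflow as tf

# Stand-in for a trained command-recognition model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(40, 100, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3),
])

def representative_data():
    # Calibration samples used to pick quantization ranges;
    # a real deployment would feed recorded spectrograms here.
    for _ in range(10):
        yield [np.random.rand(1, 40, 100, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
# Force full int8: anything left in float falls back to the CPU
# (e.g. the Cortex-M7 in the comparison above) instead of the NPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
open("command_net_int8.tflite", "wb").write(converter.convert())
```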

Cool stuff. I’d point you to the Babble Labs website, but it’s probably more useful now to point you to the Cisco press release.
