There are many “next big thing” possibilities these days in tech but mostly behind the scenes; few are front and center for us as consumers. That is until smart speakers started taking off, led by Amazon Echo/Dot and Google Home. The intriguing thing about this technology is our ability to control stuff without needing to type/tap/click on a device. We can be hands-free, working on something which requires most of our attention yet still get useful information (Alexa, where’s the oil drain plug on a 2014 Honda Civic?). Or we can simply be lazy (Alexa, play Guardians of the Galaxy). Don’t underestimate the power of convenience – a good part of the American economy runs on it. Maybe this is just another fad but it feels to me like a fundamental shift in user experience (UX).
These speakers are a wonderful example of convergence of multiple technologies to create a new user application, which after all is what is driving much of the current tech boom. I don’t think I’ve found another company that explains the landscape, shortfalls, opportunities and sheer entertainment value of smart speakers as well as CEVA. Naturally they have a vested interest – they provide a lot of the IP behind the first stages of voice recognition – but they also clearly have a good perspective on the larger space.
Moshe Sheier (Dir Strategic Marketing) kicks off with a review of current and emerging smart speakers. We already know about Amazon Echo (the strong market leader) and Google Home (about a quarter of the market), with crumbs split between the likes of LG, Harman Kardon and Lenovo. However, Apple has now entered the race with HomePod and Microsoft is planning to enter with Harman Kardon. Amazon will also license Alexa to non-Amazon devices. Meanwhile in China, Alibaba, Baidu and others are marketing their own devices. An interesting feature in the Baidu device is a screen (also recently introduced by Amazon) and a camera to track and identify faces to authenticate purchase requests.
Moshe points out that devices look very similar, though with varying numbers of microphones (affecting far-field performance) and speakers, with bigger differences in the AI behind the speakers for voice recognition, interpretation, conversational skills, etc. This is where big players are likely to have a significant advantage and perhaps why Amazon is eager to license Alexa – the biggest revenue-maker may be the one who has the most extensive AI knowledge-base, rather than hardware. But that still points to an opportunity for hardware makers to expand voice recognition to many more domains – application specific hardware solutions, perhaps with some local intelligence, while leveraging big AI platforms in the cloud for detailed recognition and response.
Eran Belaish (PMM for audio/voice at CEVA) opens with a reminder of why voice-based control can be essential (Samsung Gear and GoPro Hero are good examples) and an entertaining look at Alexa conversational skills and Easter eggs (the kind of thing we enjoyed in the early days of Siri). He also notes that Gatebox from Japan has added a holographic character to give a further human (?) touch to interaction. Eran goes on to talk more about the component technology in and behind the speakers, which I touched on in an earlier blog. This includes voice activation and high-quality detection, supported by adaptive beam-forming (which is why more microphones are better for far-field detection), which can also support speaker tracking and separation when multiple people are speaking. Acoustic echo cancellation is another important component; the smart speaker must be able to distinguish spoken commands while it is playing music, as well as handle speech reflections beyond the primary source.
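To make the "more microphones are better" point concrete, here is a minimal sketch of delay-and-sum beam-forming, a fixed and much simpler relative of the adaptive beam-forming Eran describes. It is not CEVA's method. The function names, the sine-wave "voice", the integer sample delays and the noise level are all assumptions chosen for illustration. The idea is that once each channel is time-aligned to the talker, averaging keeps the voice intact while independent noise on each microphone partly cancels.

```python
import math
import random

def simulate_mics(n_mics, delays, n_samples=2000, noise_std=1.0, seed=0):
    """Return (clean, channels): a sine 'voice' plus independent noise per mic.

    delays[i] is the (assumed, integer) sample delay at which the source
    reaches microphone i.
    """
    rng = random.Random(seed)
    clean = [math.sin(2 * math.pi * 0.01 * t) for t in range(n_samples)]
    channels = []
    for m in range(n_mics):
        d = delays[m]
        channel = []
        for t in range(n_samples):
            source = clean[t - d] if t - d >= 0 else 0.0
            channel.append(source + rng.gauss(0.0, noise_std))
        channels.append(channel)
    return clean, channels

def delay_and_sum(channels, delays):
    """Undo each microphone's delay, then average the aligned channels."""
    usable = len(channels[0]) - max(delays)
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(usable)
    ]

def noise_power(output, clean):
    """Mean squared error between the beam-former output and the clean source."""
    return sum((o - c) ** 2 for o, c in zip(output, clean)) / len(output)

if __name__ == "__main__":
    for n_mics in (1, 2, 8):
        delays = [2 * i for i in range(n_mics)]  # hypothetical array geometry
        clean, channels = simulate_mics(n_mics, delays)
        residual = noise_power(delay_and_sum(channels, delays), clean)
        print(f"{n_mics} mic(s): residual noise power {residual:.3f}")
```

In this toy setup the residual noise power falls roughly in proportion to the number of microphones. A real device also has to estimate the delays (that is, work out where the talker is) and adapt as people move, which is where the adaptive part comes in.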
Eran then adds an enlightening and entertaining review to put bounds on the interpretation of “smart” in smart speakers. First up is the familiar ad-placement problem. Proprietary solutions will naturally default to promoting sales through their own sites; Amazon and Alibaba are obvious potential offenders here. Perhaps litigation in Europe over search engine and Android biases will set guidelines before this becomes problematic. He follows with a fun video of Echo and Google Home trapping each other in an infinite loop. Even more entertaining, a TV news piece on a 6-year-old saying “Alexa, buy me a dollhouse” managed to share the problem with those in the audience who had Amazon devices, becoming probably the first viral propagation through a TV news show.
Eran points out some other limitations. As we quickly learned with Siri, the AI behind voice recognition does a good superficial job of mimicking intelligence but it’s not difficult to trip up. Getting to true natural conversation or even understanding is still a long way off. On a different topic he talks about the need for hands-free interfaces especially in the context of battery-driven devices (I also talked about this in my earlier blog). This may seem odd at first – after all smart speakers are intrinsically hands-free, but even when they’re doing nothing they’re burning power listening for commands. In a battery-based device, this isn’t a good idea. Amazon’s Tap was an attempt to fix this but you still have to tap the device first. A better approach should use close-to-zero power listening technology to handle voice-based activation. He gets into this more in the next link.
Eran continues with a review of futures in voice recognition. This starts with that near-zero power voice-activation. A startup microphone company called Vesper, working with the DSP Group, has demonstrated a voice-activation system which burns as little as 1/100th of the power of a standard microphone in standby. A second possibility he mentions is the potential for biometric identification through voiceprints. This would have obvious value when making purchases but can also be useful to discriminate between multiple potential users (Alexa, what’s my schedule today?). Eran suggests this capability, at some level, may be closer than we think.
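For a feel of what discriminating between users by voiceprint might involve, here is a toy sketch. It assumes each utterance has already been reduced to a fixed-length, non-zero feature vector by some front end that is not shown. The vectors, names, threshold and the cosine-similarity matching rule are all hypothetical and do not describe any vendor's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(utterance, enrolled, threshold=0.8):
    """Return the best-matching enrolled user, or None if nobody is close enough."""
    best_name, best_score = None, -1.0
    for name, voiceprint in enrolled.items():
        score = cosine(utterance, voiceprint)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Hypothetical enrolled voiceprints (real ones would have many more dimensions).
enrolled = {
    "alice": [0.9, 0.1, 0.2],
    "bob": [0.1, 0.8, 0.5],
}

if __name__ == "__main__":
    print(identify([0.85, 0.15, 0.25], enrolled))  # close to alice's voiceprint
    print(identify([0.5, -0.5, 0.5], enrolled))    # matches nobody well
```

The threshold is the interesting design choice: for "what's my schedule" a loose match is probably acceptable, while a purchase request would call for a much stricter one, or a second factor.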
He closes with a couple of other must-haves. Memory support is essential for moving closer to natural understanding. He cites as an example asking what beer he ordered last month for an event. The smart speaker has to search back in time for when you ordered that beer. A trickier case in my view, also requiring ambiguity resolution, would be something like “Alexa, where did I park the car?” followed by “Are the windows open?”. He wraps up with a discussion on emotion detection through voice analysis and ultimately also computer vision. BeyondVerbal has a product in this space which aims to help smart speakers tune their understanding to the emotion content of what you say. This is a very interesting domain – Siri and Alexa teams are already working with this company.
My suggestion – follow CEVA on voice and audio. These guys are plugged into the trends and obviously are developing lots of technology to support this direction.