“Speech Recognition” is one of those instantly recognizable terms that seem simple but actually cover many different features collected under the same umbrella. Keyword spotting, language translation, and speech transcription are all examples, yet each works differently, serves different applications and goals, and has significantly different resource requirements.
Endpoint AI and Speech Recognition
Speech-based user interfaces are highly desirable for tiny endpoint devices such as smartwatches and earbuds, where other ways of interfacing are difficult or impossible.
Nowadays, most of these platforms already have a microphone, making speech a natural way to interact with them. As AI developers, our job is to make the speech interface useful and natural for the end-users.
Endpoint AI is all about processing as much as possible at the endpoint to improve responsiveness, privacy, and power efficiency. AI is computationally intense, so traditionally, any ‘heavy lifting’ was left to the cloud. Endpoint devices use AI to listen for a specific speech pattern to ‘wake up,’ but beyond that they have to send all speech to the cloud for processing and transcription. This round trip is slow, power-intensive, and raises privacy concerns. Ambiq’s ultra-low-power capabilities mean that we can offer sophisticated AI right at the endpoint, reducing or eliminating the need to talk to the cloud.
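To make that wake-up loop concrete, here is a minimal sketch in Python, assuming a pre-trained keyword-spotting model exported to TensorFlow Lite. The file name kws.tflite, the raw one-second input format, the 0.9 threshold, and the wake-word class index are all hypothetical; a real endpoint device would run something similar in C with TensorFlow Lite Micro.

```python
import numpy as np
import sounddevice as sd
from tflite_runtime.interpreter import Interpreter

SAMPLE_RATE = 16000
WINDOW = SAMPLE_RATE   # assume the model scores one-second raw-audio windows
WAKE_INDEX = 2         # hypothetical index of the wake-word class in the output

interpreter = Interpreter(model_path="kws.tflite")  # hypothetical model file
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    while True:
        audio, _ = stream.read(WINDOW)  # blocks until one second of audio arrives
        samples = audio[:, 0].astype(np.float32).reshape(inp["shape"])
        interpreter.set_tensor(inp["index"], samples)
        interpreter.invoke()
        scores = interpreter.get_tensor(out["index"])[0]
        if scores.argmax() == WAKE_INDEX and scores.max() > 0.9:
            print("wake word detected")  # only now spend power on heavier processing
```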
Superficial Speech Recognition vs. Understanding Language
It is useful to separate speech AI algorithms into those that process speech superficially, and those that are based on an understanding of language. Superficial AI algorithms treat speech as audio – they don’t understand what you are saying, only what it sounds like. Deep algorithms start here, too, recognizing blobs of sound (phonemes) that form words, but they take it a step further, using a language model to fit those blobs into words that make sense in the larger context of your speech.
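As a rough illustration of the ‘sounds only’ stage that both kinds of algorithms share, here is a toy sketch that maps a recognized phoneme sequence onto words using a small pronunciation table. The simplified ARPAbet-style labels and the table itself are invented for the example; note there is no notion of meaning anywhere.

```python
# Map phoneme chunks to words by lookup alone - no meaning, just sounds.
PRONUNCIATIONS = {
    ("T", "ER", "N"): "turn",
    ("AA", "N"): "on",
    ("L", "AY", "T", "S"): "lights",
}

def phonemes_to_words(phonemes):
    words, i = [], 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):  # longest match first
            chunk = tuple(phonemes[i:i + length])
            if chunk in PRONUNCIATIONS:
                words.append(PRONUNCIATIONS[chunk])
                i += length
                break
        else:
            i += 1  # skip sounds we cannot match
    return words

print(phonemes_to_words(["T", "ER", "N", "AA", "N", "L", "AY", "T", "S"]))
# ['turn', 'on', 'lights']
```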
Superficial AI algorithms perceive speech very much as a dog would: a dog does not know the meaning of words or phrases, but it still responds to ‘fetch the ball!’. A dog will understand that you want to play fetch for many variations of the phrase (being an AI nerd, I’ve explored just how far I can stray from a fixed phrase when playing with my dogs, and it turns out it’s pretty darned far). Likewise, a superficial AI algorithm does not understand what any of the words in “turn on the kitchen lights” mean, but it can deduce the intent of the phrase.
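In code, that kind of superficial intent deduction can be as simple as checking which keywords were spotted, regardless of how the phrase was worded. This sketch assumes an upstream keyword spotter that emits the keywords it heard; the keyword sets and intent names are hypothetical.

```python
# Each intent fires when all of its required keywords were spotted,
# no matter what other words surrounded them.
INTENTS = {
    frozenset({"lights", "on"}): "LIGHTS_ON",
    frozenset({"lights", "off"}): "LIGHTS_OFF",
    frozenset({"fetch", "ball"}): "PLAY_FETCH",
}

def deduce_intent(spotted_keywords):
    heard = set(spotted_keywords)
    for required, intent in INTENTS.items():
        if required <= heard:  # required keywords are a subset of what we heard
            return intent
    return None

# "turn on the kitchen lights" and "kitchen lights on, please" both map to
# the same intent, because only the spotted keywords matter.
print(deduce_intent(["turn", "on", "kitchen", "lights"]))  # LIGHTS_ON
```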
Deeper language-understanding algorithms add a ‘language model’ to this superficial approach. Language models are trained on large text sources such as Wikipedia and news articles, learning the grammatical rules and language constructs that are useful when recognizing speech. For example, if a speaker mumbles part of a sentence or uses a homonym, a language model will use the preceding and following words to make a good guess at what the speaker meant. Fascinatingly, it turns out that language models are not language-specific, which implies that language rules are universal to us humans, and which also means you can use them to translate between languages accurately. Modern language-model-based approaches are ‘end-to-end’: instead of having explicitly separate components implementing the speech and language models, they are trained as one big model, increasing their accuracy and effectiveness.
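Here is a toy illustration of that disambiguation step, assuming the acoustic stage produced two equally plausible candidates for a homonym. The bigram counts are invented for the example; a real language model learns its statistics from those large text corpora.

```python
# Invented bigram counts: "the night" is far more common than "the knight",
# but "night rode" essentially never occurs while "knight rode" does.
BIGRAM_COUNTS = {
    ("the", "knight"): 2, ("knight", "rode"): 5,
    ("the", "night"): 40, ("night", "rode"): 0,
}

def sentence_score(words, smoothing=1e-6):
    # Product of bigram scores, with tiny smoothing so unseen pairs score > 0
    score = 1.0
    for pair in zip(words, words[1:]):
        score *= BIGRAM_COUNTS.get(pair, 0) + smoothing
    return score

candidates = [["the", "knight", "rode"], ["the", "night", "rode"]]
best = max(candidates, key=sentence_score)
print(" ".join(best))  # "the knight rode": the following word settles the homonym
```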
From a practical point of view, these two algorithm categories have very different compute resource requirements. Superficial algorithms lend themselves to compact models, making them useful for Endpoint AI applications (assuming you’ve solved the power problem). Language models require much more computational power and memory – the small ones weigh in at around 100 MB, and the largest require many gigabytes of RAM and dedicated AI processors.
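A bit of back-of-the-envelope math shows why the gap matters. The parameter counts below are illustrative, not measurements of any particular model, but the orders of magnitude match the figures above.

```python
def model_size_mb(params, bytes_per_weight):
    # Weights dominate a model's footprint: parameters x bytes per parameter
    return params * bytes_per_weight / (1024 ** 2)

kws_params = 250_000     # a tiny keyword-spotting model (illustrative)
lm_params = 25_000_000   # a "small" language model (illustrative)

print(f"KWS model, int8 weights:     {model_size_mb(kws_params, 1):7.2f} MB")  # ~0.24 MB
print(f"Language model, float32:     {model_size_mb(lm_params, 4):7.2f} MB")  # ~95 MB
```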
The good news is that “superficial” models are amazingly useful. The dog analogy holds up here: much as you can train your dog to respond to dozens of simple commands even when they are phrased differently, we can produce practical, efficient models that respond usefully to your speech.
In our next blog post, we’ll dig into the details of how superficial speech AI models work.