Any voice-controlled product can work well in a quiet room, but that's not where your customers live. TalkTo™ is a suite of proprietary Machine Learning & Microphone Processing algorithms that deliver a quiet-room experience in the noisy & unpredictable spaces of everyday life.
Voice-controlled products span the spectrum of embedded design, and so does TalkTo. Each TalkTo variant is scaled and optimized for the capabilities and requirements of its target application.
Set Top Box
Battery Optimized Devices
Smart Home / IoT
TV / Smart Screen
Anatomy of a Voice-Controlled Product
Audio Front End (AFE)
Conceptually, the AFE serves as a microphone cleaner. It takes the raw, noisy audio from the microphones and detects, extracts, and cleans any speech activity found in the ambient sound field. The cleaned, reconstructed voice stream is then output to the Wake Word Engine.
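To make the "detect" step concrete, here is a toy sketch of frame-level speech-activity detection. It is not the TalkTo algorithm: it simply averages time-aligned microphone channels and flags frames whose RMS level exceeds a fixed threshold. The function names, the 160-sample frame length, and the 0.1 threshold are all assumptions chosen for illustration.

```python
import math


def average_channels(channels):
    """Average time-aligned microphone channels sample by sample.

    Assumption: channels are already time-aligned; a real AFE would do far more.
    """
    n = min(len(ch) for ch in channels)
    return [sum(ch[i] for ch in channels) / len(channels) for i in range(n)]


def frame_rms(samples, frame_len):
    """Return the RMS level of each complete, non-overlapping frame."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return levels


def detect_activity(channels, frame_len=160, threshold=0.1):
    """Flag each frame as active (True) when its RMS exceeds the threshold.

    frame_len and threshold are illustrative values, not tuned settings.
    """
    mono = average_channels(channels)
    return [level > threshold for level in frame_rms(mono, frame_len)]
```

A fixed energy threshold like this breaks down quickly in loud rooms, which is exactly the situation a production AFE is built to handle.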
Wake Word Engine (WWE)
The Wake Word Engine continuously scans the AFE-provided voice stream, looking for a specific phrase (e.g. "Alexa"). WWE accuracy is one of the primary drivers of UX quality, and it depends heavily on the purity of the AFE-generated voice stream.
Back End Services
Once the wake word is detected, the extracted voice command is passed to a voice service (cloud-based or local), which converts speech to text (ASR) and determines user intent (NLU). Again, accuracy relies on a high-quality voice stream from the Audio Front End.
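The hand-off between the three stages described above can be sketched as a small state machine. This is a minimal illustration, not a TalkTo API: the `VoicePipeline` class, its callables, and the fixed `command_frames` capture length are hypothetical stand-ins (a real product would typically end the capture on end-of-speech rather than a frame count).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Frame = List[float]


@dataclass
class VoicePipeline:
    """Toy AFE -> WWE -> voice-service chain (all names are hypothetical)."""
    clean: Callable[[Frame], Frame]               # stand-in for the AFE
    is_wake_word: Callable[[Frame], bool]         # stand-in for the WWE
    voice_service: Callable[[List[Frame]], str]   # stand-in for ASR + NLU
    command_frames: int = 3                       # assumed fixed capture length
    _capturing: bool = field(default=False, init=False)
    _buffer: List[Frame] = field(default_factory=list, init=False)

    def push(self, raw: Frame) -> Optional[str]:
        """Feed one raw frame; return an intent once a command is captured."""
        voice = self.clean(raw)  # every frame passes through the AFE first
        if not self._capturing:
            # Idle: the WWE scans the cleaned stream for the wake word.
            if self.is_wake_word(voice):
                self._capturing = True
                self._buffer = []
            return None
        # Capturing: collect the command that follows the wake word.
        self._buffer.append(voice)
        if len(self._buffer) < self.command_frames:
            return None
        self._capturing = False
        return self.voice_service(self._buffer)
```

Because every frame reaches the WWE and the voice service only after `clean` has run, any residue the AFE leaves behind degrades both downstream stages.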
For a high-quality voice experience, products with audio output must have low speaker distortion. To achieve this, the output audio is processed so that it does not exceed the physical capabilities of the speaker(s). Audio Weaver® offers a fast path to solving this problem.
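One common way to keep output audio within a speaker's limits is a peak limiter. The sketch below is a generic textbook-style limiter, not Audio Weaver's processing: the `limit` function, the 0.8 ceiling, and the 0.999 release coefficient are assumptions chosen for illustration.

```python
def limit(samples, ceiling=0.8, release=0.999):
    """Simple peak limiter: instant attack, exponential gain recovery.

    ceiling and release are illustrative values, not speaker-specific tuning.
    """
    gain = 1.0
    out = []
    for x in samples:
        # Let the gain drift back toward unity between peaks.
        gain = 1.0 - release * (1.0 - gain)
        # If this sample would still exceed the ceiling, clamp the gain now.
        if abs(x) * gain > ceiling:
            gain = ceiling / abs(x)
        out.append(x * gain)
    return out
```

The gain drops immediately when a peak arrives and recovers gradually afterward, so loud passages are held at the ceiling rather than hard-clipped sample by sample.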
The video below shows the 6-Microphone, Amazon Qualified, TalkTo - Set Top Box AFE running on an Amlogic A113 board. The system is running stand-alone, directly below the TV's speakers (playout at 90+ dBC!). Even without the closed-loop benefit of AEC (there's no reference signal to eliminate), the TalkTo AFE is still able to detect and extract faint voice commands from the ambient sound field, allowing the WWE and Voice Services to deliver a robust user experience, even in the harshest of conditions!