A hand holding an iPhone with Siri pulled up on the screen

Sure, everyone loves using Siri hands-free. However, not many people understand the underlying technology that makes it work. If you're interested in learning, you may want to check out this article published in Apple's Machine Learning Journal, which details the ins and outs of what happens when you call for Siri without pressing the button.

The article begins with a brief overview of how Siri recognizes that you're speaking to it:

A very small speech recognizer runs all the time and listens for just those two words. When it detects "Hey Siri", the rest of Siri parses the following speech as a command or query. The "Hey Siri" detector uses a Deep Neural Network (DNN) to convert the acoustic pattern of your voice at each instant into a probability distribution over speech sounds. It then uses a temporal integration process to compute a confidence score that the phrase you uttered was "Hey Siri". If the score is high enough, Siri wakes up.
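To make that "temporal integration" idea a little more concrete, here's a rough Python sketch of how per-frame DNN outputs might get rolled up into a single confidence score. The class indices, phone sequence, and threshold are all made up, and the even split of frames across phones is a simplification of whatever alignment Apple actually uses.

```python
import numpy as np

# Hypothetical per-frame output of the acoustic DNN: each row is a probability
# distribution over sound classes for one ~10 ms frame. Class indices 0..19
# stand in for the phones in "Hey Siri" plus silence and other speech.
rng = np.random.default_rng(0)
frame_probs = rng.dirichlet(np.ones(20), size=50)  # 50 frames x 20 classes

# A made-up sequence of sound-class indices for the trigger phrase.
target_sequence = [3, 3, 7, 7, 12, 12, 15, 15]

def trigger_score(frame_probs, target_sequence):
    """Accumulate log-probabilities of the target phone sequence over time.

    Simplified stand-in for the temporal integration Apple describes: a real
    detector aligns frames to phone states (e.g. with dynamic programming)
    rather than splitting them evenly as done here.
    """
    n_frames = len(frame_probs)
    # Naively assign an equal share of frames to each phone in the sequence.
    spans = np.array_split(np.arange(n_frames), len(target_sequence))
    log_score = 0.0
    for phone, span in zip(target_sequence, spans):
        log_score += np.log(frame_probs[span, phone] + 1e-12).sum()
    # Normalize by frame count so the score doesn't depend on utterance length.
    return log_score / n_frames

score = trigger_score(frame_probs, target_sequence)
THRESHOLD = -2.5  # tuned on real data in practice; arbitrary here
print("confidence:", score, "-> wake up" if score > THRESHOLD else "-> stay asleep")
```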

However, things get a bit more complex after that as the Siri team begins to break the sound detection process into even smaller components:

The microphone in an iPhone or Apple Watch turns your voice into a stream of instantaneous waveform samples, at a rate of 16000 per second. A spectrum analysis stage converts the waveform sample stream to a sequence of frames, each describing the sound spectrum of approximately 0.01 sec. About twenty of these frames at a time (0.2 sec of audio) are fed to the acoustic model, a Deep Neural Network (DNN) which converts each of these acoustic patterns into a probability distribution over a set of speech sound classes: those used in the "Hey Siri" phrase, plus silence and other speech, for a total of about 20 sound classes.
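If the waveform-to-frames step sounds abstract, here's a minimal Python sketch of that front end using the numbers Apple quotes (16,000 samples per second, frames of roughly 0.01 seconds, about 20 frames per model input). The raw FFT magnitude is just an illustration; Apple doesn't say exactly which spectral features it computes.

```python
import numpy as np

SAMPLE_RATE = 16_000             # samples per second, per the article
FRAME_LEN = SAMPLE_RATE // 100   # 160 samples, roughly 0.01 s per frame
WINDOW_FRAMES = 20               # ~0.2 s of audio per acoustic-model input

def spectral_frames(waveform: np.ndarray) -> np.ndarray:
    """Chop a waveform into 10 ms frames and take a magnitude spectrum of each.

    A real pipeline would likely use mel-filterbank features; this only
    illustrates the waveform -> frame-sequence step described in the article.
    """
    n_frames = len(waveform) // FRAME_LEN
    frames = waveform[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    return np.abs(np.fft.rfft(frames, axis=1))

def sliding_windows(frames: np.ndarray):
    """Yield overlapping stacks of 20 frames, one stack per DNN invocation."""
    for start in range(len(frames) - WINDOW_FRAMES + 1):
        yield frames[start : start + WINDOW_FRAMES]

# One second of fake microphone input.
audio = np.random.default_rng(1).normal(size=SAMPLE_RATE)
frames = spectral_frames(audio)
first_window = next(sliding_windows(frames))
print(frames.shape, first_window.shape)  # (100, 81) (20, 81)
```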

The article also touches on how they achieve maximum functionality without draining battery and processing power:

To avoid running the main processor all day just to listen for the trigger phrase, the iPhone's Always On Processor (AOP) (a small, low-power auxiliary processor, that is, the embedded Motion Coprocessor) has access to the microphone signal (on 6S and later). We use a small proportion of the AOP's limited processing power to run a detector with a small version of the acoustic model (DNN). When the score exceeds a threshold the motion coprocessor wakes up the main processor, which analyzes the signal using a larger DNN. In the first versions with AOP support, the first detector used a DNN with 5 layers of 32 hidden units and the second detector had 5 layers of 192 hidden units.
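The two-detector setup is essentially a cascade: a cheap model runs all the time on the low-power coprocessor, and the bigger one is consulted only when the cheap one fires. Here's a hypothetical Python sketch of that control flow; the linear "models," thresholds, and feature sizes are placeholders rather than Apple's actual values.

```python
import numpy as np

rng = np.random.default_rng(2)
FEATURE_SIZE = 20 * 81           # one 0.2 s stack of spectral frames, flattened

# Stand-ins for the trained detectors: random linear scorers here, where the
# real models are DNNs with 5 x 32 (first pass) and 5 x 192 (second pass) units.
small_weights = rng.normal(size=FEATURE_SIZE)  # cheap model run on the AOP
large_weights = rng.normal(size=FEATURE_SIZE)  # bigger model run on the main CPU

def score(features: np.ndarray, weights: np.ndarray) -> float:
    """Sigmoid of a linear score, a placeholder for a DNN forward pass."""
    return float(1 / (1 + np.exp(-(features.ravel() @ weights) / FEATURE_SIZE)))

FIRST_PASS_THRESHOLD = 0.45   # permissive: a false alarm only wakes the main CPU
SECOND_PASS_THRESHOLD = 0.6   # stricter: decides whether Siri actually wakes

def cascade_detect(features: np.ndarray) -> bool:
    """Only run the expensive detector if the low-power one fires first."""
    if score(features, small_weights) < FIRST_PASS_THRESHOLD:
        return False                       # main processor stays asleep
    return score(features, large_weights) >= SECOND_PASS_THRESHOLD

window = rng.normal(size=(20, 81))         # fake acoustic features
print("trigger:", cascade_detect(window))
```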

Finally, the team describes how the recordings taken during Siri setup (when your phone asks you to say five phrases containing "Hey Siri") further reduce the chances of annoying false triggers:

We compare any possible new "Hey Siri" utterance with the stored examples as follows. The (second-pass) detector produces timing information that is used to convert the acoustic pattern into a fixed-length vector, by taking the average over the frames aligned to each state. A separate, specially trained DNN transforms this vector into a "speaker space" where, by design, patterns from the same speaker tend to be close, whereas patterns from different speakers tend to be further apart. We compare the distances to the reference patterns created during enrollment with another threshold to decide whether the sound that triggered the detector is likely to be "Hey Siri" spoken by the enrolled user.
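And here's a rough Python sketch of that last check: average the frames aligned to each state into a fixed-length vector, project it into a "speaker space," and compare against the vectors stored during enrollment. The random projection stands in for Apple's specially trained DNN, and cosine similarity is used here in place of whatever distance measure they actually compare against their threshold.

```python
import numpy as np

rng = np.random.default_rng(3)
EMBED_DIM = 32                    # dimensionality of the hypothetical speaker space
N_STATES, FRAME_DIM = 8, 81       # toy values: phone states and spectral frame size

# Random projection standing in for the specially trained speaker-space DNN.
projection = rng.normal(size=(EMBED_DIM, N_STATES * FRAME_DIM))

def speaker_embedding(frames_by_state, projection):
    """Average the frames aligned to each state, concatenate the averages into
    a fixed-length vector, and project it into speaker space."""
    fixed = np.concatenate([frames.mean(axis=0) for frames in frames_by_state])
    emb = projection @ fixed
    return emb / np.linalg.norm(emb)

def is_enrolled_speaker(candidate, references, threshold=0.8):
    """Accept if the candidate is close enough to a stored enrollment example."""
    best_similarity = max(float(candidate @ ref) for ref in references)
    return best_similarity >= threshold

def fake_utterance():
    """A made-up utterance: a few random frames aligned to each phone state."""
    return [rng.normal(size=(rng.integers(2, 6), FRAME_DIM)) for _ in range(N_STATES)]

# The five enrollment phrases recorded during "Hey Siri" setup.
references = [speaker_embedding(fake_utterance(), projection) for _ in range(5)]
candidate = speaker_embedding(fake_utterance(), projection)
print("accepted as enrolled user:", is_enrolled_speaker(candidate, references))
```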

If you want to read more of the article, head over to Apple's Machine Learning Journal.

Thoughts?

Do you find it interesting to learn the ins and outs of processes like this? Let us know in the comments!