What you need to know
- Apple is using Apple Podcasts to train Siri to understand people with a stutter.
- The company has collected 28,000 audio samples from podcasts to aid in the research.
Apple is working to improve Siri's ability to understand users with atypical speech patterns, such as those who stutter. To support the effort, the company is extracting audio samples from Apple Podcasts to help train Siri to understand more kinds of speech.
According to a report from the Wall Street Journal (via 9to5Mac), Apple has built a bank of 28,000 audio clips from podcasts that feature someone with a stutter.
The company is now researching how to automatically detect if someone speaks with a stutter, and has built a bank of 28,000 audio clips from podcasts featuring stuttering to help do so, according to a research paper due to be published by Apple employees this week that was seen by the Wall Street Journal.
As the report notes, Siri can misinterpret pauses in the speech of users who stutter as the end of a voice command.
Siri can be voice activated on iPhones, iPads, and Macs, and especially HomePod and HomePod mini, using the "Hey Siri" voice command followed by a request. For users who stutter, however, the current version of Siri commonly interprets pauses in speech as the end of a voice command. In turn, this prevents the voice assistant from reaching its full potential for a collection of customers.
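To make the failure mode concrete, here is a minimal Python sketch of a silence-timeout endpointer, a common way for a voice assistant to decide that a command has ended. It is an illustration only and not Siri's actual implementation: the function name `find_endpoint`, the 10 ms frame size, and the timeout values are all assumptions chosen for the example.

```python
def find_endpoint(frames_is_speech, frame_ms=10, silence_timeout_ms=500):
    """Return the index of the frame where the command is judged finished.

    frames_is_speech: one boolean per audio frame (True = speech detected).
    Returns None if the silence timeout is never reached.
    Hypothetical sketch; real assistants use more elaborate endpointing.
    """
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(frames_is_speech):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= silence_timeout_ms:
                return i  # command treated as finished here
    return None


# 300 ms of speech, a 600 ms block (silent pause), 300 ms more speech,
# then 1.2 s of trailing silence.
frames = [True] * 30 + [False] * 60 + [True] * 30 + [False] * 120

# With a 500 ms timeout the command is cut off during the block,
# before the second stretch of speech (which starts at frame 90).
print(find_endpoint(frames, silence_timeout_ms=500))

# A longer timeout tolerates the block and ends in the trailing silence.
print(find_endpoint(frames, silence_timeout_ms=1000))
```

In this toy setup, a fixed short timeout truncates the request mid-sentence, while simply lengthening it makes the assistant slower for everyone. That trade-off is one reason detecting stuttering events directly could be useful.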
Apple's research paper says the researchers are specifically studying "five event types including blocks, prolongations, sound repetitions, word repetitions, and interjections."
The ability to automatically detect stuttering events in speech could help speech pathologists track an individual's fluency over time or help improve speech recognition systems for people with atypical speech patterns. Despite increasing interest in this area, existing public datasets are too small to build generalizable dysfluency detection systems and lack sufficient annotations. In this work, we introduce Stuttering Events in Podcasts (SEP-28k), a dataset containing over 28k clips labeled with five event types including blocks, prolongations, sound repetitions, word repetitions, and interjections. Audio comes from public podcasts largely consisting of people who stutter interviewing other people who stutter. We benchmark a set of acoustic models on SEP-28k and the public FluencyBank dataset and highlight how simply increasing the amount of training data improves relative detection performance by 28% and 24% F1 on each. Annotations from over 32k clips across both datasets will be publicly released.
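For readers unfamiliar with the F1 metric cited in the abstract, the following Python sketch shows how F1 is computed for a single event type and what a relative gain in F1 means. The labels and scores below are invented for illustration and are not taken from the paper; the helper names `f1_score` and `relative_improvement` are likewise assumptions, and reading the abstract's 28% and 24% figures as relative gains is an interpretation.

```python
def f1_score(true_labels, predicted_labels):
    """F1 for one binary event type (e.g. clip contains a block: 1, else 0)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Made-up labels for eight clips: 1 = event present, 0 = absent.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
preds = [1, 1, 0, 0, 1, 0, 0, 0]
print(round(f1_score(truth, preds), 3))  # 4/7, about 0.571

# Made-up scores: moving from F1 = 0.50 to F1 = 0.64 is a 28% relative gain.
print(round(relative_improvement(0.50, 0.64), 2))
```

Because F1 balances precision and recall, it penalizes a detector both for missing stuttering events and for flagging fluent speech.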
Apple believes that, although the research first aims to help Siri understand people who stutter, it could later be extended to other speech conditions, such as dysarthria.