Speech recognition technology allows us to ask Siri to check the weather for tomorrow, or to ask Alexa to play our favorite song.
But those technologies require audible speech. What if a person can’t speak, or if vocalized speech in a particular setting isn’t appropriate?
Cheng Zhang, assistant professor of information science in the Cornell Ann S. Bowers College of Computing and Information Science, and doctoral student Ruidong Zhang have an answer: SpeeChin, a silent-speech recognition (SSR) device that can identify silent commands using images of skin deformation in the neck and face captured by a neck-mounted infrared (IR) camera.
The technology is detailed in “SpeeChin: A Smart Necklace for Silent Speech Recognition,” published Dec. 31 in Proceedings of the Association for Computing Machinery on Interactive, Mobile, Wearable and Ubiquitous Technologies.
Ruidong Zhang is also presenting the paper in October at the Ubiquitous Computing (UbiComp 2022) conference.
“There are two questions: First, why a necklace? And second, why silent speech?” Cheng Zhang said. “We feel a necklace is a form factor that people are used to, as opposed to ear-mounted devices, which may not be as comfortable. As far as silent speech, people may think, ‘I already have a speech recognition device on my phone.’ But you need to vocalize sound for those, and that may not always be socially appropriate, or the person may not be able to vocalize speech.
“This device has the potential to learn a person’s speech patterns, even with silent speech,” he said.
“We’re introducing an entirely new form factor, new hardware, into this field,” said Ruidong Zhang, who built the original prototype in 2020 at his home in China, while completing the first year of his doctoral program remotely.
The device is similar in appearance to NeckFace, a technology Cheng Zhang and his SciFi Lab team members unveiled last year. NeckFace continuously tracks full facial expressions by using infrared cameras to capture images of the chin and face from beneath the neck.
Like NeckFace, SpeeChin features an IR camera mounted on a 3D-printed necklace case, which is hung on a silver chain with the camera pointing up at the wearer’s chin. For increased stability, the developers designed a wing on each side, and placed a coin on the bottom.
Convenience and privacy, Cheng Zhang said, are two reasons that a necklace-mounted IR camera could be preferable to a traditional camera in front of the face. “A camera in front of your face is taking pictures of what’s behind you,” he said, “and that raises privacy concerns.”
In their initial experiment, which involved 20 participants (10 speaking English, 10 speaking Mandarin Chinese), the researchers first took measurements to determine the baseline position of the chin, then used differential images to train the device to recognize simple commands.
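As a rough illustration of what a differential image is, the sketch below subtracts a baseline frame from a later frame, pixel by pixel, so that only the change from the resting position remains. It is not the researchers' code: the function names, the tiny 2x2 "frames" and the choice to estimate the baseline by averaging a few resting frames are all assumptions made for the example.

```python
from typing import List

# A grayscale frame as rows of pixel intensities (0-255).
Frame = List[List[int]]


def average_frames(frames: List[Frame]) -> Frame:
    """Estimate a baseline frame as the per-pixel mean of several resting frames.

    Averaging is an assumption for this sketch, not the method from the paper.
    """
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [round(sum(f[r][c] for f in frames) / n) for c in range(cols)]
        for r in range(rows)
    ]


def differential_image(frame: Frame, baseline: Frame) -> Frame:
    """Return the signed per-pixel difference between a frame and the baseline."""
    return [
        [pixel - base for pixel, base in zip(frame_row, base_row)]
        for frame_row, base_row in zip(frame, baseline)
    ]


# Hypothetical 2x2 frames captured while the wearer is at rest.
rest_frames = [
    [[10, 10], [20, 20]],
    [[12, 10], [20, 22]],
]
baseline = average_frames(rest_frames)

# Hypothetical frame captured during a silent utterance.
speaking = [[11, 40], [5, 21]]
diff = differential_image(speaking, baseline)
print(diff)  # [[0, 30], [-15, 0]]
```

Pixels that match the baseline come out as zero, so a recognizer trained on such images sees mostly the movement of the skin rather than the wearer's resting appearance.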
Ruidong Zhang had participants utter 54 commands in English, consisting of digits, interactive commands, voice assistant commands, punctuation commands and navigation commands. He then did the same with 44 simple Mandarin words or phrases.
SpeeChin recognized commands in English and Mandarin with an average accuracy of 90.5% and 91.6%, respectively. To further test its limits, the researchers conducted another study with 10 participants, all silently uttering a specially designed list of 72 one-syllable “nonwords” using phonemes – combinations of 18 consonants and four vowels.
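The size of the nonword list follows from simple counting. Assuming each nonword pairs one consonant with one vowel, which is consistent with the reported total, 18 consonants and four vowels give 18 x 4 = 72 combinations. The short sketch below checks that arithmetic; the labels C1 to C18 and V1 to V4 are placeholders, since the article does not list the actual phonemes.

```python
from itertools import product

# Placeholder labels; the actual consonants and vowels are not given in the article.
consonants = [f"C{i}" for i in range(1, 19)]  # 18 consonants
vowels = [f"V{i}" for i in range(1, 5)]       # 4 vowels

# Assumption: each nonword is one consonant followed by one vowel.
nonwords = [c + v for c, v in product(consonants, vowels)]
print(len(nonwords))  # 72
```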
Finally, the researchers recruited six participants to utter 10 Mandarin and 10 English phrases while walking. Success rates were lower in this study, due in part to the variation in walking styles (more versus less head movement, for example) among the participants.
The project illustrates the power of determination: Ruidong Zhang built a lab in his home, complete with soldering station, and recruited people in his hometown as study participants.
“But because I live in a small city and it’s hard to find people who speak English,” he said, “we actually ended up in Hangzhou, at Zhejiang University, to recruit English speakers. It was an unforgettable experience for me.”
Support for this work came from the Cornell Department of Information Science, and partially from a Shanghai Jiao Tong University-Cornell seed grant from the Cornell China Center.