Researchers at the Norwegian University of Science and Technology (NTNU) are combining two of the best-known approaches to automatic speech recognition to build a better, language-independent speech-to-text algorithm. The aim is a system that can identify the language being spoken in under a minute, transcribe languages on the brink of extinction, and bring the dream of ever-present voice-controlled electronics a little closer.
Achieving accurate, real-time speech recognition is no easy feat. Even assuming that the sound acquired by a device can be completely stripped of background noise (which isn't always the case), there is hardly a one-to-one correspondence between the waveform detected by a microphone and the phoneme being spoken. Different people speak the same language in different ways, whether through accents or through lisps and other articulation defects. Other factors such as age, gender, health and education also play a big role in altering the sound that reaches the microphone.
The NTNU researchers are now pioneering an approach that, if it can be fully exploited, may lead to a big leap in the performance of speech-to-text applications. They demonstrated that the mechanics of human speech are fundamentally the same across all people and all languages, and they are now training a computer to analyze the pressure of the sound waves captured by the microphone and determine which parts of the speech organs were used to produce each phoneme.
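To make the idea concrete, here is a minimal Python sketch of the last step of such a pipeline: once a system has worked out which speech organs produced a sound, the phoneme can be looked up from that description. This is an illustration only, not the NTNU team's method. The names `Articulation`, `PHONEMES` and `phoneme_for` are hypothetical, the table covers just a handful of consonants using their standard phonetic descriptions, and the hard part (inferring the articulation from the sound wave) is assumed to have already happened.

```python
from typing import NamedTuple, Optional


class Articulation(NamedTuple):
    """How a consonant is produced, independent of any particular language."""
    place: str    # where the vocal tract is constricted, e.g. "bilabial" (both lips)
    manner: str   # how the air is released, e.g. "plosive" or "fricative"
    voiced: bool  # whether the vocal cords vibrate


# A small, hand-written table using standard phonetic descriptions.
# A real system would need a much larger inventory, vowels included.
PHONEMES = {
    Articulation("bilabial", "plosive", False): "p",
    Articulation("bilabial", "plosive", True): "b",
    Articulation("bilabial", "nasal", True): "m",
    Articulation("alveolar", "plosive", False): "t",
    Articulation("alveolar", "plosive", True): "d",
    Articulation("alveolar", "nasal", True): "n",
    Articulation("alveolar", "fricative", False): "s",
    Articulation("alveolar", "fricative", True): "z",
    Articulation("velar", "plosive", False): "k",
    Articulation("velar", "plosive", True): "g",
}


def phoneme_for(place: str, manner: str, voiced: bool) -> Optional[str]:
    """Return the phoneme for an articulation, or None if it is not in the table."""
    return PHONEMES.get(Articulation(place, manner, voiced))


if __name__ == "__main__":
    # Both lips closed, air released in a burst, vocal cords vibrating.
    print(phoneme_for("bilabial", "plosive", True))
```

Because the lookup is keyed on how a sound is made rather than on the vocabulary of any one language, the same table could in principle serve any language that uses those sounds, which is the appeal of describing speech in these terms.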

