Fundamental Challenges and Promising Directions in Speech Recognition Research

Ever used a voice transcription service and wondered “WTF, how did it think I said that?”
I have that experience sometimes when using Wispr Flow, but when I listen back to the audio, my reaction is: oh… I guess I can see how it thought that.
Let’s take a look at some examples from our own internal usage. Does this seem like the right transcription?
In a lot of the above, the wrong answer could also be the right one — depending on the context and the speaker! But for people to rely on voice all the time, everywhere, we need to be able to get this right for our users. We’ve also heard enterprises relying on voice AI complain of this problem: nonsense transcriptions that should be obviously wrong given the context, but are plausibly correct given just the audio.
We can break down the problems into the following categories:
- Single word (or short) audio (ambiguous without the surrounding words)
- Separation from background noise
- Accent adaptation, especially for multilingual speakers
- Domain-specific vocabularies for which there might not be enough training data (e.g. medical or legal terminology)
- Speech understanding
- Code switching (switching between multiple languages in a single sentence)
- Accurate names (given context)
- Outputs that are logically incoherent given an understanding of the domain
We’ve long hoped that improved automatic speech recognition (ASR) models would solve these issues. But on our “hard” test sets, the advances that have taken English word error rates (WERs) from 3% to 2% have hardly made a dent: the audio itself is inherently ambiguous, and the newer models often rely on speech LLMs that don’t always transcribe the audio faithfully and frequently go off the rails.
At Wispr, to solve this problem, we look at the way that people understand each other:
- They know the other person’s voice — that’s how we can pick up a person clearly with lots of background noise, or if they have an accent we’re not used to
- They know the context — after understanding what a user is talking ABOUT, it’s so much easier to figure out what makes sense vs. doesn’t
- Sometimes we have to ask a person to repeat themselves - “Come again?”
Our approach is to treat ASR as a mixed-modal, context-driven problem. To figure out what a person said, we use not just the audio but also (see the sketch after this list):
- Context from the app they’re dictating into
- History of relevant topics they talk about
- Information about names or terms they commonly use
- Embeddings of the user’s voice
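To make the idea concrete, here is a minimal Python sketch of context-driven rescoring. Everything in it is hypothetical: the class, field names, weights, and scoring function are ours for illustration, not Wispr’s actual pipeline. It takes an n-best list of transcription hypotheses from an ASR model and picks the one that best fits both the acoustic score and the user’s context.

```python
from dataclasses import dataclass, field

@dataclass
class DictationContext:
    """Context signals available alongside the raw audio (illustrative names)."""
    app_text: str = ""                                             # text visible in the app being dictated into
    recent_topics: list[str] = field(default_factory=list)        # history of relevant topics
    custom_vocab: list[str] = field(default_factory=list)         # names / terms the user commonly uses
    speaker_embedding: list[float] = field(default_factory=list)  # embedding of the user's voice

def rescore_hypotheses(hypotheses: list[tuple[str, float]],
                       ctx: DictationContext,
                       context_weight: float = 0.3) -> str:
    """Pick the hypothesis that best fits both the audio and the context.

    `hypotheses` is an n-best list of (text, acoustic_score) pairs from the ASR model.
    The context score here is a toy lexical-overlap measure; a real system would use
    a language model conditioned on the context instead.
    """
    context_terms = {t.lower() for t in ctx.custom_vocab + ctx.recent_topics}
    context_terms |= {w.lower() for w in ctx.app_text.split()}

    def context_score(text: str) -> float:
        words = text.lower().split()
        if not words:
            return 0.0
        return sum(w in context_terms for w in words) / len(words)

    return max(hypotheses,
               key=lambda h: (1 - context_weight) * h[1] + context_weight * context_score(h[0]))[0]

# Example: with "Wispr" in the user's custom vocabulary, the correctly spelled
# hypothesis wins even though its raw acoustic score is slightly lower.
ctx = DictationContext(custom_vocab=["Wispr"], recent_topics=["dictation latency"])
best = rescore_hypotheses([("whisper flow is fast", 0.92), ("Wispr Flow is fast", 0.90)], ctx)
print(best)  # -> "Wispr Flow is fast"
```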
With this kind of information, it finally becomes possible to push the frontiers of ASR in non-ideal (and ideal) conditions. We need to either correctly identify what the user said (using context), or let them know which part of the audio we had trouble understanding, along with our best guesses at what they might have said.
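When context still isn’t enough to disambiguate, the system should say so rather than guess silently. Here is a toy sketch of what surfacing that uncertainty could look like; again, the names and threshold are illustrative, not a real API.

```python
from dataclasses import dataclass

@dataclass
class WordHypothesis:
    """One recognized word with its model confidence and alternative readings (illustrative)."""
    text: str
    confidence: float        # 0.0 to 1.0, as reported by the ASR model
    alternatives: list[str]  # other plausible words for the same stretch of audio

def flag_uncertain_spans(words: list[WordHypothesis], threshold: float = 0.6) -> list[dict]:
    """Collect the low-confidence words a UI could surface back to the user,
    each with the best guesses at what they might have said."""
    return [
        {"position": i, "best_guess": w.text, "other_guesses": w.alternatives}
        for i, w in enumerate(words)
        if w.confidence < threshold
    ]

# Example: "meet at the Warf at noon" where the fourth word was uncertain.
words = [
    WordHypothesis("meet", 0.98, []),
    WordHypothesis("at", 0.99, []),
    WordHypothesis("the", 0.99, []),
    WordHypothesis("Warf", 0.41, ["Wharf", "wharf", "dwarf"]),
    WordHypothesis("at", 0.97, []),
    WordHypothesis("noon", 0.95, []),
]
print(flag_uncertain_spans(words))
# -> [{'position': 3, 'best_guess': 'Warf', 'other_guesses': ['Wharf', 'wharf', 'dwarf']}]
```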
If you’d like to work on a new approach to ASR that solves for fundamental limitations in the current state of the art, come join us at Wispr!
