Fundamental Challenges and Promising Directions in Speech Recognition Research

Ever used a voice transcription service and wondered “WTF, how did it think I said that?”
I have that experience sometimes when using Wispr Flow, but when I listen back to the audio, my reaction is: oh… I guess I can see how it thought that.
Let’s take a look at some examples from our own internal usage. Does this seem like the right transcription?
In a lot of the above, the wrong answer could also be the right one — depending on the context and the speaker! But for people to rely on voice all the time, everywhere, we need to be able to get this right for our users. We’ve also heard enterprises relying on voice AI complain of this problem: nonsense transcriptions that should be obviously wrong given the context, but are plausibly correct given just the audio.
We can break down the problems into the following categories:
- Single word (or short) audio (ambiguous without the surrounding words)
- Separation from background noise
- Accent adaptation, especially for multilingual speakers
- Domain-specific vocabularies for which there might not be enough training data (e.g. medical or legal terminology)
- Speech understanding
- Code switching (switching between multiple languages in a single sentence)
- Accurate names (given context)
- Outputs that are logically incoherent given an understanding of the domain
We’ve long hoped that improved automatic speech recognition (ASR) models would solve these issues. But on our “hard” test sets, the advances that have taken English word error rates (WERs) from 3% to 2% have hardly made a dent: the audio itself is inherently ambiguous, and the newer models often rely on speech LLMs that don’t always transcribe the audio faithfully and frequently go off the rails.
At Wispr, to solve this problem, we look at the way that people understand each other:
- They know the other person’s voice — that’s how we can pick up a person clearly with lots of background noise, or if they have an accent we’re not used to
- They know the context — after understanding what a user is talking ABOUT, it’s so much easier to figure out what makes sense vs. doesn’t
- Sometimes we have to ask a person to repeat themselves - “Come again?”
Our approach is to treat ASR as a mixed-modal, context-driven problem. To figure out what a person said, we use not just the audio but also (see the sketch after this list):
- Context from the app they’re dictating into
- History of relevant topics they talk about
- Information about names or terms they commonly use
- Embeddings of the user’s voice
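To make the idea concrete, here is a minimal Python sketch of context-driven rescoring. Everything in it is hypothetical: the class, field names, weights, and scoring function are ours for illustration, not Wispr’s actual pipeline. It takes an n-best list of transcription hypotheses from an ASR model and picks the one that best fits both the acoustic score and the user’s context.

```python
from dataclasses import dataclass, field

@dataclass
class DictationContext:
    """Context signals available alongside the raw audio (illustrative names)."""
    app_text: str = ""                                             # text visible in the app being dictated into
    recent_topics: list[str] = field(default_factory=list)        # history of relevant topics
    custom_vocab: list[str] = field(default_factory=list)         # names / terms the user commonly uses
    speaker_embedding: list[float] = field(default_factory=list)  # embedding of the user's voice

def rescore_hypotheses(hypotheses: list[tuple[str, float]],
                       ctx: DictationContext,
                       context_weight: float = 0.3) -> str:
    """Pick the hypothesis that best fits both the audio and the context.

    `hypotheses` is an n-best list of (text, acoustic_score) pairs from the ASR model.
    The context score here is a toy lexical-overlap measure; a real system would use
    a language model conditioned on the context instead.
    """
    context_terms = {t.lower() for t in ctx.custom_vocab + ctx.recent_topics}
    context_terms |= {w.lower() for w in ctx.app_text.split()}

    def context_score(text: str) -> float:
        words = text.lower().split()
        if not words:
            return 0.0
        return sum(w in context_terms for w in words) / len(words)

    return max(hypotheses,
               key=lambda h: (1 - context_weight) * h[1] + context_weight * context_score(h[0]))[0]

# Example: with "Wispr" in the user's custom vocabulary, the correctly spelled
# hypothesis wins even though its raw acoustic score is slightly lower.
ctx = DictationContext(custom_vocab=["Wispr"], recent_topics=["dictation latency"])
best = rescore_hypotheses([("whisper flow is fast", 0.92), ("Wispr Flow is fast", 0.90)], ctx)
print(best)  # -> "Wispr Flow is fast"
```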
With this kind of information, it finally becomes possible to push the frontiers of ASR in non-ideal (and ideal) conditions. We need to either correctly identify what the user said (using context), or let them know which part of the audio we had trouble understanding, along with our best guesses at what they might have said.
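When context still isn’t enough to disambiguate, the system should say so rather than guess silently. Here is a toy sketch of what surfacing that uncertainty could look like; again, the names and threshold are illustrative, not a real API.

```python
from dataclasses import dataclass

@dataclass
class WordHypothesis:
    """One recognized word with its model confidence and alternative readings (illustrative)."""
    text: str
    confidence: float        # 0.0 to 1.0, as reported by the ASR model
    alternatives: list[str]  # other plausible words for the same stretch of audio

def flag_uncertain_spans(words: list[WordHypothesis], threshold: float = 0.6) -> list[dict]:
    """Collect the low-confidence words a UI could surface back to the user,
    each with the best guesses at what they might have said."""
    return [
        {"position": i, "best_guess": w.text, "other_guesses": w.alternatives}
        for i, w in enumerate(words)
        if w.confidence < threshold
    ]

# Example: "meet at the Warf at noon" where the fourth word was uncertain.
words = [
    WordHypothesis("meet", 0.98, []),
    WordHypothesis("at", 0.99, []),
    WordHypothesis("the", 0.99, []),
    WordHypothesis("Warf", 0.41, ["Wharf", "wharf", "dwarf"]),
    WordHypothesis("at", 0.97, []),
    WordHypothesis("noon", 0.95, []),
]
print(flag_uncertain_spans(words))
# -> [{'position': 3, 'best_guess': 'Warf', 'other_guesses': ['Wharf', 'wharf', 'dwarf']}]
```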
If you’d like to work on a new approach to ASR that solves for fundamental limitations in the current state of the art, come join us at Wispr!
