
Technical Challenges Behind Flow

Written by Sahaj Garg, CTO, Wispr Flow
Sep 11, 2025 · 3 min read

At Wispr Flow, we are building:

  • The world’s best ASR models (context-aware, personalized, and code-switched)
  • Cloud-based speech processing infrastructure for 1B users
  • Personalized LLMs with token-level formatting control
  • The most intuitive UX for voice interfaces across desktop and mobile devices
  • A habit-forming, easy-to-learn interaction
  • Software (for hardware) that lets users use voice around other people

An unordered list of short, concrete problems that we need to solve along the way:

  • Optimizing latency. Our users expect full transcription and LLM formatting/interpretation of their speech within 700ms of when they stop speaking; any slower, and users get impatient. We are continuously deploying larger models within this same budget, because every edit after the fact costs more time than anything else. We need to optimize model inference so we can run E2E ASR inference in <200ms and E2E LLM inference in <200ms, with a maximum networking budget of 200ms from anywhere in the world, even on spotty internet connections. (A rough sketch of this budget follows the list.)
  • Handling ambiguous audio. There’s inherent ambiguity in audio. Take an extreme example: a single-word audio clip. Unless you know (a) the person’s voice, (b) what kinds of topics they talk about, and (c) the surrounding context, it’s impossible to figure out with confidence what they said. We’re building context-conditioned ASR models, conditioned on speaker qualities, surrounding context, and individual history (see the rescoring sketch after this list).
  • Learning from corrections. Users often correct the output of Flow dictations; we want to drive these corrections to zero. To do that, we have to figure out how to accurately capture edits on a user’s device, determine which edits should not be repeated (and in which contexts to apply them), learn a local RL policy that aligns with a user’s particular style preferences, and train an LLM to precisely follow those edits. We’re finally at a point where we can build a product that never makes the same mistake twice! (A minimal edit-capture sketch follows the list.)
  • Personalized LLM formatting. Every person types differently, and communication style is key to how we convey tone in writing. Often, these differences in style are at the token level (rewrite in this way, prefer dashes over commas, skip capitalization but only in certain contexts, etc.). The key challenge with deploying this: LLMs are phenomenal at recall, but very low-precision. (A sketch of high-precision style rules follows the list.)
  • Sub-audible speech (i.e., subvocalization). A lot of people use Flow around others, and it’s possible to speak so quietly into a microphone that nobody around you can hear you. Flow already works reasonably well in these settings, but with a higher error rate than normal speech, because no ASR system has been trained to solve this problem. The data is also nearly impossible to label through expert labeling sources: the audio is so quiet that, without context, it’s not possible to figure out what a user said.
  • New habit formation. People are generally used to typing on their computers, not speaking to them, and Flow is an invisible user interface. Finding ways to teach wider (and less technical) audiences to build the habit of speaking to their computers instead of typing will require innovation in UX and design. Design for new habit formation matters for the core voice dictation habit, but also for the other voice interface features we introduce over time.
  • Leveraging context. Given our latency requirements (200ms LLM inference) and privacy constraints (most personalization data must live on a user’s device), we have to determine how to represent and store context information (e.g., a user’s dictation history or current app context) in a way that is informative to the ASR model or the LLM. (A toy on-device context store follows the list.)
  • Communicating uncertainty. The magic of voice comes from not having to review your outputs: you can use them immediately. That’s not always the case, though, and we want users to know when (and what) to review. This requires innovation on both the UX and the modeling side, for calibrated uncertainty (a small calibration sketch follows the list).
  • Multilingual code-switching. Around the world, most people speak multiple languages, often using more than one language in the same utterance. Almost no ASR models to date handle this particularly well, and while most speech-LLMs may understand the gist of what a person said, they struggle to transcribe multiple languages in the same sentence.
  • Scale. Users today dictate 1 billion words a month with Flow, and we expect this to grow by another 10x in the next six months. Processing this data with 99.99% uptime and ultra-low latency is key to providing a stellar UX. (A back-of-the-envelope load calculation follows the list.)
  • Hardware devices for using voice everywhere. Very soon, we’re going to hit a point where users want to use voice everywhere - including around other people. This will require innovation in hardware, form factors, and UX to design the right device for these needs. We won’t get to tackle this problem until late 2026 (at the earliest), but solving it will be necessary for rolling out voice interfaces to billions of users.
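
To make the latency budget concrete, here is a minimal sketch of a stage-budgeted pipeline. The budgets come from the numbers above; the stage functions, sleep times, and structure are illustrative assumptions, not Flow’s actual architecture.

```python
import asyncio

# Budgets from the post: ~700ms end-to-end after the user stops speaking.
# The 200ms network leg is transport time and isn't modeled here.
BUDGET_MS = {"asr": 200, "llm": 200, "network": 200}

async def run_stage(name: str, coro):
    """Run one stage, failing loudly if it blows its latency budget."""
    return await asyncio.wait_for(coro, timeout=BUDGET_MS[name] / 1000)

async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.05)  # placeholder for E2E ASR inference
    return "hello world"

async def format_text(raw: str) -> str:
    await asyncio.sleep(0.05)  # placeholder for E2E LLM formatting
    return raw.capitalize()

async def pipeline(audio: bytes) -> str:
    raw = await run_stage("asr", transcribe(audio))
    return await run_stage("llm", format_text(raw))

print(asyncio.run(pipeline(b"")))  # raises TimeoutError if a stage overruns
```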
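One standard way to fold a user’s history into ASR decisions (the ambiguity bullet) is shallow-fusion rescoring of an n-best list. The toy hypotheses, word counts, and fusion weight below are all made up; the point is only how a context prior can flip the decision on a one-word clip.

```python
import math

# Toy n-best list for a one-word clip: (hypothesis, acoustic log-prob).
n_best = [("wispr", -1.2), ("whisper", -0.9), ("whisker", -2.5)]

# Hypothetical user history: words this user dictates often.
user_history = {"wispr": 40, "flow": 25, "latency": 10}
total = sum(user_history.values())

def context_logp(word: str, alpha: float = 0.5) -> float:
    # Smoothed log-probability of the word under the user's history.
    return math.log((user_history.get(word, 0) + alpha) / (total + alpha * 1000))

# Shallow fusion: combine the acoustic score with the context prior.
LAMBDA = 0.6
best = max(n_best, key=lambda h: h[1] + LAMBDA * context_logp(h[0]))
print(best[0])  # "wispr": history overrides the acoustically likelier "whisper"
```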
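For learning from corrections, the first step, capturing edits on-device, can be as simple as diffing what Flow produced against what the user kept. A minimal sketch using Python’s standard difflib; production edit capture would need to be far more robust:

```python
import difflib

def extract_edits(before: str, after: str) -> list[tuple[str, str]]:
    """Extract (original, replacement) pairs from a user's correction."""
    a, b = before.split(), after.split()
    edits = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op != "equal":
            edits.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return edits

# What Flow wrote vs. what the user actually kept:
print(extract_edits("lets sync on the whisper flow launch",
                    "let's sync on the Wispr Flow launch"))
# [('lets', "let's"), ('whisper flow', 'Wispr Flow')]
```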
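One way to recover precision on token-level formatting is to compile the preferences an LLM has recalled into deterministic rules that run after generation; hard rules never misfire on the patterns they encode. The preference names and regexes below are hypothetical:

```python
import re

# Hypothetical per-user style preferences, learned from past corrections.
PREFS = {
    "prefer_dash_over_comma_before_clause": True,
    "lowercase_after_colon": True,
}

def apply_style(text: str) -> str:
    if PREFS["prefer_dash_over_comma_before_clause"]:
        text = re.sub(r", (but|and|so) ", r" - \1 ", text)
    if PREFS["lowercase_after_colon"]:
        text = re.sub(r": ([A-Z])", lambda m: ": " + m.group(1).lower(), text)
    return text

print(apply_style("Quick note: We shipped it, but latency regressed."))
# "Quick note: we shipped it - but latency regressed."
```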
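For leveraging context, here is a toy illustration of an on-device context store that packs recent dictations and the current app into a small prompt. The class, fields, and token budget are assumptions, not Flow’s design:

```python
from collections import deque

class OnDeviceContext:
    """Keep personalization data local and cheap to serialize."""

    def __init__(self, max_snippets: int = 50):
        self.recent = deque(maxlen=max_snippets)  # recent dictations
        self.app = ""                             # current app context

    def record(self, text: str) -> None:
        self.recent.append(text)

    def to_prompt(self, token_budget: int = 200) -> str:
        # Pack the freshest snippets that fit the budget, newest first.
        out, used = ["app: " + self.app], 0
        for snippet in reversed(self.recent):
            cost = len(snippet.split())
            if used + cost > token_budget:
                break
            out.append(snippet)
            used += cost
        return "\n".join(out)

ctx = OnDeviceContext()
ctx.app = "Slack"
ctx.record("shipping the new ASR model on Thursday")
print(ctx.to_prompt())
```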
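For communicating uncertainty, a crude sketch of collapsing per-token log-probabilities into one reviewable confidence score. Proper calibration (e.g., temperature scaling) is fit on held-out data and applied to logits; the temperature here is invented, and applying it to log-probs is a simplification.

```python
import math

def calibrated_confidence(token_logprobs: list[float],
                          temperature: float = 1.8) -> float:
    """Geometric-mean token probability after flattening overconfident scores."""
    scaled = [lp / temperature for lp in token_logprobs]
    return math.exp(sum(scaled) / len(scaled))

print(round(calibrated_confidence([-0.01, -0.02, -0.05]), 3))  # ~0.99: skip review
print(round(calibrated_confidence([-0.40, -1.60, -0.90]), 3))  # ~0.58: flag for review
```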
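Finally, the scale numbers imply a load you can estimate on the back of an envelope (the words-per-dictation figure is an assumption):

```python
# Back-of-the-envelope load from the numbers in the post.
words_per_month = 1_000_000_000
growth = 10

words_per_sec = words_per_month / (30 * 24 * 3600)
print(round(words_per_sec))  # ~386 words/sec, sustained average today

# At an assumed ~30 words per dictation:
dictations_per_sec = words_per_sec / 30
print(round(dictations_per_sec * growth))  # ~129 dictations/sec average after 10x
```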

As we tackle and solve these problems, we open up opportunities for Flow to be more than a voice interface that writes for you: one that can both do things for you and proactively help you.

If these problems sound exciting to you, reach out at jobs.wisprflow.ai!
