Our goal at Wispr Flow is to build the first voice interface that a billion people use every day as their primary interaction with their devices. If we can, we'll make interacting with technology feel natural and effortless.
This idea is not new. The key question is: how do we get there?
We break our plan into three phases:
- Reliable voice input
- Voice to action
- Ubiquity through wearables
The three phases
Phase one: reliable voice input.
The most frequent action on our devices is typing & communicating. If I can't trust voice as an input method, I'll never trust it for anything more complicated. While voice input has existed for decades, we're far from a billion people relying on it as their primary input method.
For voice to be reliable, it can't make mistakes: every mistake feels like a paper cut, and users will only tolerate so many before they go back to something they can depend on. To solve this, we have to build the most accurate, personalized voice models in the world. We frame this as pursuing a "zero edit rate": you shouldn't have to edit anything before hitting send (including going from your rambles to a coherent output). Our approach: once you correct the same mistake twice, you should never have to correct it again. No voice models today support this; we are training context- and memory-driven voice models to make it possible.
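To make the "twice, never again" contract concrete, here is a toy sketch in Python. The names are hypothetical, and a real system would learn this behavior inside the model rather than as a lookup table; this only illustrates the user-facing promise:

```python
# Toy illustration of the "correct it twice, never again" rule.
# A real system learns this inside the model; this is just the contract.
from collections import Counter

class CorrectionMemory:
    def __init__(self, threshold: int = 2):
        self.counts: Counter[tuple[str, str]] = Counter()
        self.learned: dict[str, str] = {}
        self.threshold = threshold

    def record_correction(self, heard: str, meant: str) -> None:
        self.counts[(heard, meant)] += 1
        if self.counts[(heard, meant)] >= self.threshold:
            self.learned[heard] = meant  # never make this mistake again

    def apply(self, text: str) -> str:
        for heard, meant in self.learned.items():
            text = text.replace(heard, meant)
        return text

mem = CorrectionMemory()
mem.record_correction("whisper flow", "Wispr Flow")
mem.record_correction("whisper flow", "Wispr Flow")   # second correction
print(mem.apply("open whisper flow"))                 # -> "open Wispr Flow"
```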
Phase two: voice to action.
As people use voice for input everywhere on their devices, they build the desire for voice to drive actions. Everyone has approached this the same way: building a general-purpose assistant that does a little bit of everything, and as a result does very little well. This is true of Alexa and Siri, but also of next-gen HCI prototypes like ChatGPT Atlas and Perplexity Comet. Nobody wants an "AI operating system"; they want their life to be a little easier.
These approaches all assume users come with clear, high-intent commands: "book me a restaurant," "set a timer," "send this email." But that's a tiny fraction of how people think. The vast majority of what people want to express lives in a messy middle — half-formed thoughts, ambiguous intent, things you're thinking about but don't yet know what to do with. Voice to action is a spectrum of intent:
- High-clarity intent → transactional actions. "Send this message." "Book a dinner at 7." You know what you want. Wispr executes.
- Low-clarity intent → a second brain that works while you rest. "I've been thinking about restructuring the eng org..." You don't know what you want yet. Wispr captures those thoughts, organizes them, surfaces connections, and eventually proposes actions on your behalf.
Most of what people say falls somewhere in between, and nobody has built for that middle. Existing assistants only handle the high-clarity end (and poorly).
We'll solve this by building first-party interactions for specific problems users experience on their devices. Instead of building a general solution, we'll make sure that 50 people who desperately want that problem solved fall in love with the UX. If we can do that, we'll invent the new HCI patterns required to make voice ubiquitous (hint: it won't be remembering 100 commands, and it will probably require some type of dynamic generative UI).
For example: today, many of our users dictate into ChatGPT to improve their writing, copy the result, and paste it back. That's a workflow begging to be collapsed into a single interaction: improve your writing in place, with your voice. Others have tried this (including Apple Intelligence), but it's been poorly executed because they shipped the feature and moved on. Our focus is on optimizing the activation and engagement that turns a feature into a habit. From there, we open up a developer ecosystem for the hundreds of voice actions people want to do a few times a day.
Phase three: ubiquity through wearables.
Once we can rely on voice for input and actions, we no longer need to take a screen out of our pockets or look at our laptops for every interaction. Wispr will power wearables that let you take voice everywhere.
The device we want to build is a smart ring with a capacitive touchpad for cursor control and clicking — an input device for pointing, clicking, and voice. Critically, it enables private voice interaction: you can sub-vocalize quietly into the ring, even sitting next to others in a meeting or on a train.
This unlocks two modes. Paired with a laptop or phone, the ring gives you sub-vocal voice input everywhere, turning any environment into one where voice works. On its own, it still works without a screen, because the whole point of the second brain is offloading thoughts that spontaneously come to mind, and most of those don't require a visual response.
This also plays well with the broader wearable ecosystem. Smart glasses with displays are being built by other companies — we don't need to build that hardware ourselves. For audio-only form factors like the ring, we build the hardware. For display devices, we build the software. Over time, Wispr becomes the voice interaction layer for all of these devices. Any device with a microphone, internet connection, and optionally a display becomes a computer.
People will love using voice so much that they'll want a wearable to take it everywhere — but that need doesn't exist yet.
Mistakes Wispr Flow and others have made
Most technology companies have approached this problem in the opposite order — starting with phase three or phase two first — and even we made the same mistake. As a result, they run into timing and habit formation challenges with the product because people aren't ready for it. The product doesn't take off, and everyone wonders why. After all, the tech is so cool!
For example: Humane and many other AI wearable companies started by building the hardware first. But until users understand why they'd want to bring a voice interaction with them everywhere, it's hard to get them hooked on not just a new hardware device, but also a new interaction model! Had they built software that was sticky first, then released hardware to make it ubiquitous, they would have been much more successful.
We also made this mistake at Wispr: we spent three years building a wearable brain-computer interface to solve the problem of using voice around others. After years of R&D, we'd made progress on a wearable silent speech interface that decoded brain and neuromuscular signals to let people use voice everywhere. As the R&D matured, we ran into a challenge: we ourselves didn't want to use voice even when we were alone in a closed room. So we started building software for the hardware.
At that point, we made the second mistake: starting with phase two, voice to action. Internally, we built an awesome assistant: I could say "book me a dinner with Tanay at 6 at a restaurant near our office and send a calendar invite," and it would do it. But unfortunately, I only ever took actions like that once every few weeks. We only ever "tested" the software; we never used it. Other voice assistants like Alexa and Siri have the same problem: at most, people use them a few times a day, which isn't enough to build a habit of voice.
That's when we came back to phase one, reliable voice input, and we saw our own behaviors shift for the first time. As we built highly accurate voice dictation, we also realized how much more work there was to bring the voice models to the quality needed to get a billion people to ditch their keyboards. After all, we're competing against a tool people have been using for decades.
Ultimately, each phase is both a technological and a psychological prerequisite for the next. Phase one primes users to trust voice, so they feel comfortable offloading their thoughts and actions to it. The first two together prime users to want that interaction everywhere they go.
Market size
Eight billion people type every day; the market for voice interaction is enormous. In phase one, we monetize through subscriptions: $12-15/mo for individuals and $24/mo/seat for enterprises, comparable to other tools that improve a daily workflow (Grammarly, Spotify, productivity software). Even at these rates, the math gets large quickly. At 1B DAU with a blended ARPU of ~$40/user across individual and enterprise plans, that implies north of $40Bn ARR, and a $400Bn+ company under reasonable assumptions about growth rates and gross margins. For reference, Spotify, Google, and Meta monetize at $50-200 per user annually, so $40 blended is conservative.
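For concreteness, here is the back-of-the-envelope math under the assumptions above (the ~10x revenue multiple is an illustrative assumption, not a forecast):

```python
# Back-of-the-envelope ARR math under the assumptions stated above.
dau = 1_000_000_000       # target daily active users
blended_arpu = 40         # blended annual revenue per user, USD
                          # (blend of free, $12-15/mo individual,
                          #  and $24/mo/seat enterprise users)

arr = dau * blended_arpu  # implied annual recurring revenue
print(f"Implied ARR: ${arr / 1e9:.0f}B")  # -> Implied ARR: $40B

# A ~10x revenue multiple (our assumption, not a claim from the text)
# implies a valuation north of $400B.
print(f"Implied valuation at 10x: ${arr * 10 / 1e9:.0f}B")
```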
Phase two increases ARPU by expanding what users are willing to pay for. Voice input is a productivity tool; voice to action and a second brain become something closer to an operating system for how you work and think. The exact monetization model will become clear as the product matures, but the directional logic is simple: the more of your workflow Wispr handles, the more valuable it becomes, and the higher the willingness to pay.
Phase three adds hardware revenue on top of software subscriptions. We won't model that here, but a smart ring with an ongoing software subscription is a meaningfully different business than software alone.
Importantly, phases two and three don't just increase ARPU; they also increase retention. Each phase makes the product harder to leave, which means the user base compounds rather than churns. That's the difference between a large business and a durable one.
This raises two questions: can we execute on this, and if we do, can we build an enduring business?
Why us
The most frequent question we get: why work on this when Apple, Google, OpenAI, and Anthropic are all in the same space?
Based on the mistakes outlined above, I don't think we'll see this transformation unless we work on it. Nobody is tackling phase one effectively — and with anything short of a best-in-class voice experience, people will give up on voice. The gap between "pretty good" and "perfect" is enormous for something you use hundreds of times a day. Plenty of operating systems ship with built-in text editors, but people still pay for specialized tools because when you're in a tool all day, the difference between adequate and excellent is everything. That's where our entire business lives.
Companies innovating on HCI are building tools that do everything, instead of building actions that truly improve people's workflows. They're building for power users who will restructure their lives around a new tool, not for the billion people who just want things to work. Companies building wearables are shipping hardware without the software that gives people a reason to use it.
Getting this right requires a team that operates as both a consumer product company and a research lab: world-class growth and interaction design alongside deep ML research on voice models, all pointed at the same set of problems. You need people who care about activation and habit formation just as much as model architecture — and all of them have to be intensely focused on voice. That combination is extremely hard to assemble, and even harder to maintain if your attention is split across many product lines. If we dilute focus, it won't work. If larger companies dilute focus, it won't work either. That's our advantage.
We focus on problems where we'll have a differential impact: if we didn't solve them, the solutions might never exist. If these problems were on the default path for other companies, we wouldn't waste our time.
Durability of the business
Even if we're the right team to build this, can the business endure once others recognize the value?
Durability through compounding personalization.
The moat in phase one isn't just accuracy — it's that personalization compounds. Every correction teaches Wispr not just a single fix, but something deeper about how you think and speak: your vocabulary, your cadence, the way you structure ideas, the names and contexts that matter to you. Over months and years, Wispr builds a model of you that isn't a list of corrections — it's a representation of your voice as an individual. That's exponentially harder to replicate than matching our base accuracy, and it gets harder the longer someone uses the product.
This is why we train our own voice models. The cases where voice fails today aren't where the model is broadly bad — they're where the audio alone is genuinely ambiguous. When you mumble a name, speak quickly in a noisy room, or sub-vocalize into a ring, the audio signal isn't enough to resolve what you said. The only way to get it right is to reason over context — what's on your screen, who you're talking to, what you were just working on — and to personalize over time. No existing voice model supports this. They treat transcription as a pure audio-in, text-out problem, with no ability to incorporate memory or context into decoding. That's why we have to build it ourselves.
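As a sketch of what context-conditioned decoding means in practice, here is a toy reranker that scores candidate transcripts against on-screen context instead of audio alone. The names and scoring are illustrative, not our production architecture:

```python
# Illustrative sketch: rerank ASR hypotheses using context, not audio alone.
# All names here are hypothetical, not Wispr's production architecture.

def rerank(hypotheses: list[tuple[str, float]],
           context_terms: set[str],
           boost: float = 2.0) -> str:
    """Pick the transcript whose words best match the user's context
    (on-screen names, recent documents, contacts)."""
    def score(text: str, audio_score: float) -> float:
        matches = sum(1 for word in text.split() if word in context_terms)
        return audio_score + boost * matches
    return max(hypotheses, key=lambda h: score(h[0], h[1]))[0]

# "Tanay" is acoustically ambiguous; context resolves it.
hypotheses = [("send it to ten eh", -1.2), ("send it to Tanay", -1.5)]
context = {"Tanay", "deck", "investor"}      # e.g. names visible on screen
print(rerank(hypotheses, context))           # -> "send it to Tanay"
```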
Training our own models also gives us a data flywheel no one else can replicate. When a user edits Wispr's output after dictating, that correction captures something uniquely valuable: the user's actual intent. It's not just a signal that the model got something wrong — it reveals preferences, vocabulary, the way they want their thoughts expressed. If we hired annotators to label this data, they'd have no way of knowing what the user actually meant. Only the user knows, and they tell us every time they make a correction. This is what allows us to personalize at the individual level, not just improve the model for everyone.
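In data terms, every dictate-then-edit event yields a supervised example for free. A toy sketch of that flywheel, with hypothetical field names:

```python
# Toy sketch of the correction flywheel: each user edit is a labeled example.
# Field names are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class DictationEvent:
    audio_id: str      # reference to the raw audio
    model_output: str  # what the model produced
    final_text: str    # what the user actually sent after editing

def to_training_pair(event: DictationEvent) -> dict | None:
    """An unedited output confirms the model; an edited one reveals intent
    that no third-party annotator could have labeled."""
    if event.model_output == event.final_text:
        return None  # implicit positive signal; nothing to correct
    return {
        "input": event.audio_id,
        "target": event.final_text,  # ground-truth intent, from the user
    }
```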
Durability through the second brain.
True long-term durability comes from the second brain. As people store more of their thoughts in Wispr — ideas, plans, half-formed intentions, things they're working through — switching costs compound in a fundamentally different way. It's not just that your data lives in Wispr; it's that Wispr understands the relationships between your thoughts and can act on them proactively. Six months in, when you mention "the deck," Wispr knows you mean the Series C investor presentation. When you open a doc to write a product brief, it surfaces the idea you offloaded on a walk three weeks ago because it recognizes the connection before you do. That memory is deeply personal and non-transferable. Leaving Wispr wouldn't mean losing a tool — it would mean losing an extension of how you think.
Durability through hardware and platform independence.
As software, Wispr lives on top of operating systems controlled by Apple, Google, and Microsoft. Any of them could restrict API access or ship a competing feature and preference it. For a company targeting a billion users, full dependence on someone else's OS is an existential risk. The smart ring — and the software layer we build for other wearable platforms — gives us a direct hardware relationship with the user that no platform can mediate. Beyond platform independence, hardware fundamentally changes engagement depth. Software-only Wispr works when you're at your laptop or on your phone. The ring means Wispr is with you for the vast majority of your day — walking, commuting, in meetings, cooking dinner. Wispr stops being an app you open and becomes an ambient layer you speak to throughout your day. That's not incrementally more usage; it's a different kind of relationship.
Is there risk? Of course. But if we do this right, we stand to transform technology for a billion users and build a trillion-dollar company along the way.
In early 2026, we are at the inflection point between phase one and phase two. We're training our own large-scale voice models for the first time, building a habit engine to help new users learn to use voice, and prototyping the first voice actions with our power users. The ML research is just kicking off, the voice-to-action work is in its earliest days, and the team is small enough that every person shapes what gets built. If you want to be on the ground floor of inventing the next generation of HCI, come join us.

