The Master Plan: how to build a voice interface for a billion people

Sahaj Garg

Mar 3, 2026

Our goal at Wispr Flow is to build the first voice interface that a billion people use every day as their primary interaction with their devices. If we can, we'll make interacting with technology feel natural and effortless.

This idea is not new. The key question is: how do we get there?

We break our plan into three phases:

Reliable voice input
Voice to action
Ubiquity through wearables

The three phases

Phase one: reliable voice input.

The most frequent action on our devices is typing and communicating. If I can't trust voice as an input method, I'll never trust it for anything more complicated. While voice input has existed for decades, we're far from a billion people relying on it as their primary input method.

For voice to be reliable, it can't make mistakes: every mistake feels like a paper cut. Users will only tolerate so many before they go back to something they can depend on. To solve this, we have to build the most accurate, personalized voice models in the world. We frame this as pursuing a "zero edit rate": you shouldn't have to edit anything before hitting send (and that includes turning your rambles into a coherent output). Our approach: once you correct the same mistake twice, you should never have to correct it again. No voice models today support this; we are training context- and memory-driven voice models to make it possible.
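
The "correct it twice, never again" rule can be illustrated with a toy sketch. This is not how Wispr implements it (the text describes trained context- and memory-driven models, not a lookup table); the class name `CorrectionMemory`, its methods, and the word-level matching are all hypothetical and exist only to make the behavior concrete.

```python
from collections import Counter


class CorrectionMemory:
    """Toy illustration: apply a correction automatically once it has been seen twice.

    Hypothetical sketch only. Matching is word-level and case-insensitive,
    and punctuation attached to a word is not handled.
    """

    def __init__(self, threshold: int = 2):
        self.threshold = threshold   # corrections needed before auto-applying
        self.counts = Counter()      # (heard, intended) -> times corrected
        self.learned = {}            # heard -> intended, once the threshold is reached

    def record_correction(self, heard: str, intended: str) -> None:
        """Register that the user changed `heard` to `intended`."""
        key = (heard.lower(), intended)
        self.counts[key] += 1
        if self.counts[key] >= self.threshold:
            self.learned[heard.lower()] = intended

    def apply(self, transcript: str) -> str:
        """Rewrite a raw transcript using every correction learned so far."""
        return " ".join(self.learned.get(w.lower(), w) for w in transcript.split())


memory = CorrectionMemory()
memory.record_correction("whisper", "Wispr")
first = memory.apply("open whisper flow")    # one correction: not applied yet
memory.record_correction("whisper", "Wispr")
second = memory.apply("open whisper flow")   # two corrections: applied from now on
print(first)
print(second)
```

A static table like this cannot tell when "whisper" really is the intended word, which is the kind of ambiguity that requires context and is why the text argues for models rather than rules.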

Phase two: voice to action.

As people use voice for input everywhere on their devices, they build the desire for voice to drive actions. Everyone has approached this the same way: building a general-purpose assistant that does a little bit of everything, and as a result does very little well. This is true of Alexa and Siri, but also of next-gen HCI prototypes like ChatGPT Atlas and Perplexity Comet. Nobody wants an "AI operating system"; they want their life to be a little easier.

These approaches all assume users come with clear, high-intent commands: "book me a restaurant," "set a timer," "send this email." But that's a tiny fraction of how people think. The vast majority of what people want to express lives in a messy middle — half-formed thoughts, ambiguous intent, things you're thinking about but don't yet know what to do with. Voice to action is a spectrum of intent:

High-clarity intent → transactional actions. "Send this message." "Book a dinner at 7." You know what you want. Wispr executes.
Low-clarity intent → a second brain that works while you rest. "I've been thinking about restructuring the eng org..." You don't know what you want yet. Wispr captures those thoughts, organizes them, surfaces connections, and eventually proposes actions on your behalf.

Most of what people say falls somewhere in between, and nobody has built for that middle. Existing assistants only handle the high-clarity end (and poorly).

We'll solve this by building first-party interactions for specific problems users experience on their computers. Instead of building a general solution, we'll pick one problem at a time and make sure that 50 people who desperately want it solved fall in love with the UX. If we can do that, we'll invent the new HCI patterns required to make voice ubiquitous (hint: it won't be remembering 100 commands, and it will probably require some type of dynamic generative UI). For example: today, many of our users dictate into ChatGPT to improve their writing, copy the result, and paste it back. That's a workflow begging to be collapsed into a single interaction: improve your writing in place, with your voice. Others have tried this (including Apple Intelligence), but it's been poorly executed because they shipped the feature and moved on. Our focus is on optimizing the activation and engagement that turn a feature into a habit. From there, we open up a developer ecosystem for the hundreds of voice actions people want to do a few times a day.

We're not waiting for users to give us a perfect command. We're building for how people actually think.

Phase three: ubiquity through wearables.

Once we can rely on voice for input and actions, we no longer need to take a screen out of our pockets or look at our laptops for every interaction. Wispr will power the wearables that let you take voice everywhere.

An ecosystem of wearable hardware is emerging — smart rings, smart watches, smart glasses — each with different capabilities. Smart glasses offer display-based interactions and entertainment. Audio-only devices like rings and earbuds enable private voice input through sub-vocalization, even when you're sitting next to others. Some will pair with a phone or laptop; others will work standalone for capturing thoughts on the go, because the second brain doesn't require a screen.

Our goal is to power all of them. The specific question of whether we build hardware in-house or partner with other manufacturers is a decision that will become clear over time as the ecosystem matures. What matters is building the operating system that makes these devices useful and sticky — the voice interaction layer that turns wearable hardware from a novelty into a daily habit. Without great software, wearables are gadgets. With it, they become extensions of how you think and act.

People will love using voice so much that they'll want a wearable to take it everywhere — but that need doesn't exist yet.

Mistakes Wispr Flow and others have made

Most technology companies have approached this problem in the opposite order, starting with phase three or phase two, and even we made the same mistake. As a result, they run into timing and habit-formation challenges with the product because people aren't ready for it. The product doesn't take off, and everyone wonders why. After all, the tech is so cool!

For example: Humane and many other AI wearable companies started by building the hardware first. But until users understand why they'd want to bring a voice interaction with them everywhere, it's hard to get them hooked on not just a new hardware device, but also a new interaction model! Had they built software that was sticky first, then released hardware to make it ubiquitous, they would have been much more successful.

We also made this mistake at Wispr: we spent three years building a wearable brain-computer interface to solve the problem of using voice around others. After years of R&D, we'd made progress on a wearable silent speech interface that decoded brain and neuromuscular signals to let people use voice everywhere. As the R&D matured, we ran into a challenge: we ourselves didn't want to use voice even when we were in a closed room, so we started building software for the hardware.

At that point, we made the second mistake: starting with phase two, voice to action. Internally, we built an awesome assistant: I could say "book me a dinner with Tanay at 6 at a restaurant near our office and send a calendar invite," and it would do it. But unfortunately, I only ever took actions like that once every few weeks. Internally, we only "tested" the software; we never used it. Other voice assistants, like Alexa and Siri, have the same problem: at most, people use them a few times a day, which is not enough to build a habit of voice.

That's when we came back to phase one, reliable voice input, and we saw our own behaviors shift for the first time. As we built highly accurate voice dictation, we also realized how much more work there was to bring the voice models to the quality needed to get a billion people to ditch their keyboards. After all, we're competing against a tool people have been using for decades.

Ultimately, each phase is both a technological and psychological prerequisite for the next. Phase one primes users to trust voice, so they feel comfortable offloading their thoughts and actions to it. Phases one and two together prime users to want that interaction everywhere.

Market size

Billions of people type every day, so the market for voice interaction is enormous.

In phase one, we monetize through subscriptions: $12-15/mo for individuals and $24/mo/seat for enterprises, comparable to other tools that improve a daily workflow (Grammarly, Spotify, productivity software). Even at these rates, the math gets large quickly. At 1B DAU with a blended ARPU of ~$40/user per year across individual and enterprise plans, that implies about $40Bn ARR, which supports a $400Bn+ company under reasonable assumptions about growth rates and gross margins. For reference, Spotify, Google, and Meta monetize at $50-200 per user annually, so $40 blended is conservative.
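
The arithmetic behind these figures can be checked in a few lines. The user count, ARPU, and list prices come from the paragraph above; the 10x revenue multiple is an assumption inferred from the ratio between the $40Bn ARR and $400Bn valuation figures, not a number stated in the text.

```python
# Figures taken from the paragraph above.
daily_active_users = 1_000_000_000    # 1B DAU
blended_arpu_per_year = 40            # ~$40 per user per year, blended across plans

# Assumption (not stated in the text): the $400Bn+ valuation corresponds to
# roughly a 10x multiple on annual recurring revenue.
assumed_revenue_multiple = 10

# ARR is simply users times annual revenue per user.
arr = daily_active_users * blended_arpu_per_year
valuation = arr * assumed_revenue_multiple

# List prices annualized, to compare against the blended figure.
individual_annual = tuple(monthly * 12 for monthly in (12, 15))   # (144, 180)
enterprise_annual = 24 * 12                                       # 288 per seat

print(f"ARR: ${arr / 1e9:.0f}Bn")
print(f"Implied valuation at {assumed_revenue_multiple}x: ${valuation / 1e9:.0f}Bn")
print(f"Individual list price per year: ${individual_annual[0]}-{individual_annual[1]}")
print(f"Enterprise list price per seat per year: ${enterprise_annual}")
```

The annualized list prices ($144-180 for individuals, $288 per enterprise seat) are well above the $40 blended figure, which is what makes the blended assumption conservative.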

Phase two increases ARPU by expanding what users are willing to pay for. Voice input is a productivity tool; voice to action and a second brain become something closer to an operating system for how you work and think. The exact monetization model will become clear as the product matures, but the directional logic is simple: the more of your workflow Wispr handles, the more valuable it becomes, and the higher the willingness to pay.

Phase three adds hardware revenue on top of software subscriptions. We won't model that here, but a wearable with an ongoing software subscription is a meaningfully different business than software alone.

Importantly, phases two and three don't just increase ARPU; they increase retention. Each phase makes the product harder to leave, which means the user base compounds rather than churns. That's the difference between a large business and a durable one.

Why us

The most frequent question we get: why work on this when Apple, Google, OpenAI, and Anthropic are all in the same space?

Nobody is tackling phase one effectively — and with anything short of a best-in-class voice experience, people will give up on voice. The gap between "pretty good" and "perfect" is enormous for something you use hundreds of times a day. Plenty of operating systems ship with built-in text editors, but people still pay for specialized tools because when you're in a tool all day, the difference between adequate and excellent is everything. That's where our entire business lives.

Companies innovating on HCI are building tools that do everything, instead of building actions that truly improve people's workflows. They're building for power users who will restructure their lives around a new tool, not for the billion people who just want things to work. Companies building wearables are doing it without software that gives people a reason to use them. None of them will dedicate an entire company's focus to this problem, because for them it's a feature, not a business. If we dilute focus, it won't work. If they dilute focus, it won't work either.

The evidence that this approach works is already here. Wispr Flow has hundreds of thousands of daily active users, growing 40% month over month. We focus on problems where we'll have a differential impact: if we didn't solve it, it might not exist. If these problems were on the default path for other companies, we wouldn't waste our time.

Durability of the business

Even if we're the right team to build this, can the business endure once others recognize the value?

We think about durability through the lens of Hamilton Helmer's Seven Powers — the sources of strategic advantage that allow a business to sustain differential returns over time. Not all powers exist from day one. The question is which powers we have today, how we use them to buy time, and what powers we build into.

Counter-positioning (today)

Our strongest power today is counter-positioning. Big tech companies could build what we're building, but doing so would require them to fire their entire voice teams and rewrite their approach from scratch. Apple, Google, and others have spent years building general-purpose voice systems architected around a fundamentally different philosophy — command-based, non-personalized, audio-in text-out. Building context-aware, memory-driven, personalized voice models isn't a feature they can bolt on; it's a different architecture with different training data, different product assumptions, and different organizational expertise. On top of that, none of them will dedicate an entire company's focus to making voice input perfect, because for them it's a feature, not a business.

There's a second layer to this counter-positioning. We expect people to use many different AI agents over time — Claude and Claude Code for some tasks, Codex for others, xAI for entertainment, enterprise AI agents for specific workflows. The AI landscape is fragmenting, not consolidating. If Anthropic or OpenAI or Google builds a voice interface, they will never route your request to a competing model provider. So even if you love Claude's voice experience, if you prefer OpenAI for coding or xAI for something else, you're stuck — their voice layer only works with their own stack. Wispr is model-agnostic. We sit on top of all of these and let you use voice across whatever AI tools you choose. No AI company will build that, because it means promoting competitors.

This creates a window to scale aggressively while incumbents are structurally unable to respond. That window exists as long as incumbents treat voice as a feature rather than a business — and it closes when someone with distribution decides to go all-in. We don't know exactly when that happens, which is why the urgency to scale isn't strategic, it's existential. Counter-positioning buys us time, but it doesn't last forever. We have to use the window to build powers that endure.

Cornered resource (tentative)

We may have a second power today in the founding team. Sahaj and Tanay sit at an unusual intersection: ML research, product engineering, growth, consumer product, HCI, and interaction design — and we treat growth as a science, not an afterthought. Building a product like this requires all of those disciplines working in concert. Most companies have leaders who can drive one side (consumer product engineering) or the other (ML research), but rarely both. This is a presumptuous argument, and we'll only know if it's true in retrospect. But if it is, it's part of why we've been able to attract talented leaders across product engineering, HCI, sales, and marketing — and it's part of what makes the execution possible.

Switching costs (the long-term goal)

Counter-positioning and cornered resource buy us time. The power we're building toward is switching costs — and each phase deepens them.

In phase one, switching costs come from compounding personalization. Every correction teaches Wispr something deeper about how you think and speak: your vocabulary, your cadence, the way you structure ideas, the names and contexts that matter to you. Over months and years, Wispr builds a model of you that isn't a list of corrections — it's a representation of your voice as an individual. Leaving means going back to a product that makes mistakes you'd forgotten were possible. Death by a thousand paper cuts.

This is why we train our own voice models. The cases where voice fails today aren't where the model is broadly bad — they're where the audio alone is genuinely ambiguous. When you mumble a name, speak quickly in a noisy room, or sub-vocalize into a ring, the audio signal isn't enough to resolve what you said. The only way to get it right is to reason over context — what's on your screen, who you're talking to, what you were just working on — and to personalize over time. No existing voice model supports this. They treat transcription as a pure audio-in, text-out problem, with no ability to incorporate memory or context into decoding. That's why we have to build it ourselves. And when a user edits Wispr's output, that correction captures something uniquely valuable: the user's actual intent — their preferences, vocabulary, the way they want their thoughts expressed. If we hired annotators to label this data, they'd have no way of knowing what the user actually meant. Only the user knows, and they tell us every time they make a correction.

In phase two, switching costs deepen through the second brain. As people store more of their thoughts in Wispr — ideas, plans, half-formed intentions — it's not just that the data lives in Wispr; it's that Wispr understands the relationships between those thoughts and can act on them proactively. Six months in, when you mention "the deck," Wispr knows you mean the Series C investor presentation. When you open a doc to write a product brief, it surfaces the idea you offloaded on a walk three weeks ago because it recognizes the connection before you do. That memory is deeply personal and non-transferable. Leaving Wispr wouldn't mean losing a tool — it would mean losing an extension of how you think.

In phase three, switching costs compound further through hardware-software integration. When Wispr powers the wearable ecosystem — whether through devices we build ourselves or software we provide to other manufacturers — the voice interaction layer will be designed from the ground up around our voice models, with sub-vocalization tuned to our decoding and interaction patterns shaped by years of HCI research. That tight integration produces a superior experience that can't be replicated by pairing generic hardware with generic software. And because wearables make Wispr ambient — with you walking, commuting, in meetings, cooking dinner — the depth of engagement becomes something a software-only competitor can't match.

That's the reason for building toward phases two and three: each phase increases the switching costs that make the business durable. But switching costs only matter relative to the user base you've already acquired — which is why it's so important to use the counter-positioning window to scale as fast as possible.

Brand (emerging)

The final power we expect to build is brand. Voice is fundamentally a trust game. People will rely on the product they trust most, and trust compounds with every interaction that works and erodes with every one that doesn't. Here, we have an unexpected advantage: the incumbents' brands are actually liabilities on voice. People have had decades of losing trust in Apple, Google, and Amazon's ability to build voice products that work. Siri, Google Assistant, and Alexa have trained users to expect disappointment. If we become the premium voice brand — the one people associate with "it just works" — that brand power reinforces switching costs and gives us pricing power as the market matures.

What we don't claim

We don't have network effects — Wispr is a single-player product today, and the experience doesn't get better because someone else uses it. We may develop process power over time through our ability to systematically build first-party HCI experiences that become habits, but that remains to be proven. We're honest about which powers we have, which we're building, and which we may never have. The thesis is that counter-positioning + switching costs + brand, built in sequence, are enough to sustain a trillion-dollar business — and that the phasing strategy is specifically designed to accumulate those powers in the right order.

Conclusion

Is there risk? Of course. But if we do this right, we stand to transform technology for a billion users and build a trillion-dollar company along the way.

In early 2026, we are at the inflection point between phase one and phase two. We're training our own large-scale voice models for the first time, building a habit engine to help new users learn to use voice, and prototyping the first voice actions with our power users. The ML research is just kicking off, the voice-to-action work is in its earliest days, and the team is small enough that every person shapes what gets built. If you want to be on the ground floor of inventing the next generation of HCI, come join us.
