I thought finding a yoga voice would be easy. It rebuilt my entire app.

How I curated a specific vocal identity so the audio guidance feels supportive, instructive, and calm.

One of the hardest features I built wasn't code. It was the voice.

When I started building Yogakosh, I had a clear picture of what it needed to be: audio-first. Not a screen you follow, but a voice you trust. The kind that guides you through a 30-minute flow the way a good teacher would — steady, precise, knowing when to pause and when to push.

What I underestimated was how hard it would be to find that voice.

The assumption

It was early 2025. AI was everywhere. Text-to-speech had been a solved problem for decades, and generative audio was having its moment. I assumed that finding a high-quality instructional voice for a yoga app would be the least of my worries.

I was wrong.

The challenge wasn't technical, exactly. It was qualitative. A yoga instruction voice isn't just text read aloud; it's a performance. It carries tempo. It breathes. The word "inhale" needs to feel like an invitation, not a command. "Hold here" needs weight behind it. The difference between a voice that guides and a voice that irritates is razor-thin, and most TTS systems sit firmly on the wrong side of that line.

Starting with Apple

Since my app was on iOS, my first instinct was to start with what Apple provides. macOS and iOS ship with a range of system TTS voices, and for a solo founder watching every dollar, free is a very compelling argument.

I tried them all. Every voice, every setting.

[Audio sample: Apple TTS voice narrating a yoga pose]

They were, without exception, robotic in a way that would have actively harmed the user experience. There's a specific kind of flatness to even the best system TTS: a uniformity of rhythm that strips all meaning from the words. A voice that says "warrior two" with the same cadence as "corpse pose" isn't guiding anyone. It's just reading.

When Apple released iOS 26 and its enhanced TTS voices, I tested again, hoping. The improvement was marginal. Even their best voices couldn't hold the quality bar I needed for a practice that asks users to close their eyes and follow sound alone.

There's also a practical constraint worth noting: not all Apple TTS voices are available for use within iOS apps. Some of the more capable voices are locked to system-level use. So even if a voice passed the quality test, it might not pass the implementation test.

The middle ground

I worked through several alternatives. PlayAI was one of them — capable in its own right, but still not landing where I needed it for something as intimate as a yoga session. The voices felt designed for radio or podcasts. Not for the mat.

The criteria I kept coming back to weren't complicated, but they were unforgiving:

Warmth. The voice had to feel like a person who had actually practiced yoga, not a system that had read about it.

Pacing. Pauses are as important as words in yoga instruction. TTS that rushes through cues — or worse, pauses mechanically — breaks the practice immediately.

Emphasis. When the voice says "breathe," it needs to mean it. Flat intonation on instructional language is worse than silence.

Most tools failed on at least one of these. Some failed on all three.

Settling on ElevenLabs

After significant testing, ElevenLabs became the clear answer. Their voice quality — particularly for calm, instructive voices — is genuinely best-in-class. The speech rhythm feels human in a way that other systems still don't. The pauses land. The tone holds.

[Audio sample: ElevenLabs TTS voice]

It isn't cheap. ElevenLabs costs meaningfully more than system TTS and many of the alternatives. For a bootstrapped solo founder, that's a real line item, not a rounding error. But the alternative, shipping a product built around audio with a voice that undermines the experience, wasn't a trade-off I was willing to make. The voice is the product. Compromising it would have compromised everything.

ElevenLabs also became the tool I eventually used to generate the background music layered into the flows. Having both audio elements come from the same platform gave me more consistency in the overall sound design, which mattered for cohesion.

The architecture shift nobody talks about

Choosing ElevenLabs didn't just change the audio quality. It changed the entire technical architecture of the app.

With Apple's TTS, the implementation is almost trivially simple. You wrap a string in an AVSpeechUtterance, pass it to AVSpeechSynthesizer, and the system speaks it. No files, no storage, no pipeline. It's stateless and cheap to run. The trade-off for that simplicity is quality, and as I covered above, that trade-off wasn't acceptable for Yogakosh.

Switching to ElevenLabs meant abandoning on-demand synthesis entirely and building a pre-generation pipeline from scratch. Here's what that actually required:

Pose-level audio structure. Every pose in Yogakosh's library needed its own set of audio cues: not a single instruction but several, including a default cue, variation cues, and, where relevant, separate left-side and right-side instructions. A pose like Low Lunge has different language for leading with the right leg versus the left. That distinction had to be codified into the data model before a single audio file could be generated.
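To make that concrete, here is a minimal sketch of what such a data model could look like. It is written in Python for brevity and is not Yogakosh's actual schema: the names `Side`, `Cue`, `Pose`, and `cue_for`, the field layout, and the cue scripts are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Side(Enum):
    NONE = "none"    # symmetric pose: one cue covers both sides
    LEFT = "left"
    RIGHT = "right"


@dataclass(frozen=True)
class Cue:
    kind: str                         # "default" or a variation name (hypothetical)
    side: Side
    text: str                         # script that would be sent to the TTS voice
    audio_ref: Optional[str] = None   # set once an audio file exists for this cue


@dataclass
class Pose:
    pose_id: str
    name: str
    cues: list[Cue] = field(default_factory=list)

    def cue_for(self, kind: str = "default", side: Side = Side.NONE) -> Cue:
        """Return the cue for this kind and side, falling back to a symmetric cue."""
        for wanted_side in (side, Side.NONE):
            for cue in self.cues:
                if cue.kind == kind and cue.side == wanted_side:
                    return cue
        raise KeyError(f"{self.pose_id}: no cue for kind={kind!r}, side={side.value!r}")


# Placeholder scripts, invented for this example.
low_lunge = Pose(
    pose_id="low_lunge",
    name="Low Lunge",
    cues=[
        Cue("default", Side.RIGHT, "Step your right foot forward and lower your left knee."),
        Cue("default", Side.LEFT, "Step your left foot forward and lower your right knee."),
    ],
)
```

With a shape like this, `low_lunge.cue_for(side=Side.LEFT)` returns the left-leading script, while a symmetric pose only needs a single `Side.NONE` cue that serves both sides.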

Batch generation pipeline. Rather than generating audio on demand per user request — which would have been prohibitively expensive at ElevenLabs' per-character pricing — I built an offline pipeline to generate all cue audio files upfront. Each cue text gets sent to the ElevenLabs API, the resulting audio file is stored, and a reference is written back to the pose data. This runs as a batch process whenever new poses or cues are added to the library.
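A rough sketch of that batch step, under stated assumptions: the `synthesize` argument stands in for whatever function actually calls the TTS provider (the real ElevenLabs request is deliberately not shown), files land in a local folder rather than cloud storage, and `generate_missing_audio` and `manifest.json` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Callable


def generate_missing_audio(
    cues: dict[str, str],
    synthesize: Callable[[str], bytes],
    out_dir: Path,
) -> dict[str, str]:
    """Generate audio for every cue that has no file yet.

    `cues` maps a cue id to its script text. `synthesize` is a placeholder
    for the real TTS call and must return the encoded audio bytes.
    Returns a manifest mapping cue id -> file name.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest: dict[str, str] = {}
    for cue_id, text in sorted(cues.items()):
        target = out_dir / f"{cue_id}.mp3"
        if not target.exists():
            # Only cues without a file trigger a paid synthesis call,
            # so re-running the batch after adding poses stays cheap.
            target.write_bytes(synthesize(text))
        manifest[cue_id] = target.name
    # The manifest is the reference written back for the app to consume.
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Note that this sketch keys files on the cue id alone, so an edited script would not be regenerated until its old file is removed. A real pipeline would likely also want retries and rate limiting around the provider call.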

Storage and referencing. The generated .mp3 files needed a home. That meant setting up cloud storage, a consistent naming convention, and a reference system so the app could reliably retrieve the right audio file for the right cue at the right moment in a flow. None of this infrastructure existed before. The app previously had no concept of an audio asset library.
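One way to get a consistent naming convention is to derive each file's storage key from the cue itself. The scheme below is an assumption, not the convention Yogakosh uses: `audio_key`, the `cues/` prefix, and the idea of hashing the voice and script into the name are all illustrative.

```python
import hashlib
import re


def audio_key(pose_id: str, kind: str, side: str, text: str, voice: str) -> str:
    """Build a deterministic storage key for one cue's audio file."""

    def slug(value: str) -> str:
        # Lowercase, and collapse anything that is not a-z or 0-9 into "-".
        return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")

    # A short content hash changes the key whenever the script or voice changes,
    # so stale audio is never served under a name the app still references.
    digest = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()[:8]
    return f"cues/{slug(pose_id)}/{slug(kind)}_{slug(side)}_{digest}.mp3"
```

Because the key is a pure function of its inputs, the generation pipeline and the app can both compute it, or the pipeline can compute it once and store it on the cue as its reference.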

Flow assembly at runtime. When a user starts a flow, the app sequences the correct cue files for each pose in order, handles timing, and layers the background music underneath — all without any live API calls. The entire audio experience is assembled from pre-generated assets.
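The sequencing logic can be sketched independently of the playback layer. The app itself runs on iOS, so treat this Python version as a description of the idea only: `Step`, `ScheduledCue`, and `build_timeline` are hypothetical names, and the rule that a pose lasts at least as long as its cue is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    audio_file: str       # pre-generated cue file for this pose
    cue_seconds: float    # length of the spoken cue
    hold_seconds: float   # how long the pose should last in total


@dataclass(frozen=True)
class ScheduledCue:
    audio_file: str
    start_at: float       # seconds from the start of the flow


def build_timeline(steps: list[Step]) -> tuple[list[ScheduledCue], float]:
    """Return each cue's start time and the total flow length in seconds."""
    timeline: list[ScheduledCue] = []
    clock = 0.0
    for step in steps:
        timeline.append(ScheduledCue(step.audio_file, clock))
        # A pose never ends before its own cue has finished playing.
        clock += max(step.hold_seconds, step.cue_seconds)
    return timeline, clock
```

The total length is what the background music would be looped or trimmed to, and every entry points at a local or cached file, so nothing here needs a network call.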

I had considered keeping ElevenLabs in the loop at runtime — generating cues on demand as the user moves through a flow. The quality would be identical, and it would have simplified the pipeline significantly. But the cost math didn't work. Pre-generating and caching the audio library was the only way to make the unit economics viable at any meaningful user volume.
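The shape of that cost math is easy to sketch. Every number below is a made-up placeholder, not ElevenLabs pricing and not Yogakosh's real library size, and the model ignores storage, bandwidth, and regenerating edited cues.

```python
import math


def pregeneration_cost(library_chars: int, price_per_char: float) -> float:
    """One-time cost of synthesizing the whole cue library."""
    return library_chars * price_per_char


def on_demand_cost(chars_per_session: int, sessions: int, price_per_char: float) -> float:
    """Cost of synthesizing every cue live, for every session played."""
    return chars_per_session * sessions * price_per_char


def break_even_sessions(library_chars: int, chars_per_session: int) -> int:
    """Sessions after which on-demand has cost at least as much as pre-generating.

    The per-character price cancels out of the comparison, so it is not needed.
    """
    return math.ceil(library_chars / chars_per_session)


# Hypothetical figures, for illustration only.
LIBRARY_CHARS = 200_000       # total characters across all cue scripts
CHARS_PER_SESSION = 3_000     # characters spoken in one flow

sessions_to_break_even = break_even_sessions(LIBRARY_CHARS, CHARS_PER_SESSION)  # 67
```

Under those placeholder numbers, on-demand synthesis overtakes the one-time library cost after 67 sessions in total across all users, and it keeps growing linearly from there while the pre-generated library costs nothing more to play.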

The result is a leaner runtime experience, but a heavier content infrastructure. It's a different kind of complexity — less about what happens when a user presses play, and more about the system that makes that moment possible.

What I'd tell someone else building an audio-first product

1. The qualitative work is real work. Tone, feel, sound — these don't show up in your sprint board, but they take just as long to get right. Budget time for them the same way you budget time for features.

2. Test voice quality early, not at the end. Voice selection sounds like a late-stage polish decision. It isn't. Your choice determines your entire audio architecture. Get it wrong early and you're not just re-recording — you're rebuilding.

3. On-demand generation sounds appealing. Do the cost math first. Runtime synthesis is simpler to build. At any meaningful scale, it's also significantly more expensive. Pre-generating and caching means more infrastructure upfront, but it's the only approach that holds up economically.

4. The voice is the product. If your app is audio-first, a mediocre voice doesn't just hurt the experience — it contradicts your entire value proposition. Spend the money. Find the right one.